Phase 2 Platform Architecture#
This page describes the three shared platform pieces that landed in Stage 15
of the Phase 2 plan and that every subsequent V2 stage (Groups, Mentorship,
Chapter Events, Search/Recommendations, Marketplace) consumes. Source of
truth for behaviour is plans/BITSAPP-PHASE2-PLAN.md §0; this page is the
architectural overview that the plan’s §0.3 DoD item calls for.
As of the Phase 2 GA cut, all three components run unconditionally — the
Stage 15-era BITS_PHASE2_ENABLED gate has been removed.
1. Membership facade#
Package: backend/internal/membership/
Django app: schema/memberships/ (table: memberships_membership)
The Phase 1 codebase grew several ad-hoc “is this user in this scope?” checks
in feed, calendar, and directory — each with its own join and its own
audience semantics. Phase 2 adds two more scope kinds (groups and mentorship
programmes), and continuing down that path was a bug factory waiting to
happen.
The membership facade collapses every membership lookup to a single tuple:
(user_id, scope_type, scope_id, role, state)

with scope_type ∈ {batch, chapter, group, mentorship_programme} and a
soft-delete state machine (active | banned | left). Constants live in
membership/types.go so callers go through membership.ScopeBatch etc.
rather than literal strings.
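For orientation, a sketch of those constants. Only membership.ScopeBatch
is named on this page; every other identifier below is an assumed name:

// membership/types.go, sketched. Values come from the tuple above;
// identifier names other than ScopeBatch are assumptions.
const (
	ScopeBatch               = "batch"
	ScopeChapter             = "chapter"
	ScopeGroup               = "group"
	ScopeMentorshipProgramme = "mentorship_programme"
)

const (
	StateActive = "active"
	StateBanned = "banned"
	StateLeft   = "left"
)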
Why one table, not one per scope#
The alternative — a separate join table per scope kind — was rejected because:
- The hot path is “given (user, scope_type, scope_id), is the user in?”.
  That’s a single index lookup against (user_id, scope_type, scope_id),
  regardless of cardinality (see the sketch after this list).
- The Stage 19 search index needs to compute audience tags by joining
  membership across all scope kinds; a unified table makes this a single
  SELECT instead of a six-way UNION ALL.
- Django migrations stay small — one app, one migration per ALTER, not six.
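For concreteness, here is roughly what that hot-path check reduces to.
The SQL, the Store type, and the state filter are assumptions layered on
the tuple above:

package membership

import (
	"context"
	"database/sql"
)

// Store is a hypothetical wrapper over the shared Postgres pool.
type Store struct{ db *sql.DB }

// One probe of the (user_id, scope_type, scope_id) index; the state
// filter keeps banned and departed users reading as "not in".
const isMemberSQL = `
SELECT EXISTS (
  SELECT 1
  FROM memberships_membership
  WHERE user_id = $1 AND scope_type = $2 AND scope_id = $3
    AND state = 'active'
)`

func (s *Store) IsMember(ctx context.Context, userID, scopeType, scopeID string) (bool, error) {
	var ok bool
	err := s.db.QueryRowContext(ctx, isMemberSQL, userID, scopeType, scopeID).Scan(&ok)
	return ok, err
}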
What it doesn’t do#
membership is not an authorization library. It answers IsMember(...),
Role(...), and ListByScope(...) against the membership store. Capability checks
(can this user create a group? approve a mentorship match?) live in
backend/internal/auth/capabilities.go, layered on top.
AddMember is an upsert that resets state to active — so re-inviting a
banned user is an explicit admin action, not an accidental side-effect of
re-running a join flow. RemoveMember is a soft delete (state=left) so
the audit trail survives.
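Taken together, the facade's surface looks roughly like this. The method
names and their semantics come from this page; the exact signatures are
assumed:

import "context"

// Membership mirrors the five-column tuple.
type Membership struct {
	UserID    string
	ScopeType string
	ScopeID   string
	Role      string
	State     string
}

// Service is an assumed shape for the facade.
type Service interface {
	IsMember(ctx context.Context, userID, scopeType, scopeID string) (bool, error)
	Role(ctx context.Context, userID, scopeType, scopeID string) (string, error)
	ListByScope(ctx context.Context, scopeType, scopeID string) ([]Membership, error)

	// Upsert that resets state to active: re-inviting a banned user is
	// deliberate, never a side-effect of re-running a join flow.
	AddMember(ctx context.Context, userID, scopeType, scopeID, role string) error

	// Soft delete (state=left) so the audit trail survives.
	RemoveMember(ctx context.Context, userID, scopeType, scopeID string) error
}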
Migration from Phase 1#
The first PR of Stage 16 backfilled memberships_membership from
accounts_user.primary_role (for batch scope) and the existing chapter
inference. The legacy ad-hoc checks in feed/calendar/directory were
rewritten to call membership.IsMember(ctx, userID, "batch", batchID) in
the same PR — no behaviour change, just one line per call site.
2. Background job runner#
Package: backend/internal/jobs/runner/
Django app: schema/jobs_queue/ (table: jobs_queue_job)
Ops entrypoint: adminctl jobs run --queue=<name>
Phase 1 had exactly one async worker: notifications delivery, polling its
own table with FOR UPDATE SKIP LOCKED. Phase 2 adds three more async
workloads (mentorship matching, search index reconciliation, marketplace
deal expiry) and a fourth (recommender batch job) is on deck. Copy-pasting
the notifications pattern four times was a non-starter.
The runner package extracts that pattern into a generic Postgres-backed
queue. Single table, single state machine:
pending → running → succeeded
                  ↘ failed (re-enqueued with backoff up to MaxAttempts)
                  ↘ dead (terminal — surfaces in admin queue dashboard)

Why Postgres, not Redis or Kafka#
The RFP scale (180k users, 8–10k/yr growth) does not justify a second
stateful service. A Postgres queue with FOR UPDATE SKIP LOCKED claim
batches handles the V2 load (~10k mentorship-matching jobs/run, ~200k
search-document upserts/run) comfortably on the same Postgres that owns
the rest of the data — and keeps “what work is queued?” answerable with a
single SELECT. We can swap to Kafka if scale warrants; the Handler
interface is the only thing that pins us today.
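The claim step is the standard SKIP LOCKED batch grab. A sketch, with
column names (queue, run_after, started_at, kind, payload) assumed:

// Flips up to $2 pending jobs in one queue to running in a single
// statement. FOR UPDATE SKIP LOCKED lets concurrent workers claim
// disjoint batches without blocking each other; run_after is an assumed
// column used for retry backoff scheduling.
const claimSQL = `
UPDATE jobs_queue_job
SET state = 'running', started_at = now()
WHERE id IN (
  SELECT id
  FROM jobs_queue_job
  WHERE queue = $1 AND state = 'pending' AND run_after <= now()
  ORDER BY id
  LIMIT $2
  FOR UPDATE SKIP LOCKED
)
RETURNING id, kind, payload`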
Handler interface#
type Handler interface {
Handle(ctx context.Context, job Job) error
}

Returning a non-nil error triggers retry/backoff; returning nil marks the
row succeeded. Handlers MUST be idempotent — the runner guarantees
at-least-once, not exactly-once. The mentorship matching handler, for
example, is idempotent because matches use ON CONFLICT DO NOTHING on
(programme_id, mentor_user_id, mentee_user_id).
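The idiom, sketched (the table name and plumbing are assumptions; the
conflict key is the one quoted above):

import (
	"context"
	"database/sql"
)

// matchWriter shows the idempotency idiom: a replayed job re-issues the
// same INSERT and the unique key turns the duplicate into a no-op.
type matchWriter struct{ db *sql.DB }

func (w matchWriter) writeMatch(ctx context.Context, programmeID, mentorID, menteeID string) error {
	_, err := w.db.ExecContext(ctx, `
		INSERT INTO mentorship_match (programme_id, mentor_user_id, mentee_user_id)
		VALUES ($1, $2, $3)
		ON CONFLICT (programme_id, mentor_user_id, mentee_user_id) DO NOTHING`,
		programmeID, mentorID, menteeID)
	return err
}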
Two ways to run it#
- In-process worker: bits-backend starts a default worker on boot. This
  is the production path.
- Out-of-process: adminctl jobs run --queue=<name> runs a worker against
  a specific queue name. Useful for ops backfills, single-run debugging,
  or pinning a heavy queue (recommender) to a separate node.
Both share the same dispatch table, registered via
runner.RegisterHandler(kind, handler) at startup.
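Registration is one line per job kind at boot; the kind strings below are
illustrative placeholders, not the real registry:

// Called from bits-backend's startup path (and adminctl's) so both run
// against the same dispatch table.
func registerJobHandlers(match, reconcile, expiry runner.Handler) {
	runner.RegisterHandler("mentorship_matching", match)
	runner.RegisterHandler("search_reconcile", reconcile)
	runner.RegisterHandler("marketplace_deal_expiry", expiry)
}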
3. Search infrastructure#
Package: backend/internal/search/
Django app: schema/search/
Stage 15 ships the storage and indexing primitives; Stage 19 layers the RPC surface and the recommender job on top. The split is intentional — we wanted Stages 16/17/20 to be able to populate the index in any order against a stable storage contract, without waiting for the read-side API to land.
Three primitives#
- Indexer (index.go) — owns writes against search_document: Upsert(doc),
  Delete(surface, entity_id), BulkUpsert([]doc). The indexer encapsulates
  the setweight policy (setweight(to_tsvector('english', title), 'A') ||
  setweight(..., body, 'B')) so callers don’t have to know that titles
  outrank body hits at search time; a sketch follows this list.
- Searcher (search.go) — read-side: plainto_tsquery against the merged
  index, with an audience_tags && $N::text[] overlap filter computed from
  the caller’s roles and memberships, keyset-paginated by (rank, id).
- SignalRecorder / CandidateStore (signals.go, recommender.go) —
  append-only signal log plus replace-set candidate store, both consumed
  by the Stage 19 recommender batch job.
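A sketch of the write the Indexer encapsulates. The column list and
conflict target are assumptions; the setweight policy is the one quoted
in the list above:

// Upsert keyed on (surface, entity_id), matching the Delete signature.
// Title text gets weight 'A' and body 'B', so title hits outrank body
// hits at query time.
const upsertDocumentSQL = `
INSERT INTO search_document (surface, entity_id, audience_tags, tsv)
VALUES ($1, $2, $3::text[],
        setweight(to_tsvector('english', $4), 'A') ||
        setweight(to_tsvector('english', $5), 'B'))
ON CONFLICT (surface, entity_id) DO UPDATE
SET audience_tags = EXCLUDED.audience_tags,
    tsv = EXCLUDED.tsv`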
Surfaces#
Seven surface name constants (search/types.go), each a string key on
search_document.surface:
- directory — people search.
- feed — Phase 1 social feed posts.
- jobs — alumni job board entries.
- groups — interest groups + alumni chapters (Stage 16).
- events — calendar events (chapter + campus).
- announcements — CMS announcements (Stage 18).
- marketplace — partner deals (Stage 20).
Adding a surface is a two-file change: the Go constant in
search/types.go and the Django SURFACE_CHOICES enum in
schema/search/models.py. The pair is checked by hand — no codegen yet
because the set rarely changes.
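For reference, the Go half might look like this. The constant identifiers
are assumed names; the seven string values are the ones listed above:

// search/types.go, sketched.
const (
	SurfaceDirectory     = "directory"
	SurfaceFeed          = "feed"
	SurfaceJobs          = "jobs"
	SurfaceGroups        = "groups"
	SurfaceEvents        = "events"
	SurfaceAnnouncements = "announcements"
	SurfaceMarketplace   = "marketplace"
)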
Audience tags#
search_document.audience_tags is a text[] column populated freeform by
each writer, conventionally formatted as <scope>:<value>:
- alumni, students, faculty, staff — role-only scoping.
- batch:2018, batch:2024.
- chapter:<group_uuid>, group:<group_uuid>.
- institute — anyone with a verified BITS account.
Searcher.Search applies && $N::text[] overlap against the caller’s
computed tag set. The format is a convention, not a constraint — the only
hard requirement is symmetry between the writer’s tag values and the
audience-tag computation in the search request path.
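A sketch of the request-path half of that symmetry. The Caller type and
its fields are inventions for illustration; the tag formats are the
conventions above:

// audienceTags builds the caller's tag set; Searcher.Search then keeps
// a document iff its audience_tags overlaps this set (the && filter).
type Caller struct {
	Role       string   // "alumni", "students", "faculty", or "staff"
	BatchYear  string   // e.g. "2018"
	ChapterIDs []string // chapter membership UUIDs
	GroupIDs   []string // group membership UUIDs
}

func audienceTags(c Caller) []string {
	tags := []string{"institute", c.Role}
	if c.BatchYear != "" {
		tags = append(tags, "batch:"+c.BatchYear)
	}
	for _, id := range c.ChapterIDs {
		tags = append(tags, "chapter:"+id)
	}
	for _, id := range c.GroupIDs {
		tags = append(tags, "group:"+id)
	}
	return tags
}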
Why no Elasticsearch#
We considered it. The driver isn’t query expressiveness — it’s index freshness. Stage 19 needs writes to be visible to search the moment a mutating RPC commits, so callers can immediately see their newly-created group or post in search results. That’s a transactional guarantee Postgres gives for free (the index upsert sits inside the same transaction as the row write); replicating it across Elasticsearch would need an outbox + reconciler pattern, which is more moving parts than a 180k-user dataset warrants.
We can revisit when the fleet sees evidence of multi-language tokeniser needs or sub-50ms p95 ranked retrieval requirements that Postgres tsvector can’t hit. Neither is on the V2 timeline.
How the three fit together#
A typical V2 mutating flow — say, “a user creates a group post” — exercises all three:
- The handler calls membership.IsMember(ctx, userID, "group", groupID)
  to authorize the write.
- The post row is inserted, and inside the same transaction the
  search/Indexer.Upsert writes a search_document row so the post is
  immediately searchable.
- After commit, the handler calls runner.Enqueue(...) to queue the
  notification fan-out, so the slow Meta WhatsApp call doesn’t block the
  RPC reply. (The sketch below walks through the whole flow.)
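Stitched into one handler, a sketch of that flow. membership.IsMember,
Indexer.Upsert, and runner.Enqueue are the pieces this page names; the
Deps bundle, the interface signatures, the queue name, and the table are
placeholders:

package groups

import (
	"context"
	"database/sql"
	"errors"
)

// Minimal stand-ins for the three platform pieces.
type MembershipService interface {
	IsMember(ctx context.Context, userID, scopeType, scopeID string) (bool, error)
}

type SearchIndexer interface {
	// Upsert accepts the caller's tx so the document commits atomically
	// with the row it indexes.
	Upsert(ctx context.Context, tx *sql.Tx, surface, entityID, title, body string) error
}

type JobEnqueuer interface {
	Enqueue(ctx context.Context, queue string, payload []byte) error
}

type Deps struct {
	DB      *sql.DB
	Members MembershipService
	Index   SearchIndexer
	Jobs    JobEnqueuer
}

func createGroupPost(ctx context.Context, d Deps, userID, groupID, title, body string) error {
	// 1. Membership facade authorizes the write.
	ok, err := d.Members.IsMember(ctx, userID, "group", groupID)
	if err != nil {
		return err
	}
	if !ok {
		return errors.New("not a group member")
	}

	// 2. Post row and search document commit in one transaction, so the
	// post is searchable the instant the commit lands.
	tx, err := d.DB.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	var postID string
	if err := tx.QueryRowContext(ctx,
		`INSERT INTO group_post (group_id, author_id, title, body)
		 VALUES ($1, $2, $3, $4) RETURNING id`,
		groupID, userID, title, body).Scan(&postID); err != nil {
		return err
	}
	if err := d.Index.Upsert(ctx, tx, "groups", postID, title, body); err != nil {
		return err
	}
	if err := tx.Commit(); err != nil {
		return err
	}

	// 3. After commit: fan the notification out via the job queue so the
	// slow WhatsApp call never blocks the RPC reply.
	return d.Jobs.Enqueue(ctx, "notifications", []byte(postID))
}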
Audit logging and structured telemetry are layered above all three — those are interceptor-level concerns shared with Phase 1 and not specific to the Stage 15 additions.