Phase 2 Platform Architecture#

This page describes the three shared platform pieces that landed in Stage 15 of the Phase 2 plan and that every subsequent V2 stage (Groups, Mentorship, Chapter Events, Search/Recommendations, Marketplace) consumes. Source of truth for behaviour is plans/BITSAPP-PHASE2-PLAN.md §0; this page is the architectural overview that the plan’s §0.3 DoD item calls for.

As of the Phase 2 GA cut, all three components run unconditionally — the Stage 15-era BITS_PHASE2_ENABLED gate has been removed.


1. Membership facade#

Package: backend/internal/membership/
Django app: schema/memberships/ (table: memberships_membership)

The Phase 1 codebase grew several ad-hoc “is this user in this scope?” checks in feed, calendar, and directory — each with its own join and its own audience semantics. Phase 2 adds two more scope kinds (groups and mentorship programmes), and continuing down that path would have been a bug factory.

The membership facade collapses every membership lookup to a single tuple:

(user_id, scope_type, scope_id, role, state)

with scope_type ∈ {batch, chapter, group, mentorship_programme} and a soft-delete state machine (active | banned | left). Constants live in membership/types.go so callers go through membership.ScopeBatch etc. rather than literal strings.
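
A minimal sketch of what membership/types.go holds, assuming the obvious shapes — membership.ScopeBatch is cited above; the remaining constant names and the struct fields are assumptions that follow the same pattern:

package membership

// Scope kinds: the two Phase 1 scopes plus the two Phase 2 additions.
const (
    ScopeBatch               = "batch"
    ScopeChapter             = "chapter"
    ScopeGroup               = "group"
    ScopeMentorshipProgramme = "mentorship_programme"
)

// Membership states. Soft delete: rows move to left or banned, never vanish.
const (
    StateActive = "active"
    StateBanned = "banned"
    StateLeft   = "left"
)

// Membership mirrors one memberships_membership row.
type Membership struct {
    UserID    string
    ScopeType string
    ScopeID   string
    Role      string
    State     string
}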

Why one table, not one per scope#

The alternative — a separate join table per scope kind — was rejected because:

  • The hot path is “given (user, scope_type, scope_id), is the user in?”. That’s a single index lookup against (user_id, scope_type, scope_id) regardless of cardinality (see the sketch after this list).
  • The Stage 19 search index needs to compute audience tags by joining membership across all scope kinds; a unified table makes this a single SELECT instead of a four-way UNION ALL.
  • Django migrations stay small: one app, one migration per ALTER, not four.
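
For illustration, the probe behind that first bullet might look like this. The table and column names follow the description above; the helper itself and the active-state filter are assumptions, not the real package code:

import (
    "context"
    "database/sql"
)

// isMember sketches the hot-path probe: one lookup satisfied by a
// composite index on (user_id, scope_type, scope_id).
func isMember(ctx context.Context, db *sql.DB, userID, scopeType, scopeID string) (bool, error) {
    const q = `
        SELECT EXISTS (
            SELECT 1 FROM memberships_membership
            WHERE user_id = $1 AND scope_type = $2 AND scope_id = $3
              AND state = 'active'
        )`
    var ok bool
    err := db.QueryRowContext(ctx, q, userID, scopeType, scopeID).Scan(&ok)
    return ok, err
}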

What it doesn’t do#

membership is not an authorization library. It answers IsMember(...), Role(...), and ListByScope(...) against the membership store. Capability checks (can this user create a group? approve a mentorship match?) live in backend/internal/auth/capabilities.go, layered on top.

AddMember is an upsert: it revives a state=left row back to active but leaves a state=banned row untouched, so re-inviting a banned user is an explicit admin action, not an accidental side-effect of re-running a join flow. RemoveMember is a soft delete (state=left) so the audit trail survives.
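
Taken together, the facade surface is small. A sketch of the contract, with signatures inferred from the method names above rather than copied from the package:

package membership

import "context"

// Facade sketches the membership API described in this section; the real
// package may expose these as free functions or methods on a store type.
type Facade interface {
    IsMember(ctx context.Context, userID, scopeType, scopeID string) (bool, error)
    Role(ctx context.Context, userID, scopeType, scopeID string) (string, error)
    ListByScope(ctx context.Context, scopeType, scopeID string) ([]Membership, error)

    // AddMember upserts: revives state=left, but must not silently
    // revive state=banned (unbanning is a separate admin action).
    AddMember(ctx context.Context, m Membership) error

    // RemoveMember is a soft delete: sets state=left, keeps the row.
    RemoveMember(ctx context.Context, userID, scopeType, scopeID string) error
}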

Migration from Phase 1#

The first PR of Stage 16 backfilled memberships_membership from accounts_user.primary_role (for batch scope) and the existing chapter inference. The legacy ad-hoc checks in feed/calendar/directory were rewritten to call membership.IsMember(ctx, userID, membership.ScopeBatch, batchID) in the same PR; no behaviour change, just one line per call site.
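
In spirit, the batch-scope half of that backfill is a single INSERT ... SELECT. A sketch, assuming a batch_id column on accounts_user and the obvious unique key (the real migration also handled the chapter inference):

// Backfill sketch, batch scope only. Column names on accounts_user
// are assumptions; the conflict key matches the hot-path index.
const backfillBatchesSQL = `
    INSERT INTO memberships_membership (user_id, scope_type, scope_id, role, state)
    SELECT u.id, 'batch', u.batch_id, u.primary_role, 'active'
    FROM accounts_user u
    WHERE u.batch_id IS NOT NULL
    ON CONFLICT (user_id, scope_type, scope_id) DO NOTHING`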


2. Background job runner#

Package: backend/internal/jobs/runner/
Django app: schema/jobs_queue/ (table: jobs_queue_job)
Ops entrypoint: adminctl jobs run --queue=<name>

Phase 1 had exactly one async worker: notifications delivery, polling its own table with FOR UPDATE SKIP LOCKED. Phase 2 adds three more async workloads (mentorship matching, search index reconciliation, marketplace deal expiry) and a fourth (recommender batch job) is on deck. Copy-pasting the notifications pattern four times was a non-starter.

The runner package extracts that pattern into a generic Postgres-backed queue. Single table, single state machine:

pending → running → succeeded
                 ↘ failed (re-enqueued with backoff up to MaxAttempts)
                 ↘ dead   (terminal — surfaces in admin queue dashboard)

Why Postgres, not Redis or Kafka#

The RFP scale (180k users, 8–10k/yr growth) does not justify a second stateful service. A Postgres queue with FOR UPDATE SKIP LOCKED claim batches handles the V2 load (~10k mentorship-matching jobs/run, ~200k search-document upserts/run) comfortably on the same Postgres that owns the rest of the data, and keeps “what work is queued?” answerable with a single SELECT. We can swap in Kafka if scale ever warrants it; the Handler interface is the only contract job code is written against, so the queue implementation behind it is replaceable.
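
For reference, the claim step is the standard SKIP LOCKED batch claim. A sketch against jobs_queue_job, with the column names assumed:

// Claim up to $2 due jobs on queue $1. SKIP LOCKED lets concurrent
// workers claim disjoint batches without blocking each other.
const claimSQL = `
    UPDATE jobs_queue_job
    SET state = 'running', started_at = now()
    WHERE id IN (
        SELECT id FROM jobs_queue_job
        WHERE queue = $1 AND state = 'pending' AND run_after <= now()
        ORDER BY run_after
        LIMIT $2
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, kind, payload`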

Handler interface#

// Handler is the callback a worker invokes for each claimed job.
type Handler interface {
    Handle(ctx context.Context, job Job) error
}

Returning a non-nil error triggers retry/backoff; returning nil marks the row succeeded. Handlers MUST be idempotent — the runner guarantees at-least-once, not exactly-once. The mentorship matching handler, for example, is idempotent because matches use ON CONFLICT DO NOTHING on (programme_id, mentor_user_id, mentee_user_id).
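
To make the contract concrete, here is what an idempotent handler might look like. The ON CONFLICT key is the one quoted above; the payload shape, the mentorship_match table name, and Job's Payload field are assumptions:

type matchHandler struct{ db *sql.DB }

type matchPayload struct {
    ProgrammeID  string `json:"programme_id"`
    MentorUserID string `json:"mentor_user_id"`
    MenteeUserID string `json:"mentee_user_id"`
}

func (h matchHandler) Handle(ctx context.Context, job runner.Job) error {
    var p matchPayload
    if err := json.Unmarshal(job.Payload, &p); err != nil {
        return err // retried with backoff, dead once MaxAttempts is hit
    }
    // Safe to re-run: a redelivery hits ON CONFLICT and no-ops, which is
    // exactly the idempotency the at-least-once contract requires.
    _, err := h.db.ExecContext(ctx, `
        INSERT INTO mentorship_match (programme_id, mentor_user_id, mentee_user_id)
        VALUES ($1, $2, $3)
        ON CONFLICT (programme_id, mentor_user_id, mentee_user_id) DO NOTHING`,
        p.ProgrammeID, p.MentorUserID, p.MenteeUserID)
    return err
}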

Two ways to run it#

  • In-process worker: bits-backend starts a default worker on boot. This is the production path.
  • Out-of-process: adminctl jobs run --queue=<name> runs a worker against a specific queue name. Useful for ops backfills, single-run debugging, or pinning a heavy queue (recommender) to a separate node.

Both share the same dispatch table, registered via runner.RegisterHandler(kind, handler) at startup.
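
Registration at startup might then read as follows; runner.RegisterHandler is the real entry point named above, while the kind string and handler value carry over from the sketch in the previous section:

// One registration per job kind; the other queues (search reconcile,
// marketplace expiry, recommender) register here the same way. Both the
// in-process worker and `adminctl jobs run` dispatch through this table.
func registerJobs(db *sql.DB) {
    runner.RegisterHandler("mentorship.match", matchHandler{db: db})
}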


3. Search infrastructure#

Package: backend/internal/search/
Django app: schema/search/

Stage 15 ships the storage and indexing primitives; Stage 19 layers the RPC surface and the recommender job on top. The split is intentional — we wanted Stages 16/17/20 to be able to populate the index in any order against a stable storage contract, without waiting for the read-side API to land.

Three primitives#

  1. Indexer (index.go) — owns writes against search_document. Upsert(doc), Delete(surface, entity_id), BulkUpsert([]doc). The indexer encapsulates the setweight policy (setweight(to_tsvector('english', title), 'A') || setweight(..., body, 'B')) so callers don’t have to know that titles outrank body hits at search time. The full upsert is sketched after this list.
  2. Searcher (search.go) — read-side. plainto_tsquery against the merged index, with an audience_tags && $N::text[] overlap filter computed from the caller’s roles and memberships, keyset-paginated by (rank, id).
  3. SignalRecorder / CandidateStore (signals.go, recommender.go) — append-only signal log + replace-set candidate store, both consumed by the Stage 19 recommender batch job.
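
The statement behind Indexer.Upsert (the sketch promised in item 1) is roughly the following; the conflict key and the tsv column name are assumptions, the setweight policy is the quoted one:

// Indexer.Upsert sketch: 'A'-weighted title outranks 'B'-weighted body
// at query time. (surface, entity_id) as the conflict key is assumed.
const upsertDocSQL = `
    INSERT INTO search_document (surface, entity_id, title, body, audience_tags, tsv)
    VALUES ($1, $2, $3, $4, $5,
        setweight(to_tsvector('english', $3), 'A') ||
        setweight(to_tsvector('english', $4), 'B'))
    ON CONFLICT (surface, entity_id) DO UPDATE SET
        title         = EXCLUDED.title,
        body          = EXCLUDED.body,
        audience_tags = EXCLUDED.audience_tags,
        tsv           = EXCLUDED.tsv`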

Surfaces#

Seven surface name constants (search/types.go), each a string key on search_document.surface:

  • directory — people search.
  • feed — Phase 1 social feed posts.
  • jobs — alumni job board entries.
  • groups — interest groups + alumni chapters (Stage 16).
  • events — calendar events (chapter + campus).
  • announcements — CMS announcements (Stage 18).
  • marketplace — partner deals (Stage 20).

Adding a surface is a two-file change: the Go constant in search/types.go and the Django SURFACE_CHOICES enum in schema/search/models.py. The pair is kept in sync by hand; no codegen yet because the set rarely changes.
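
The Go half of the pair is plain string constants. A sketch — the seven surface strings are the ones above, the constant names are assumed:

package search

// Surface names. Must stay in step with SURFACE_CHOICES in
// schema/search/models.py; the pairing is kept in sync by hand.
const (
    SurfaceDirectory     = "directory"
    SurfaceFeed          = "feed"
    SurfaceJobs          = "jobs"
    SurfaceGroups        = "groups"
    SurfaceEvents        = "events"
    SurfaceAnnouncements = "announcements"
    SurfaceMarketplace   = "marketplace"
)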

Audience tags#

search_document.audience_tags is a text[] column populated freeform by each writer, conventionally formatted as <scope>:<value>:

  • alumni, students, faculty, staff — role-only scoping.
  • batch:2018, batch:2024.
  • chapter:<group_uuid>, group:<group_uuid>.
  • institute — anyone with a verified BITS account.

Searcher.Search applies && $N::text[] overlap against the caller’s computed tag set. The format is a convention, not a constraint — the only hard requirement is symmetry between the writer’s tag values and the audience-tag computation in the search request path.
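
Put together, the first page of a query reads roughly like this. Column names and the rank expression are assumptions, and the keyset predicate for pages after the first is omitted:

// Searcher.Search sketch: tsquery match, audience overlap filter,
// rank-ordered. $2 is the caller's computed tag set.
const searchSQL = `
    SELECT id, surface, entity_id, ts_rank(tsv, q) AS rank
    FROM search_document, plainto_tsquery('english', $1) AS q
    WHERE tsv @@ q
      AND audience_tags && $2::text[]
    ORDER BY rank DESC, id
    LIMIT $3`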

Why no Elasticsearch#

We considered it. The driver isn’t query expressiveness — it’s index freshness. Stage 19 needs writes to be visible to search the moment a mutating RPC commits, so callers can immediately see their newly-created group or post in search results. That’s a transactional guarantee Postgres gives for free (the index upsert sits inside the same transaction as the row write); replicating it across Elasticsearch would need an outbox + reconciler pattern, which is more moving parts than a 180k-user dataset warrants.

We can revisit if production shows multi-language tokeniser needs or sub-50ms p95 ranked-retrieval requirements that Postgres tsvector can’t hit. Neither is on the V2 timeline.


How the three fit together#

A typical V2 mutating flow — say, “a user creates a group post” — exercises all three:

  1. The handler calls membership.IsMember(ctx, userID, membership.ScopeGroup, groupID) to authorize the write.
  2. The post row is inserted, and inside the same transaction the search/Indexer.Upsert writes a search_document row so the post is immediately searchable.
  3. After commit, the handler calls runner.Enqueue(...) to queue the notification fan-out (so the slow Meta WhatsApp call doesn’t block the RPC reply).
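
A sketch of that flow in handler code. membership.IsMember, Indexer.Upsert, and runner.Enqueue are the pieces named above; the service shape, CreatePostInput, insertPost, ErrNotGroupMember, the search.Document fields, and the Enqueue signature are all illustrative:

func (s *groupService) CreatePost(ctx context.Context, in CreatePostInput) error {
    // 1. Membership check before any write.
    ok, err := s.members.IsMember(ctx, in.UserID, membership.ScopeGroup, in.GroupID)
    if err != nil {
        return err
    }
    if !ok {
        return ErrNotGroupMember
    }

    // 2. Post insert and search upsert share one transaction, so the
    //    post is searchable the instant the transaction commits.
    tx, err := s.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    postID, err := insertPost(ctx, tx, in) // assumed helper
    if err != nil {
        return err
    }
    if err := s.indexer.Upsert(ctx, tx, search.Document{
        Surface:      search.SurfaceGroups,
        EntityID:     postID,
        Title:        in.Title,
        Body:         in.Body,
        AudienceTags: []string{"group:" + in.GroupID},
    }); err != nil {
        return err
    }
    if err := tx.Commit(); err != nil {
        return err
    }

    // 3. Post-commit only: the slow WhatsApp fan-out goes to the queue.
    payload, err := json.Marshal(map[string]string{"post_id": postID})
    if err != nil {
        return err
    }
    return runner.Enqueue(ctx, s.db, runner.Job{
        Queue:   "notifications",
        Kind:    "notify.group_post",
        Payload: payload,
    })
}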

Audit logging and structured telemetry are layered above all three — those are interceptor-level concerns shared with Phase 1 and not specific to the Stage 15 additions.