Write Path (The Daemon)¶
The daemon (python -m app.pipeline) is the only thing in the system that writes.
It's one blocking loop, and every pass through it runs the whole pipeline start to
finish.
Tick Order¶
flowchart LR
a["1 · Reset check"] --> b["2 · Roster<br/>+ backfill"] --> c["3 · Drain<br/>ingest"]
c --> d["4 · Recompute<br/>dirty workers"] --> e["5 · Evaluate<br/>triggers"] --> f["6 · Adaptive<br/>sleep"]
f -. "next tick" .-> a
Each tick moves through these stages in order:
- Reset check. Look at the
meta.reset_requested_atflag, and if it changed, drop the in-memory workers. - Roster and backfill. Reconcile the tracked allowlist and backfill any student who was just added.
- Drain (ingest). Pull everything new since the cursor and persist it idempotently.
- Recompute dirty workers. Re-run inference once per student who got events this tick.
- Evaluate triggers. One sweep over all students to open and resolve intervention flags.
- Adaptive sleep. Wait for the current poll interval, with idle backoff applied.
Client And Polling¶
The client is a normal authenticated REST client (token auth, a keep-alive session, re-auth on a 401). It has two backoffs that do different jobs:
- Idle backoff. 0.5s when things are active, growing up to
PIPELINE_IDLE_MAX(5s) when nothing's happening. Any activity resets it. This is what keeps load off prod. - Failure backoff. Exponential up to 30s when requests error out, and it logs
UNHEALTHYafter five failures in a row. This is just resilience.
Tip
Poll load tracks event volume, not how many students you're tracking. Thanks to the backoff, a quiet cohort barely touches prod no matter how big the roster is.
Cursor And Idempotency¶
This is the part that makes a restart lossless, and it's the most important bit of correctness machinery in the whole thing.
- The cursor is a timestamp (
last_event_time) pluslast_source_id. - Each drain pages prod with
dateFrom = last_event_time - overlap, where overlap is a 2-second window, so events sitting right on a timestamp boundary don't slip through. - It persists, then advances. The cursor only moves after a full drain is safely written.
- Inserts are idempotent. Every event has a unique
source_event_id, so re-fetched overlap events get dropped (there's an existence check, plus aUNIQUEconstraint to catch races).
Put it together and a crash mid-drain is a non-event: on restart it just re-fetches the overlap and de-dupes. At-least-once delivery plus dedup gives you effectively-once processing, with nothing lost.
Roster Allowlist And Backfill¶
The daemon only ingests and computes students on the tracked_student allowlist.
When you add a student, that kicks off a one-time backfill of their recent history
(separate from the cursor) so their card fills in within a tick or two.
Per-Student Workers¶
Every tracked student gets a StudentWorker that holds a rolling
deque(maxlen=5000) of recent events.
- Debounced recompute. A
dirtyflag means a worker recomputes once per tick, no matter how many events landed. - HMM re-decode only on a new run. The HMM's unit is the
runProject, so whenhad_new_runis false, non-run events just reuse the cached decoding. - Rehydrate on cold start. If a worker is missing, it reloads its tail from
vex_log(the one SQL read on the hot path). In-memory state is lost on restart, but it's reconstructed straight from the log.
Inference¶
compute_strategy_states runs once per runProject:
- Extract the block AST. Parse the student's current blocks into a tree.
- Compute
change_score. APTED tree-edit-distance between this run and the last one, with a hashed-pair cache so you don't recompute the same comparison. - Bucket and decode. Bucket the score, then feed the HMM (
model.pkl, loaded lazily) to get a latent state: iterator, explorer, or stuck.
flowchart LR
run["runProject"] --> ast["Block AST"]
ast --> cs["change_score<br/>APTED edit-distance<br/>vs previous run"]
cs --> bucket["bucket"]
bucket --> hmm["HMM decode<br/>model.pkl"]
hmm --> state["Latent state<br/>iterator · explorer · stuck"]
The HMM moves a student between those three strategy states run to run, and the stuck state is exactly what the wheel-spin flag watches:
stateDiagram-v2
direction LR
Iterator: Iterator (steady, incremental)
Explorer: Explorer (structural changes)
Stuck: Stuck (wheel-spinning)
Iterator --> Explorer
Explorer --> Iterator
Explorer --> Stuck
Stuck --> Explorer
Iterator --> Stuck
Stuck --> Iterator
On top of strategy, every tick also segments the session into episodes (that's the
vendored, dependency-free app/episode_engine package) and builds a "playground" LLM
prompt describing the current blocks. The Read path page covers how
all of this surfaces.
Triggers¶
Triggers run as a per-tick sweep, with their lifecycle stored in trigger_event.
There are two kinds:
- Sustained (wheel-spin, inactive). Open while the condition holds, resolve when
it clears. For wheel-spin,
started_atis the timestamp of the first run in the current stuck streak, not the tick that happened to notice it, so the alert's age matches what the student actually went through. - Momentary (big-rewrite). Fires once per qualifying run, deduped with
json_extract(detail,'$.run_index'). It's raised straight from the worker the instant a run is decoded, and the dedupe set is seeded from the database on rehydrate, so a backfill or a restart can't quietly drop or repeat alerts for runs in the middle.
A sustained trigger moves through this lifecycle (re-alert is covered just below):
stateDiagram-v2
direction LR
[*] --> Open: condition starts
Open --> Resolved: condition clears
Open --> Acked: researcher acks
Acked --> Resolved: condition clears
Acked --> Open: still holding after RE_ALERT_SECONDS
Resolved --> [*]
Same Model, Opposite Sides
Wheel-spinning reads the HMM output (current_state == 2), while big-rewrite
reads the raw change_score, which is the HMM's input feature, with its own
threshold of 0.5. They sit on opposite sides of the model.
Re-Alert On Persistent Conditions¶
Acking a sustained trigger doesn't silence it forever. If the condition keeps holding
for another RE_ALERT_SECONDS (10 minutes) past the acked row's started_at, the
evaluator closes that row and opens a fresh, unacked one. So a student who genuinely
stays stuck keeps coming back to the feed instead of disappearing after a single ack.