Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (`read_sql_df_slow_iter`), admission control (50K CID limit for non-MSD), cache skip for large queries, early memory release with `gc.collect()`
2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs, separate worker process with isolated memory, frontend polling via a `useTraceProgress` composable, systemd service + deploy scripts
3. trace-streaming-response: chunked Redis storage (`TRACE_STREAM_BATCH_SIZE=5000`), NDJSON stream endpoint (`GET /api/trace/job/<id>/stream`), frontend ReadableStream consumer for progressive rendering, backward-compatible with legacy single-key storage

All three proposals archived. 1101 tests pass; the frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
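The chunked-storage half of proposal 3 can be sketched as follows. This is a minimal illustration, not the project's implementation: a plain `dict` stands in for Redis, and the key layout (`trace:job:<id>:chunk:<n>`), function names, and `batch_size` parameter are all assumptions; only `TRACE_STREAM_BATCH_SIZE=5000` comes from the text above.

```python
import json
from typing import Any, Dict, Iterator, List

TRACE_STREAM_BATCH_SIZE = 5000  # value from the proposal; assumed env-configurable in the real code

def store_chunked(store: Dict[str, str], job_id: str, rows: List[Dict[str, Any]],
                  batch_size: int = TRACE_STREAM_BATCH_SIZE) -> int:
    """Write results as numbered chunk keys instead of one large single-key value."""
    n_chunks = 0
    for i in range(0, len(rows), batch_size):
        store[f"trace:job:{job_id}:chunk:{n_chunks}"] = json.dumps(rows[i:i + batch_size])
        n_chunks += 1
    store[f"trace:job:{job_id}:chunks"] = str(n_chunks)
    return n_chunks

def stream_ndjson(store: Dict[str, str], job_id: str) -> Iterator[str]:
    """Yield one NDJSON line per row, holding only one chunk in memory at a time."""
    n_chunks = int(store.get(f"trace:job:{job_id}:chunks", "0"))
    for c in range(n_chunks):
        for row in json.loads(store[f"trace:job:{job_id}:chunk:{c}"]):
            yield json.dumps(row) + "\n"
```

A stream endpoint would wrap `stream_ndjson` in a chunked HTTP response, letting the frontend's ReadableStream consumer render rows as they arrive.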
MODIFIED Requirements
Requirement: EventFetcher SHALL use streaming fetch for batch queries
`EventFetcher._fetch_batch` SHALL use `read_sql_df_slow_iter` (fetchmany-based iterator) instead of `read_sql_df` (fetchall + DataFrame) to reduce peak memory usage.
Scenario: Batch query memory optimization
- WHEN EventFetcher executes a batch query for a domain
- THEN the query SHALL use `cursor.fetchmany(batch_size)` (env: `DB_SLOW_FETCHMANY_SIZE`, default: 5000) instead of `cursor.fetchall()`
- THEN rows SHALL be converted directly to dicts via `dict(zip(columns, row))` without building a DataFrame
- THEN each fetchmany batch SHALL be grouped into the result dict immediately, allowing the batch rows to be garbage collected
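The scenario above can be sketched as a small fetchmany-based iterator. This is an illustrative shape for what `read_sql_df_slow_iter` does per this spec, not the real helper: `sqlite3` stands in for the actual database driver, and `iter_rows_slow` / `fetch_events_grouped` are hypothetical names.

```python
import os
import sqlite3
from typing import Any, Dict, Iterator, List, Optional

def iter_rows_slow(cursor, batch_size: Optional[int] = None) -> Iterator[Dict[str, Any]]:
    """Yield rows as dicts via cursor.fetchmany(), never materializing the full result set."""
    if batch_size is None:
        batch_size = int(os.environ.get("DB_SLOW_FETCHMANY_SIZE", "5000"))
    columns = [d[0] for d in cursor.description]
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield dict(zip(columns, row))  # direct dict conversion, no DataFrame

def fetch_events_grouped(conn, sql: str,
                         batch_size: Optional[int] = None) -> Dict[str, List[Dict[str, Any]]]:
    """Group streamed rows by CONTAINERID immediately so each batch can be garbage collected."""
    grouped: Dict[str, List[Dict[str, Any]]] = {}
    for row in iter_rows_slow(conn.execute(sql), batch_size):
        grouped.setdefault(row["CONTAINERID"], []).append(row)
    return grouped
```

Because each fetchmany batch is folded into `grouped` before the next one is fetched, peak memory is bounded by one batch plus the accumulated result, rather than the full rowset plus a DataFrame copy.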
Scenario: Existing API contract preserved
- WHEN EventFetcher.fetch_events() returns results
- THEN the return type SHALL remain `Dict[str, List[Dict[str, Any]]]` (grouped by CONTAINERID)
- THEN the result SHALL be identical to the previous DataFrame-based implementation
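One way to check this contract is to compare the streamed grouping against a naive fetchall-based grouping over the same query and assert equality. A minimal sketch, with `sqlite3` standing in for the real driver and both function names being illustrative:

```python
import sqlite3
from typing import Any, Dict, List

def group_fetchall(cursor) -> Dict[str, List[Dict[str, Any]]]:
    """Baseline path: materialize every row up front, then group."""
    columns = [d[0] for d in cursor.description]
    grouped: Dict[str, List[Dict[str, Any]]] = {}
    for row in cursor.fetchall():
        rec = dict(zip(columns, row))
        grouped.setdefault(rec["CONTAINERID"], []).append(rec)
    return grouped

def group_fetchmany(cursor, batch_size: int = 2) -> Dict[str, List[Dict[str, Any]]]:
    """Streaming path: group one fetchmany batch at a time."""
    columns = [d[0] for d in cursor.description]
    grouped: Dict[str, List[Dict[str, Any]]] = {}
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            rec = dict(zip(columns, row))
            grouped.setdefault(rec["CONTAINERID"], []).append(rec)
    return grouped
```

Since both paths see rows in the same order and group by the same key, the resulting dicts (including per-CID row order) must compare equal, which is exactly the "identical to the previous implementation" clause above.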