Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (`read_sql_df_slow_iter`), admission control (50K CID limit for non-MSD), cache skip for large queries, early memory release with `gc.collect()`
2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs, separate worker process with isolated memory, frontend polling via `useTraceProgress` composable, systemd service + deploy scripts
3. trace-streaming-response: chunked Redis storage (`TRACE_STREAM_BATCH_SIZE=5000`), NDJSON stream endpoint (`GET /api/trace/job/<id>/stream`), frontend ReadableStream consumer for progressive rendering, backward-compatible with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Admission Control (profile-aware)
- 1.1 Add `TRACE_EVENTS_CID_LIMIT` env var (default 50000) to `trace_routes.py`
- 1.2 Add a CID count check in the `events()` endpoint: for non-MSD profiles, if `len(container_ids) > TRACE_EVENTS_CID_LIMIT`, return HTTP 413 with `{ "code": "CID_LIMIT_EXCEEDED", "cid_count": N, "limit": M }`
- 1.3 For the MSD profile: bypass the CID hard limit, but log a warning when the CID count > 50000
- 1.4 Add unit tests: non-MSD CID > limit → 413; MSD CID > limit → proceeds normally
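The admission check in 1.2–1.3 can be sketched as a small helper. The env var name, limit default, and 413 response shape come from the plan above; the helper name `check_cid_admission` and the `"MSD"` profile string comparison are illustrative assumptions:

```python
import os

# Default mirrors the proposal: 50000 CIDs for non-MSD profiles.
TRACE_EVENTS_CID_LIMIT = int(os.environ.get("TRACE_EVENTS_CID_LIMIT", "50000"))


def check_cid_admission(container_ids, profile, limit=TRACE_EVENTS_CID_LIMIT):
    """Return (body, status) for a rejected request, or None to admit it.

    Sketch only: in the real endpoint the body would go through
    Flask's jsonify before being returned.
    """
    cid_count = len(container_ids)
    if profile != "MSD" and cid_count > limit:
        body = {"code": "CID_LIMIT_EXCEEDED", "cid_count": cid_count, "limit": limit}
        return body, 413
    # MSD (or under-limit) requests are admitted; a warning log for
    # oversized MSD requests would be emitted here (1.3).
    return None
```

The endpoint would call this first and short-circuit on a non-`None` result, keeping the limit logic unit-testable without a request context.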
2. Batch Fetch (fetchmany) in database.py
- 2.1 Add a `read_sql_df_slow_iter(sql, params, timeout_seconds, batch_size)` generator function to `database.py` that yields `(columns, rows)` tuples using `cursor.fetchmany(batch_size)`
- 2.2 Add `DB_SLOW_FETCHMANY_SIZE` to `get_db_runtime_config()` (default 5000)
- 2.3 Add a unit test for `read_sql_df_slow_iter` (mock cursor; verify `fetchmany` calls and yields)
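A minimal sketch of the 2.1 generator, assuming a DB-API 2.0 cursor is passed in; the real `database.py` version also takes `timeout_seconds` and owns its connection and cursor lifecycle:

```python
def read_sql_df_slow_iter(cursor, sql, params=None, batch_size=5000):
    """Execute `sql` and yield (columns, rows) batches via cursor.fetchmany().

    Any DB-API 2.0 cursor works: `description` supplies column names and
    `fetchmany` caps how many rows are resident at once, which is the whole
    point of the memory triage.
    """
    cursor.execute(sql, params or ())
    columns = [d[0] for d in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield columns, rows
```

Because it is a generator, callers pay for at most one `batch_size` slab of rows at a time instead of the full 114K-CID result set.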
3. EventFetcher Memory Optimization
- 3.1 Modify `_fetch_batch` in `event_fetcher.py` to use `read_sql_df_slow_iter` instead of `read_sql_df`: iterate rows directly, skip the DataFrame, and group into the `grouped` dict immediately
- 3.2 Update `_sanitize_record` to work with `dict(zip(columns, row))` instead of `row.to_dict()`
- 3.3 Add a unit test verifying EventFetcher imports `read_sql_df_slow_iter`
- 3.4 Update existing EventFetcher tests (mock `read_sql_df_slow_iter` instead of `read_sql_df`)
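The 3.1/3.2 rework amounts to grouping streamed batches without ever building a DataFrame. A sketch under assumptions: the function name and the `container_id` column name are illustrative, while `dict(zip(columns, row))` as the `row.to_dict()` replacement comes from the plan:

```python
def group_events_by_cid(batches, cid_column="container_id"):
    """Group streamed (columns, rows) batches into {cid: [record, ...]}.

    `batches` is the (columns, rows) stream yielded by
    read_sql_df_slow_iter; rows stay plain tuples until the final
    dict conversion, so no intermediate DataFrame is materialized.
    """
    grouped = {}
    for columns, rows in batches:
        cid_idx = columns.index(cid_column)
        for row in rows:
            record = dict(zip(columns, row))  # replaces row.to_dict()
            grouped.setdefault(row[cid_idx], []).append(record)
    return grouped
```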
4. trace_routes Memory Optimization
- 4.1 Modify the events endpoint: only keep `raw_domain_results[domain]` for the MSD profile; for non-MSD, `del events_by_cid` after flattening
- 4.2 Verify the existing `del raw_domain_results` and `gc.collect()` logic is still correct after the refactor
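The release pattern in 4.1/4.2 is: build the flat list, then drop the large intermediates before serialization. A rough sketch; the helper name is hypothetical, and `.clear()` stands in for the `del` statements that operate on endpoint locals:

```python
import gc


def release_intermediates(raw_domain_results, events_by_cid, profile):
    """Flatten events, then free the per-domain intermediates (sketch).

    For non-MSD profiles both large dicts are dropped as soon as the
    flat list exists; gc.collect() then encourages prompt reclamation,
    matching the existing del + gc.collect() logic in trace_routes.
    """
    flat = [ev for events in events_by_cid.values() for ev in events]
    if profile != "MSD":
        events_by_cid.clear()       # stand-in for `del events_by_cid`
    raw_domain_results.clear()      # stand-in for `del raw_domain_results`
    gc.collect()
    return flat
```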
5. Deployment Configuration
- 5.1 Update `.env.example`: add `TRACE_EVENTS_CID_LIMIT` and `DB_SLOW_FETCHMANY_SIZE` with descriptions
- 5.2 Update `.env.example`: change the `GUNICORN_WORKERS` default comment to recommend 2 workers for ≤ 8 GB RAM
- 5.3 Update `.env.example`: change the `TRACE_EVENTS_MAX_WORKERS` and `EVENT_FETCHER_MAX_WORKERS` defaults to 2
- 5.4 Update `deploy/mes-dashboard.service`: add `MemoryHigh=5G` and `MemoryMax=6G`
- 5.5 Update `deploy/mes-dashboard.service`: add a comment explaining the memory limits
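The 5.4/5.5 unit-file change might look like the following excerpt (a sketch; only the two directives and the comment are specified by the plan, the surrounding `[Service]` section is assumed from a standard unit file):

```ini
# deploy/mes-dashboard.service (excerpt)
[Service]
# MemoryHigh throttles the service once it exceeds 5G; MemoryMax is the
# hard cap at which the kernel OOM-kills the service's cgroup, so a repeat
# of the trace-pipeline OOM cannot take down the whole host.
MemoryHigh=5G
MemoryMax=6G
```

`MemoryHigh` gives gunicorn workers soft back-pressure before the hard `MemoryMax` kill fires, which pairs with the 2-worker recommendation in 5.2.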
6. Verification
- 6.1 Run `python -m pytest tests/ -v` and confirm all existing tests pass (1069 passed, 152 skipped)
- 6.2 Verify the `.env.example` env var documentation is consistent with code defaults