Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (`read_sql_df_slow_iter`), admission control (50K CID limit for non-MSD), cache skip for large queries, early memory release with `gc.collect()`
2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs, separate worker process with isolated memory, frontend polling via the `useTraceProgress` composable, systemd service + deploy scripts
3. trace-streaming-response: chunked Redis storage (`TRACE_STREAM_BATCH_SIZE=5000`), NDJSON stream endpoint (`GET /api/trace/job/<id>/stream`), frontend `ReadableStream` consumer for progressive rendering, backward-compatible with legacy single-key storage

All three proposals archived. 1101 tests pass; the frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Dependencies
- 1.1 Add `rq>=1.16.0,<2.0.0` to `requirements.txt`
- 1.2 Add `rq>=1.16.0,<2.0.0` to the pip dependencies in `environment.yml`
2. Trace Job Service
- 2.1 Create `src/mes_dashboard/services/trace_job_service.py` with `enqueue_trace_events_job()`, `get_job_status()`, `get_job_result()`
- 2.2 Implement `execute_trace_events_job()` function (RQ worker entry point): runs EventFetcher + optional MSD aggregation, stores result in Redis with TTL
- 2.3 Add job metadata tracking: `trace:job:{job_id}:meta` Redis key with `{profile, cid_count, domains, status, progress, created_at, completed_at}`
- 2.4 Add unit tests for trace_job_service (13 tests: enqueue, status, result, worker execution, flatten)
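The metadata record in 2.3 can be sketched as follows. This is an illustrative Python sketch, not the project's actual implementation: the helper names `build_job_meta` and `meta_key` are hypothetical, and a plain dict plus `json.dumps` stands in for the Redis `SET` with TTL so the key and payload shape are easy to inspect.

```python
import json
import time
import uuid

def build_job_meta(profile: str, cid_count: int, domains: list) -> dict:
    """Build the initial metadata record for a newly enqueued trace job."""
    return {
        "profile": profile,
        "cid_count": cid_count,
        "domains": domains,
        "status": "queued",   # assumed lifecycle: queued -> started -> finished | failed
        "progress": 0,
        "created_at": time.time(),
        "completed_at": None,
    }

def meta_key(job_id: str) -> str:
    """Redis key under which job metadata is stored (pattern from 2.3)."""
    return f"trace:job:{job_id}:meta"

job_id = uuid.uuid4().hex
meta = build_job_meta("prod", 45_000, ["assembly", "test"])
# A real service would do something like:
#   redis.set(meta_key(job_id), json.dumps(meta), ex=job_ttl_seconds)
payload = json.dumps(meta)
```

Serializing the whole record as one JSON value keeps reads atomic; the TTL on the key is what lets abandoned jobs expire without a cleanup pass.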
3. Async API Endpoints
- 3.1 Modify `events()` in `trace_routes.py`: when `len(container_ids) > TRACE_ASYNC_CID_THRESHOLD` and async is available, call `enqueue_trace_events_job()` and return HTTP 202
- 3.2 Add `GET /api/trace/job/<job_id>` endpoint: return job status from `get_job_status()`
- 3.3 Add `GET /api/trace/job/<job_id>/result` endpoint: return job result from `get_job_result()` with optional `domain`, `offset`, `limit` query params
- 3.4 Add rate limiting to job status/result endpoints (60 req/60s)
- 3.5 Add unit tests for async endpoints (8 tests: async routing, sync fallback, 413 fallback, job status/result)
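The routing decision in 3.1 reduces to a small pure function, sketched below. The function name and the `20_000` default are illustrative assumptions (the task list only states queries >20K CIDs go async and that `TRACE_ASYNC_CID_THRESHOLD` is configurable); the actual route would call `enqueue_trace_events_job()` on the async branch.

```python
TRACE_ASYNC_CID_THRESHOLD = 20_000  # assumed default, per the ">20K CIDs" rule

def choose_trace_route(cid_count: int, async_available: bool,
                       threshold: int = TRACE_ASYNC_CID_THRESHOLD) -> tuple:
    """Return (mode, http_status) for a trace events request."""
    if cid_count > threshold and async_available:
        # The real endpoint would enqueue the job and return the job_id
        # alongside the 202 Accepted response.
        return ("async", 202)
    # Below the threshold, or when no worker is available: synchronous path.
    return ("sync", 200)

assert choose_trace_route(114_000, True) == ("async", 202)
assert choose_trace_route(114_000, False) == ("sync", 200)  # sync fallback
assert choose_trace_route(5_000, True) == ("sync", 200)
```

Keeping the decision in one function makes the "sync fallback" case in 3.5 trivially unit-testable without a Redis or worker process.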
4. Deployment
- 4.1 Create `deploy/mes-dashboard-trace-worker.service` systemd unit (MemoryHigh=3G, MemoryMax=4G)
- 4.2 Update `scripts/start_server.sh`: add `start_rq_worker`/`stop_rq_worker`/`rq_worker_status` functions
- 4.3 Update `scripts/deploy.sh`: add trace worker systemd install instructions
- 4.4 Update `.env.example`: uncomment and add `TRACE_WORKER_ENABLED`, `TRACE_ASYNC_CID_THRESHOLD`, `TRACE_JOB_TTL_SECONDS`, `TRACE_JOB_TIMEOUT_SECONDS`, `TRACE_WORKER_COUNT`, `TRACE_WORKER_QUEUE`
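Reading the `TRACE_*` settings from 4.4 might look like the sketch below. The variable names come from the task list; every default value here is an assumption for illustration, not the project's real default.

```python
import os

def load_trace_worker_config(env=None) -> dict:
    """Parse TRACE_* environment variables with assumed fallback defaults."""
    if env is None:
        env = os.environ
    return {
        "enabled": env.get("TRACE_WORKER_ENABLED", "false").lower() == "true",
        "cid_threshold": int(env.get("TRACE_ASYNC_CID_THRESHOLD", "20000")),
        "job_ttl_seconds": int(env.get("TRACE_JOB_TTL_SECONDS", "3600")),
        "job_timeout_seconds": int(env.get("TRACE_JOB_TIMEOUT_SECONDS", "1800")),
        "worker_count": int(env.get("TRACE_WORKER_COUNT", "1")),
        "queue": env.get("TRACE_WORKER_QUEUE", "trace"),
    }

cfg = load_trace_worker_config({
    "TRACE_WORKER_ENABLED": "true",
    "TRACE_ASYNC_CID_THRESHOLD": "20000",
})
```

Parsing everything once at startup, with explicit `int()` casts, surfaces a malformed `.env` value as an immediate error instead of a silent misconfiguration in the worker.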
5. Frontend Integration
- 5.1 Modify `useTraceProgress.js`: detect async response (`eventsPayload.async === true`), switch to job polling mode
- 5.2 Add `pollJobUntilComplete()` helper: poll `GET /api/trace/job/{job_id}` every 3 s, max 30 minutes
- 5.3 Add `job_progress` reactive state for UI: `{ active, job_id, status, elapsed_seconds, progress }`
- 5.4 Add error handling: job failure (`JOB_FAILED`), polling timeout (`JOB_POLL_TIMEOUT`), abort support
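The actual `pollJobUntilComplete()` helper is JavaScript inside `useTraceProgress.js`; the Python sketch below only mirrors its stated control flow (poll every 3 s, give up after 30 minutes, surface `JOB_FAILED` and `JOB_POLL_TIMEOUT`). The injectable `fetch_status` and `sleep` parameters are testing conveniences, not part of the real API.

```python
POLL_INTERVAL_S = 3
MAX_WAIT_S = 30 * 60  # 30-minute cap from item 5.2

def poll_job_until_complete(fetch_status, sleep=lambda s: None):
    """Poll job status until finished; raise on failure or timeout."""
    max_polls = MAX_WAIT_S // POLL_INTERVAL_S  # 600 polls
    for _ in range(max_polls):
        status = fetch_status()  # stands in for GET /api/trace/job/{job_id}
        if status == "finished":
            return status
        if status == "failed":
            raise RuntimeError("JOB_FAILED")
        sleep(POLL_INTERVAL_S)
    raise TimeoutError("JOB_POLL_TIMEOUT")

# Usage: a fake status sequence that finishes on the third poll.
statuses = iter(["queued", "started", "finished"])
result = poll_job_until_complete(lambda: next(statuses))
```

Separating the loop from the transport is what lets the two error paths in 5.4 be exercised in unit tests without a live server.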
6. Verification
- 6.1 Run `python -m pytest tests/ -v` — 1090 passed, 152 skipped
- 6.2 Run `cd frontend && npm run build` — frontend builds successfully
- 6.3 Verify rq is installed: `python -c "import rq; print(rq.VERSION)"` → 1.16.2