feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming

Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter),
   admission control (50K CID limit for non-MSD), cache skip for large queries,
   early memory release with gc.collect()

2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs,
   separate worker process with isolated memory, frontend polling via
   useTraceProgress composable, systemd service + deploy scripts

3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000),
   NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend
   ReadableStream consumer for progressive rendering, backward-compatible
   with legacy single-key storage
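
A minimal sketch of how the three proposals could fit together, not the code in this commit: route a query by CID count, read rows in batches with DB-API `fetchmany()`, and emit each row as one NDJSON line. The thresholds (20K, 50K, 5000) come from the message above. The function names, the `trace_events` table, the use of `sqlite3`, and the assumption that over-limit non-MSD queries are rejected before the async check are all hypothetical.

```python
import json
import sqlite3
from typing import Iterator, List, Sequence

# Thresholds from the commit message; the constant names other than
# TRACE_STREAM_BATCH_SIZE are hypothetical.
ASYNC_CID_THRESHOLD = 20_000    # queries above this go to the async job queue
MAX_CIDS_NON_MSD = 50_000       # admission limit for non-MSD queries
TRACE_STREAM_BATCH_SIZE = 5000  # rows per stored/streamed chunk


def route_query(cid_count: int, is_msd: bool) -> str:
    """Return 'reject', 'async', or 'sync' for a query of the given size.

    Assumption: the admission check runs before the async check.
    """
    if not is_msd and cid_count > MAX_CIDS_NON_MSD:
        return "reject"
    if cid_count > ASYNC_CID_THRESHOLD:
        return "async"
    return "sync"


def iter_batches(cursor, batch_size: int) -> Iterator[List[Sequence]]:
    """Yield rows in fixed-size batches so the full result set is never
    held in memory at once."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def ndjson_lines(cursor, batch_size: int = TRACE_STREAM_BATCH_SIZE) -> Iterator[str]:
    """Serialize each row as one JSON object per line (NDJSON)."""
    columns = [desc[0] for desc in cursor.description]
    for batch in iter_batches(cursor, batch_size):
        for row in batch:
            yield json.dumps(dict(zip(columns, row))) + "\n"


def demo() -> List[str]:
    """Run the iterator against an in-memory SQLite table (stand-in data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trace_events (cid TEXT, step INTEGER)")
    conn.executemany(
        "INSERT INTO trace_events VALUES (?, ?)",
        [(f"C{i}", i) for i in range(7)],
    )
    cur = conn.execute("SELECT cid, step FROM trace_events ORDER BY step")
    lines = list(ndjson_lines(cur, batch_size=3))
    conn.close()
    return lines
```

Under these assumptions, `route_query(114_000, is_msd=False)` returns `"reject"`, so a non-MSD query the size of the one that triggered the crash would be refused at admission rather than loaded into memory.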

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
egg
2026-02-25 21:01:27 +08:00
parent cbb943dfe5
commit dbe0da057c
32 changed files with 3140 additions and 87 deletions


@@ -27,6 +27,14 @@ StandardOutput=journal
StandardError=journal
SyslogIdentifier=mes-dashboard
# Memory protection (cgroup v2): prevents OOM from killing entire VM.
# MemoryHigh: soft limit — kernel starts reclaiming when exceeded (service stays alive).
# MemoryMax: hard limit — OOM kills only this service (host OS survives).
# Adjust based on VM RAM: MemoryHigh ≈ 70% of VM RAM, MemoryMax ≈ 85% of VM RAM.
# For 7GB VM: MemoryHigh=5G, MemoryMax=6G (leaves ~1GB for OS + Redis).
MemoryHigh=5G
MemoryMax=6G
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
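
The sizing rule in the unit-file comment (MemoryHigh ≈ 70% and MemoryMax ≈ 85% of VM RAM) can be checked with a small calculation. This is an illustrative sketch only: the helper name is hypothetical, and rounding to whole gigabytes is an assumption chosen to match the 7GB example in the comment.

```python
def suggest_memory_limits(vm_ram_gb: float,
                          high_ratio: float = 0.70,
                          max_ratio: float = 0.85) -> dict:
    """Derive systemd MemoryHigh/MemoryMax values from total VM RAM.

    Ratios follow the comment in the unit file; values are rounded to the
    nearest whole gigabyte (an assumption, not something the unit file states).
    """
    memory_high = round(vm_ram_gb * high_ratio)
    memory_max = round(vm_ram_gb * max_ratio)
    return {"MemoryHigh": f"{memory_high}G", "MemoryMax": f"{memory_max}G"}


# 7 * 0.70 = 4.9 -> 5G, 7 * 0.85 = 5.95 -> 6G
limits = suggest_memory_limits(7)
```

For a 7GB VM this reproduces the values in the diff, `MemoryHigh=5G` and `MemoryMax=6G`. On very small VMs the two rounded values can coincide, in which case the soft limit gives no headroom before the hard limit and the values should be set by hand.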