feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming

Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter),
   admission control (50K CID limit for non-MSD), cache skip for large
   queries, early memory release with gc.collect()

2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs,
   separate worker process with isolated memory, frontend polling via
   useTraceProgress composable, systemd service + deploy scripts

3. trace-streaming-response: chunked Redis storage
   (TRACE_STREAM_BATCH_SIZE=5000), NDJSON stream endpoint
   (GET /api/trace/job/<id>/stream), frontend ReadableStream consumer for
   progressive rendering, backward-compatible with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
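
As a reading aid, the interaction between the admission limit (proposal 1) and the async threshold (proposal 2) can be sketched roughly as below. This is a minimal sketch, not the code in this commit: `TraceLimits` and `route_events_request` are hypothetical names, the defaults mirror the values in `.env.example`, and the non-MSD exemption mentioned above is deliberately left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TraceLimits:
    """Hypothetical container mirroring the .env.example defaults."""
    cid_limit: int = 50_000            # TRACE_EVENTS_CID_LIMIT
    async_cid_threshold: int = 20_000  # TRACE_ASYNC_CID_THRESHOLD
    worker_enabled: bool = False       # TRACE_WORKER_ENABLED


def route_events_request(cid_count: int, limits: TraceLimits) -> Tuple[int, str]:
    """Return (http_status, mode) for an events request of cid_count CIDs.

    Assumed ordering: when the RQ worker is enabled, anything above the async
    threshold is queued (202); otherwise requests above the CID limit are
    rejected (413); everything else is served synchronously (200).
    """
    if limits.worker_enabled and cid_count > limits.async_cid_threshold:
        return 202, "async"
    if cid_count > limits.cid_limit:
        return 413, "rejected"
    return 200, "sync"


# The 114K-CID query from the incident, with and without the worker enabled.
without_worker = route_events_request(114_000, TraceLimits())
with_worker = route_events_request(114_000, TraceLimits(worker_enabled=True))
```

Under these assumptions the 114K-CID query is rejected when the worker is disabled and queued when it is enabled, so it never runs inside a gunicorn worker.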
2026-02-25 21:01:27 +08:00
parent cbb943dfe5
commit dbe0da057c
32 changed files with 3140 additions and 87 deletions
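
The fetchmany iterator mode named in the commit message (batch size set by `DB_SLOW_FETCHMANY_SIZE`) follows the standard DB-API pattern of draining a cursor in fixed-size batches instead of calling `fetchall()`. The sketch below illustrates that pattern only: it uses an in-memory SQLite database rather than Oracle, and `iter_rows_in_batches` is a hypothetical helper, not the `read_sql_df_slow_iter` function from this commit.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_rows_in_batches(cursor, batch_size: int = 5000) -> Iterator[List[Tuple]]:
    """Yield lists of at most batch_size rows until the cursor is exhausted.

    Only one batch is held by this generator at a time, so peak memory is
    bounded by batch_size rather than by the full result set.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Demo against SQLite; any DB-API 2.0 cursor exposes fetchmany the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (cid INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, f"event-{i}") for i in range(12)],
)
cur = conn.execute("SELECT cid, payload FROM events ORDER BY cid")
batch_sizes = [len(batch) for batch in iter_rows_in_batches(cur, batch_size=5)]
conn.close()
```

A smaller batch size lowers peak memory at the cost of more round-trips to the database, which is the trade-off the `DB_SLOW_FETCHMANY_SIZE` comment describes.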
--- a/.env.example
+++ b/.env.example
@@ -85,7 +85,9 @@ LOCAL_AUTH_PASSWORD=
 # Server bind address and port
 GUNICORN_BIND=0.0.0.0:8080

-# Number of worker processes (recommend: 2 * CPU cores + 1)
+# Number of worker processes
+# Recommend: 2 for ≤ 8GB RAM (trace queries consume 2-3 GB peak per worker)
+# Recommend: 4 for ≥ 16GB RAM
 GUNICORN_WORKERS=2

 # Threads per worker
@@ -168,14 +170,55 @@ CIRCUIT_BREAKER_WINDOW_SIZE=10
 TRACE_SLOW_THRESHOLD_SECONDS=15

 # Max parallel workers for events domain fetching (per request)
-TRACE_EVENTS_MAX_WORKERS=4
+# Recommend: 2 (each worker × EVENT_FETCHER_MAX_WORKERS = peak slow query slots)
+TRACE_EVENTS_MAX_WORKERS=2

 # Max parallel workers for EventFetcher batch queries (per domain)
-EVENT_FETCHER_MAX_WORKERS=4
+# Recommend: 2 (peak concurrent slow queries = TRACE_EVENTS_MAX_WORKERS × this)
+EVENT_FETCHER_MAX_WORKERS=2

 # Max parallel workers for forward pipeline WIP+rejects fetching
 FORWARD_PIPELINE_MAX_WORKERS=2

+# --- Admission Control (Proposal 1: trace-events-memory-triage) ---
+# Max container IDs per synchronous events request.
+# Requests exceeding this limit return HTTP 413 (or HTTP 202 when async job queue is enabled).
+# Set based on available RAM: 50K CIDs ≈ 2-3 GB peak memory per request.
+TRACE_EVENTS_CID_LIMIT=50000
+
+# Cursor fetchmany batch size for slow query iterator mode.
+# Smaller = less peak memory; larger = fewer Oracle round-trips.
+DB_SLOW_FETCHMANY_SIZE=5000
+
+# Domain-level cache skip threshold (CID count).
+# When CID count exceeds this, per-domain and route-level cache writes are skipped.
+EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD=10000
+
+# --- Async Job Queue (Proposal 2: trace-async-job-queue) ---
+# Enable RQ trace worker for async large query processing
+# Set to true and start the worker: rq worker trace-events
+TRACE_WORKER_ENABLED=false
+
+# CID threshold for automatic async job routing (requires RQ worker).
+# Requests with CID count > threshold are queued instead of processed synchronously.
+TRACE_ASYNC_CID_THRESHOLD=20000
+
+# Job result retention time in seconds (default: 3600 = 1 hour)
+TRACE_JOB_TTL_SECONDS=3600
+
+# Job execution timeout in seconds (default: 1800 = 30 minutes)
+TRACE_JOB_TIMEOUT_SECONDS=1800
+
+# Number of RQ worker processes for trace jobs
+TRACE_WORKER_COUNT=1
+
+# RQ queue name for trace jobs
+TRACE_WORKER_QUEUE=trace-events
+
+# --- Streaming Response (Proposal 3: trace-streaming-response) ---
+# NDJSON stream batch size (records per NDJSON line)
+TRACE_STREAM_BATCH_SIZE=5000
+
 # ============================================================
 # Performance Metrics Configuration
 # ============================================================