feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming

Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter),
   admission control (50K CID limit for non-MSD), cache skip for large queries,
   early memory release with gc.collect()

2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs,
   separate worker process with isolated memory, frontend polling via
   useTraceProgress composable, systemd service + deploy scripts

3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000),
   NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend
   ReadableStream consumer for progressive rendering, backward-compatible
   with legacy single-key storage
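
As a rough illustration of the batching in proposal 3, the sketch below groups records into batches of TRACE_STREAM_BATCH_SIZE and emits one NDJSON line per batch. The function name iter_ndjson_lines and the {"records": [...]} line shape are assumptions made for this example; the commit's actual stream format is not shown here.

```python
import json
from typing import Iterable, Iterator, List

# Value taken from the commit; everything else in this sketch is assumed.
TRACE_STREAM_BATCH_SIZE = 5000


def iter_ndjson_lines(
    records: Iterable[dict],
    batch_size: int = TRACE_STREAM_BATCH_SIZE,
) -> Iterator[str]:
    """Yield one newline-terminated JSON document per batch of records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            # Hypothetical line shape: a single object wrapping the batch.
            yield json.dumps({"records": batch}) + "\n"
            batch = []
    if batch:
        # Flush the final, possibly short, batch.
        yield json.dumps({"records": batch}) + "\n"
```

Because the generator holds at most one batch at a time, the producer's memory stays bounded by the batch size, and a consumer can parse and render each line as it arrives.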

All three proposals archived. 1101 tests pass, frontend builds clean.
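
For reference, the fetchmany iterator idea from proposal 1 can be sketched with any DB-API cursor. This is not the commit's read_sql_df_slow_iter, whose signature is not shown here; iter_query_batches is a hypothetical name, and sqlite3 stands in for the Oracle driver so the example runs on its own.

```python
import sqlite3
from typing import Any, Iterator, List, Sequence, Tuple

# Default taken from DB_SLOW_FETCHMANY_SIZE in the env example.
DB_SLOW_FETCHMANY_SIZE = 5000


def iter_query_batches(
    conn: sqlite3.Connection,
    sql: str,
    params: Sequence[Any] = (),
    batch_size: int = DB_SLOW_FETCHMANY_SIZE,
) -> Iterator[List[Tuple[Any, ...]]]:
    """Yield query results in lists of at most batch_size rows."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql, params)
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            # Only this batch is materialized; earlier batches can be
            # released by the caller before the next fetch.
            yield rows
    finally:
        cursor.close()
```

A smaller batch size lowers peak memory at the cost of more round-trips to the database, which is the trade-off the DB_SLOW_FETCHMANY_SIZE comment describes.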

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
egg
2026-02-25 21:01:27 +08:00
parent cbb943dfe5
commit dbe0da057c
32 changed files with 3140 additions and 87 deletions


@@ -85,7 +85,9 @@ LOCAL_AUTH_PASSWORD=
# Server bind address and port
GUNICORN_BIND=0.0.0.0:8080
-# Number of worker processes (recommend: 2 * CPU cores + 1)
+# Number of worker processes
+# Recommend: 2 for ≤ 8GB RAM (trace queries consume 2-3 GB peak per worker)
+# Recommend: 4 for ≥ 16GB RAM
GUNICORN_WORKERS=2
# Threads per worker
@@ -168,14 +170,55 @@ CIRCUIT_BREAKER_WINDOW_SIZE=10
TRACE_SLOW_THRESHOLD_SECONDS=15
# Max parallel workers for events domain fetching (per request)
-TRACE_EVENTS_MAX_WORKERS=4
+# Recommend: 2 (each worker × EVENT_FETCHER_MAX_WORKERS = peak slow query slots)
+TRACE_EVENTS_MAX_WORKERS=2
# Max parallel workers for EventFetcher batch queries (per domain)
-EVENT_FETCHER_MAX_WORKERS=4
+# Recommend: 2 (peak concurrent slow queries = TRACE_EVENTS_MAX_WORKERS × this)
+EVENT_FETCHER_MAX_WORKERS=2
# Max parallel workers for forward pipeline WIP+rejects fetching
FORWARD_PIPELINE_MAX_WORKERS=2
+# --- Admission Control (Proposal 1: trace-events-memory-triage) ---
+# Max container IDs per synchronous events request.
+# Requests exceeding this limit return HTTP 413 (or HTTP 202 when async job queue is enabled).
+# Set based on available RAM: 50K CIDs ≈ 2-3 GB peak memory per request.
+TRACE_EVENTS_CID_LIMIT=50000
+# Cursor fetchmany batch size for slow query iterator mode.
+# Smaller = less peak memory; larger = fewer Oracle round-trips.
+DB_SLOW_FETCHMANY_SIZE=5000
+# Domain-level cache skip threshold (CID count).
+# When CID count exceeds this, per-domain and route-level cache writes are skipped.
+EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD=10000
+# --- Async Job Queue (Proposal 2: trace-async-job-queue) ---
+# Enable RQ trace worker for async large query processing
+# Set to true and start the worker: rq worker trace-events
+TRACE_WORKER_ENABLED=false
+# CID threshold for automatic async job routing (requires RQ worker).
+# Requests with CID count > threshold are queued instead of processed synchronously.
+TRACE_ASYNC_CID_THRESHOLD=20000
+# Job result retention time in seconds (default: 3600 = 1 hour)
+TRACE_JOB_TTL_SECONDS=3600
+# Job execution timeout in seconds (default: 1800 = 30 minutes)
+TRACE_JOB_TIMEOUT_SECONDS=1800
+# Number of RQ worker processes for trace jobs
+TRACE_WORKER_COUNT=1
+# RQ queue name for trace jobs
+TRACE_WORKER_QUEUE=trace-events
+# --- Streaming Response (Proposal 3: trace-streaming-response) ---
+# NDJSON stream batch size (records per NDJSON line)
+TRACE_STREAM_BATCH_SIZE=5000
# ============================================================
# Performance Metrics Configuration
# ============================================================
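
To show how the admission-control and async-routing settings above might interact, here is a minimal sketch. The helper route_events_request and the order of the checks are assumptions made for illustration; the exemption for MSD queries mentioned in the commit message is not modeled, and the defaults simply mirror the values in this file.

```python
import os
from typing import Mapping


def _env_bool(value: str) -> bool:
    """Interpret common truthy strings used in env files."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def route_events_request(cid_count: int, env: Mapping[str, str] = os.environ) -> int:
    """Return the HTTP status a request would get: 200, 202, or 413.

    Hypothetical helper: the real route handler is not part of this diff.
    """
    cid_limit = int(env.get("TRACE_EVENTS_CID_LIMIT", "50000"))
    async_threshold = int(env.get("TRACE_ASYNC_CID_THRESHOLD", "20000"))
    worker_enabled = _env_bool(env.get("TRACE_WORKER_ENABLED", "false"))

    # With the RQ worker enabled, large requests are queued (202 Accepted).
    if worker_enabled and cid_count > async_threshold:
        return 202
    # Without it, requests above the synchronous limit are rejected (413).
    if cid_count > cid_limit:
        return 413
    # Everything else is processed synchronously.
    return 200
```

Under these assumptions, a 114K-CID request like the one in the 2026-02-25 incident would be rejected with 413 when the worker is disabled and queued with 202 when it is enabled.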