feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming
Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter), admission control (50K CID limit for non-MSD), cache skip for large queries, early memory release with gc.collect()
2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs, separate worker process with isolated memory, frontend polling via useTraceProgress composable, systemd service + deploy scripts
3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000), NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend ReadableStream consumer for progressive rendering, backward-compatible with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
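The fetchmany iterator from proposal 1 can be sketched as follows. This is a minimal illustration using sqlite3 as a stand-in for the Oracle driver: only the name read_sql_df_slow_iter and the batch-size idea come from the commit (the real function presumably returns DataFrames; this sketch yields raw row batches).

```python
import sqlite3
from typing import Iterator

def read_sql_rows_slow_iter(conn, sql, params=(), batch_size=5000) -> Iterator[list]:
    """Yield rows in fetchmany batches instead of loading the full result set.

    Mirrors the idea behind read_sql_df_slow_iter: peak memory is bounded by
    batch_size (DB_SLOW_FETCHMANY_SIZE) rather than by total result size.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        while True:
            batch = cur.fetchmany(batch_size)
            if not batch:
                break
            yield batch
    finally:
        cur.close()

# Demo against an in-memory table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (cid INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(12)])
batches = list(read_sql_rows_slow_iter(conn, "SELECT cid FROM events", batch_size=5))
print([len(b) for b in batches])  # → [5, 5, 2]
```

Callers can process and discard each batch before the next is fetched, which is what keeps peak memory flat for large CID sets.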
.env.example | 49
@@ -85,7 +85,9 @@ LOCAL_AUTH_PASSWORD=
 # Server bind address and port
 GUNICORN_BIND=0.0.0.0:8080
 
-# Number of worker processes (recommend: 2 * CPU cores + 1)
+# Number of worker processes
+# Recommend: 2 for ≤ 8GB RAM (trace queries consume 2-3 GB peak per worker)
+# Recommend: 4 for ≥ 16GB RAM
 GUNICORN_WORKERS=2
 
 # Threads per worker
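The worker-count guidance above is back-of-envelope arithmetic; a sketch of the sizing check implied by the comments (assuming the worst-case 3 GB from the stated 2-3 GB per-worker peak):

```python
# Worst-case memory if every gunicorn worker hits a trace query at once,
# using the upper end of the 2-3 GB peak stated in the .env comment.
gunicorn_workers = 2
peak_gb_per_worker = 3
peak_gb = gunicorn_workers * peak_gb_per_worker
print(peak_gb)  # → 6, which fits in 8 GB with headroom; 4 workers would need ≥ 16 GB

# Peak concurrent slow-query slots per gunicorn worker, per the comments below:
trace_events_max_workers = 2
event_fetcher_max_workers = 2
print(trace_events_max_workers * event_fetcher_max_workers)  # → 4
```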
@@ -168,14 +170,55 @@ CIRCUIT_BREAKER_WINDOW_SIZE=10
 TRACE_SLOW_THRESHOLD_SECONDS=15
 
 # Max parallel workers for events domain fetching (per request)
-TRACE_EVENTS_MAX_WORKERS=4
+# Recommend: 2 (each worker × EVENT_FETCHER_MAX_WORKERS = peak slow query slots)
+TRACE_EVENTS_MAX_WORKERS=2
 
 # Max parallel workers for EventFetcher batch queries (per domain)
-EVENT_FETCHER_MAX_WORKERS=4
+# Recommend: 2 (peak concurrent slow queries = TRACE_EVENTS_MAX_WORKERS × this)
+EVENT_FETCHER_MAX_WORKERS=2
 
 # Max parallel workers for forward pipeline WIP+rejects fetching
 FORWARD_PIPELINE_MAX_WORKERS=2
 
+# --- Admission Control (Proposal 1: trace-events-memory-triage) ---
+# Max container IDs per synchronous events request.
+# Requests exceeding this limit return HTTP 413 (or HTTP 202 when async job queue is enabled).
+# Set based on available RAM: 50K CIDs ≈ 2-3 GB peak memory per request.
+TRACE_EVENTS_CID_LIMIT=50000
+
+# Cursor fetchmany batch size for slow query iterator mode.
+# Smaller = less peak memory; larger = fewer Oracle round-trips.
+DB_SLOW_FETCHMANY_SIZE=5000
+
+# Domain-level cache skip threshold (CID count).
+# When CID count exceeds this, per-domain and route-level cache writes are skipped.
+EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD=10000
+
+# --- Async Job Queue (Proposal 2: trace-async-job-queue) ---
+# Enable RQ trace worker for async large query processing
+# Set to true and start the worker: rq worker trace-events
+TRACE_WORKER_ENABLED=false
+
+# CID threshold for automatic async job routing (requires RQ worker).
+# Requests with CID count > threshold are queued instead of processed synchronously.
+TRACE_ASYNC_CID_THRESHOLD=20000
+
+# Job result retention time in seconds (default: 3600 = 1 hour)
+TRACE_JOB_TTL_SECONDS=3600
+
+# Job execution timeout in seconds (default: 1800 = 30 minutes)
+TRACE_JOB_TIMEOUT_SECONDS=1800
+
+# Number of RQ worker processes for trace jobs
+TRACE_WORKER_COUNT=1
+
+# RQ queue name for trace jobs
+TRACE_WORKER_QUEUE=trace-events
+
+# --- Streaming Response (Proposal 3: trace-streaming-response) ---
+# NDJSON stream batch size (records per NDJSON line)
+TRACE_STREAM_BATCH_SIZE=5000
+
 # ============================================================
 # Performance Metrics Configuration
 # ============================================================
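The admission-control and async-routing thresholds combine into a single decision rule per the comments above. A minimal sketch, assuming a pure decision function (the name route_trace_request and the return values are illustrative, not the actual route code):

```python
def route_trace_request(cid_count: int,
                        cid_limit: int = 50_000,        # TRACE_EVENTS_CID_LIMIT
                        async_threshold: int = 20_000,  # TRACE_ASYNC_CID_THRESHOLD
                        worker_enabled: bool = False):  # TRACE_WORKER_ENABLED
    """Decide how a trace events request is handled, per the thresholds above."""
    if worker_enabled and cid_count > async_threshold:
        return "enqueue"  # HTTP 202: queued for the RQ trace worker
    if cid_count > cid_limit:
        return "reject"   # HTTP 413: too large for synchronous handling
    return "sync"         # processed inline

print(route_trace_request(10_000))                       # → sync
print(route_trace_request(60_000))                       # → reject
print(route_trace_request(60_000, worker_enabled=True))  # → enqueue
```

Note the interaction: with the worker enabled, the 50K limit never fires, because anything above 20K is already routed to the queue.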
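Proposal 3's NDJSON streaming packs TRACE_STREAM_BATCH_SIZE records onto each line. A sketch of the producer side (the function name is an assumption; the endpoint and batch size come from the commit):

```python
import json
from typing import Iterable, Iterator

def ndjson_batches(records: Iterable[dict], batch_size: int = 5000) -> Iterator[str]:
    """Yield NDJSON lines, each carrying up to batch_size records.

    The frontend's ReadableStream consumer can parse and render each line as
    it arrives instead of waiting for the full payload.
    """
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            yield json.dumps(batch) + "\n"
            batch = []
    if batch:  # flush the final partial batch
        yield json.dumps(batch) + "\n"

lines = list(ndjson_batches(({"cid": i} for i in range(7)), batch_size=3))
print(len(lines))                  # → 3 lines: 3 + 3 + 1 records
print(len(json.loads(lines[-1])))  # → 1
```

A generator like this pairs naturally with a chunked HTTP response, so server-side memory stays bounded by one batch at a time.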
||||