feat(reject-history): fix silent data loss by propagating partial failure metadata to frontend

Chunk failures in BatchQueryEngine were silently discarded — `has_partial_failure` was tracked
in Redis but never surfaced to the API response or frontend. Users could see incomplete data
without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Snapshot progress via `get_batch_progress()` after chunk merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)
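
The single-retry policy described above can be sketched as follows. This is a minimal illustration, not the actual `_execute_single_chunk`: the helper name, the substring-based classification of transient errors, and the optional delay are assumptions (the real code classifies Oracle timeout/connection errors).

```python
import time

# Error messages containing these markers are treated as transient (assumption:
# the real engine inspects oracledb error codes rather than message text).
TRANSIENT_MARKERS = ("timeout", "connection")

def execute_chunk_with_retry(run_chunk, chunk, retry_delay_s=0.0):
    """Run one chunk; retry exactly once if the failure looks transient."""
    try:
        return run_chunk(chunk)
    except Exception as exc:
        message = str(exc).lower()
        if not any(marker in message for marker in TRANSIENT_MARKERS):
            raise  # non-transient: propagate immediately, no retry
        if retry_delay_s:
            time.sleep(retry_delay_s)
        return run_chunk(chunk)  # a second failure propagates to the caller
```

A chunk that fails twice still raises, which is what feeds the `failed_ranges` bookkeeping: the caller records the chunk's time range instead of silently dropping it.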

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd — 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-03-03 14:00:07 +08:00
Commit: a275c30c0e (parent f1506787fb)
35 changed files with 3028 additions and 1460 deletions

@@ -59,6 +59,16 @@ QUERY_TOOL_MAX_CONTAINER_IDS=200
 RESOURCE_DETAIL_DEFAULT_LIMIT=500
 RESOURCE_DETAIL_MAX_LIMIT=500
+# Shared container-resolution guardrails
+# 0 = disable raw input count cap (recommended: rely on expansion limits instead)
+CONTAINER_RESOLVE_INPUT_MAX_VALUES=0
+# Wildcard pattern must include this many literal-prefix chars before %/_ (e.g., GA%)
+CONTAINER_RESOLVE_PATTERN_MIN_PREFIX_LEN=4
+# Per-token expansion guard (avoid one wildcard exploding into too many container IDs)
+CONTAINER_RESOLVE_MAX_EXPANSION_PER_TOKEN=2000
+# Total resolved container-ID guard for a single resolve request
+CONTAINER_RESOLVE_MAX_CONTAINER_IDS=30000
 # Trust boundary for forwarded headers (safe default: false)
 # Direct-exposure deployment (no reverse proxy): keep this false
 TRUST_PROXY_HEADERS=false
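
The pattern guardrail above can be sketched as a pre-resolution check. A minimal illustration, assuming the policy module validates tokens before expanding them against the database (the function name and error wording are assumptions):

```python
MIN_PREFIX_LEN = 4  # CONTAINER_RESOLVE_PATTERN_MIN_PREFIX_LEN

def check_pattern(token: str) -> None:
    """Reject wildcard tokens whose literal prefix is too short.

    A short prefix before %/_ forces a near-full scan and can expand into
    huge container-ID sets, which the per-token and total caps then guard.
    """
    # Position of the first SQL LIKE wildcard, or None for a plain token.
    wildcard_pos = min(
        (i for i, ch in enumerate(token) if ch in "%_"), default=None
    )
    if wildcard_pos is not None and wildcard_pos < MIN_PREFIX_LEN:
        raise ValueError(
            f"pattern {token!r} needs at least {MIN_PREFIX_LEN} "
            "literal characters before %/_"
        )
```

Plain tokens without wildcards pass unchanged; only patterns are held to the prefix minimum.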
@@ -101,14 +111,14 @@ GUNICORN_WORKERS=2
 GUNICORN_THREADS=4
 # Worker timeout (seconds): should stay above DB/query-tool slow paths
-GUNICORN_TIMEOUT=130
+GUNICORN_TIMEOUT=360
 # Graceful shutdown timeout for worker reloads (seconds)
-GUNICORN_GRACEFUL_TIMEOUT=60
+GUNICORN_GRACEFUL_TIMEOUT=300
 # Worker recycle policy (set 0 to disable)
-GUNICORN_MAX_REQUESTS=5000
-GUNICORN_MAX_REQUESTS_JITTER=500
+GUNICORN_MAX_REQUESTS=1200
+GUNICORN_MAX_REQUESTS_JITTER=300
 # ============================================================
 # Redis Configuration (for WIP cache)
@@ -201,6 +211,8 @@ TRACE_EVENTS_MAX_WORKERS=2
 # Max parallel workers for EventFetcher batch queries (per domain)
 # Recommend: 2 (peak concurrent slow queries = TRACE_EVENTS_MAX_WORKERS × this)
 EVENT_FETCHER_MAX_WORKERS=2
+# false = any failed batch raises error (avoid silent partial data)
+EVENT_FETCHER_ALLOW_PARTIAL_RESULTS=false
 # Max parallel workers for forward pipeline WIP+rejects fetching
 FORWARD_PIPELINE_MAX_WORKERS=2
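
With `EVENT_FETCHER_ALLOW_PARTIAL_RESULTS=false`, a single failed batch fails the whole fetch rather than returning a quietly truncated result. A sketch of that merge step (names are illustrative, not the actual EventFetcher API):

```python
class PartialResultError(RuntimeError):
    """Raised when some batches failed and partial results are not allowed."""

def merge_batches(results, errors, allow_partial=False):
    """Merge per-batch result lists; fail fast on any batch error by default."""
    if errors and not allow_partial:
        total = len(results) + len(errors)
        raise PartialResultError(f"{len(errors)} of {total} batches failed")
    merged = []
    for batch in results:
        merged.extend(batch)
    return merged
```

This is the same "no silent partial data" stance the commit applies to the batch query engine, just enforced one layer lower.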
@@ -351,7 +363,7 @@ REJECT_ENGINE_SPOOL_CLEANUP_INTERVAL_SECONDS=300
 REJECT_ENGINE_SPOOL_ORPHAN_GRACE_SECONDS=600
 # Batch query engine thresholds
-BATCH_QUERY_TIME_THRESHOLD_DAYS=60
+BATCH_QUERY_TIME_THRESHOLD_DAYS=10
 BATCH_QUERY_ID_THRESHOLD=1000
 BATCH_CHUNK_MAX_MEMORY_MB=256
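
The lowered time threshold routes more queries through the chunked batch path. A sketch of the routing decision, assuming either threshold alone triggers batching (the combination rule is an assumption; only the threshold names and values come from the config above):

```python
BATCH_QUERY_TIME_THRESHOLD_DAYS = 10
BATCH_QUERY_ID_THRESHOLD = 1000

def needs_batching(range_days: int, id_count: int) -> bool:
    """Route a query to the batch engine when either threshold is exceeded."""
    return (
        range_days > BATCH_QUERY_TIME_THRESHOLD_DAYS
        or id_count > BATCH_QUERY_ID_THRESHOLD
    )
```

Dropping the time threshold from 60 to 10 days means a typical multi-week query now gets per-chunk retry and failed-range tracking instead of one monolithic query.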