feat(reject-history): fix silent data loss by propagating partial failure metadata to frontend

Chunk failures in BatchQueryEngine were silently discarded — `has_partial_failure` was tracked
in Redis but never surfaced to the API response or frontend. Users could see incomplete data
without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Snapshot progress via `get_batch_progress()` after chunk merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)
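
The single-retry policy described above can be sketched as follows. This is a minimal illustration, not the actual `_execute_single_chunk`: the helper name, the substring-based classification of transient errors, and the optional delay are assumptions (the real code classifies Oracle timeout/connection errors).

```python
import time

# Error messages containing these markers are treated as transient (assumption:
# the real engine inspects oracledb error codes rather than message text).
TRANSIENT_MARKERS = ("timeout", "connection")

def execute_chunk_with_retry(run_chunk, chunk, retry_delay_s=0.0):
    """Run one chunk; retry exactly once if the failure looks transient."""
    try:
        return run_chunk(chunk)
    except Exception as exc:
        message = str(exc).lower()
        if not any(marker in message for marker in TRANSIENT_MARKERS):
            raise  # non-transient: propagate immediately, no retry
        if retry_delay_s:
            time.sleep(retry_delay_s)
        return run_chunk(chunk)  # a second failure propagates to the caller
```

A chunk that fails twice still raises, which is what feeds the `failed_ranges` bookkeeping: the caller records the chunk's time range instead of silently dropping it.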

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd — 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-03-03 14:00:07 +08:00
Commit: a275c30c0e (parent f1506787fb)
35 changed files with 3028 additions and 1460 deletions

@@ -59,6 +59,16 @@ QUERY_TOOL_MAX_CONTAINER_IDS=200
 RESOURCE_DETAIL_DEFAULT_LIMIT=500
 RESOURCE_DETAIL_MAX_LIMIT=500
+# Shared container-resolution guardrails
+# 0 = disable raw input count cap (recommended: rely on expansion limits instead)
+CONTAINER_RESOLVE_INPUT_MAX_VALUES=0
+# Wildcard pattern must include this many literal-prefix chars before %/_ (e.g., GA%)
+CONTAINER_RESOLVE_PATTERN_MIN_PREFIX_LEN=4
+# Per-token expansion guard (avoid one wildcard exploding into too many container IDs)
+CONTAINER_RESOLVE_MAX_EXPANSION_PER_TOKEN=2000
+# Total resolved container-ID guard for a single resolve request
+CONTAINER_RESOLVE_MAX_CONTAINER_IDS=30000
 # Trust boundary for forwarded headers (safe default: false)
 # Direct-exposure deployment (no reverse proxy): keep this false
 TRUST_PROXY_HEADERS=false
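
The pattern guardrail above can be sketched as a pre-resolution check. A minimal illustration, assuming the policy module validates tokens before expanding them against the database (the function name and error wording are assumptions):

```python
MIN_PREFIX_LEN = 4  # CONTAINER_RESOLVE_PATTERN_MIN_PREFIX_LEN

def check_pattern(token: str) -> None:
    """Reject wildcard tokens whose literal prefix is too short.

    A short prefix before %/_ forces a near-full scan and can expand into
    huge container-ID sets, which the per-token and total caps then guard.
    """
    # Position of the first SQL LIKE wildcard, or None for a plain token.
    wildcard_pos = min(
        (i for i, ch in enumerate(token) if ch in "%_"), default=None
    )
    if wildcard_pos is not None and wildcard_pos < MIN_PREFIX_LEN:
        raise ValueError(
            f"pattern {token!r} needs at least {MIN_PREFIX_LEN} "
            "literal characters before %/_"
        )
```

Plain tokens without wildcards pass unchanged; only patterns are held to the prefix minimum.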
@@ -101,14 +111,14 @@ GUNICORN_WORKERS=2
 GUNICORN_THREADS=4
 # Worker timeout (seconds): should stay above DB/query-tool slow paths
-GUNICORN_TIMEOUT=130
+GUNICORN_TIMEOUT=360
 # Graceful shutdown timeout for worker reloads (seconds)
-GUNICORN_GRACEFUL_TIMEOUT=60
+GUNICORN_GRACEFUL_TIMEOUT=300
 # Worker recycle policy (set 0 to disable)
-GUNICORN_MAX_REQUESTS=5000
-GUNICORN_MAX_REQUESTS_JITTER=500
+GUNICORN_MAX_REQUESTS=1200
+GUNICORN_MAX_REQUESTS_JITTER=300
 # ============================================================
 # Redis Configuration (for WIP cache)
@@ -201,6 +211,8 @@ TRACE_EVENTS_MAX_WORKERS=2
 # Max parallel workers for EventFetcher batch queries (per domain)
 # Recommend: 2 (peak concurrent slow queries = TRACE_EVENTS_MAX_WORKERS × this)
 EVENT_FETCHER_MAX_WORKERS=2
+# false = any failed batch raises error (avoid silent partial data)
+EVENT_FETCHER_ALLOW_PARTIAL_RESULTS=false
 # Max parallel workers for forward pipeline WIP+rejects fetching
 FORWARD_PIPELINE_MAX_WORKERS=2
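
With `EVENT_FETCHER_ALLOW_PARTIAL_RESULTS=false`, a single failed batch fails the whole fetch rather than returning a quietly truncated result. A sketch of that merge step (names are illustrative, not the actual EventFetcher API):

```python
class PartialResultError(RuntimeError):
    """Raised when some batches failed and partial results are not allowed."""

def merge_batches(results, errors, allow_partial=False):
    """Merge per-batch result lists; fail fast on any batch error by default."""
    if errors and not allow_partial:
        total = len(results) + len(errors)
        raise PartialResultError(f"{len(errors)} of {total} batches failed")
    merged = []
    for batch in results:
        merged.extend(batch)
    return merged
```

This is the same "no silent partial data" stance the commit applies to the batch query engine, just enforced one layer lower.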
@@ -351,7 +363,7 @@ REJECT_ENGINE_SPOOL_CLEANUP_INTERVAL_SECONDS=300
 REJECT_ENGINE_SPOOL_ORPHAN_GRACE_SECONDS=600
 # Batch query engine thresholds
-BATCH_QUERY_TIME_THRESHOLD_DAYS=60
+BATCH_QUERY_TIME_THRESHOLD_DAYS=10
 BATCH_QUERY_ID_THRESHOLD=1000
 BATCH_CHUNK_MAX_MEMORY_MB=256
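
The lowered time threshold routes more queries through the chunked batch path. A sketch of the routing decision, assuming either threshold alone triggers batching (the combination rule is an assumption; only the threshold names and values come from the config above):

```python
BATCH_QUERY_TIME_THRESHOLD_DAYS = 10
BATCH_QUERY_ID_THRESHOLD = 1000

def needs_batching(range_days: int, id_count: int) -> bool:
    """Route a query to the batch engine when either threshold is exceeded."""
    return (
        range_days > BATCH_QUERY_TIME_THRESHOLD_DAYS
        or id_count > BATCH_QUERY_ID_THRESHOLD
    )
```

Dropping the time threshold from 60 to 10 days means a typical multi-week query now gets per-chunk retry and failed-range tracking instead of one monolithic query.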