
Migration Gates and Runbook

Gate Checklist (Cutover Readiness)

A release is cutover-ready only when all gates pass:

  1. Frontend build gate
  • npm --prefix frontend run build succeeds
  • expected artifacts exist in src/mes_dashboard/static/dist/
  2. Root execution gate
  • startup and deploy scripts run from the repository root only
  • no runtime dependency on any legacy subtree path
  3. Functional parity gate
  • resource-history frontend compute parity checks pass
  • job-query/resource-history export headers match shared field contracts
  4. Cache observability gate
  • /health returns route cache telemetry and degraded flags
  • /health/deep returns route cache telemetry for diagnostics
  • /health includes database_pool.runtime/state and degraded_reason
  • resource/wip derived index telemetry is visible (resource_cache.derived_index, cache.derived_search_index)
  5. Runtime resilience gate
  • pool exhaustion path returns 503 + DB_POOL_EXHAUSTED with a Retry-After header
  • circuit-open path returns 503 + CIRCUIT_BREAKER_OPEN with fail-fast semantics
  • frontend client does not aggressively retry on degraded pool-exhaustion responses
  6. Conda-systemd contract gate
  • deploy/mes-dashboard.service and deploy/mes-dashboard-watchdog.service both run under the same conda runtime contract
  • WATCHDOG_RESTART_FLAG, WATCHDOG_PID_FILE, and WATCHDOG_STATE_FILE paths are consistent across app/admin/watchdog
  • single-port bind (GUNICORN_BIND) remains stable during the restart workflow
  7. Regression gate
  • focused unit/integration test subset passes (see validation evidence)
  8. Documentation alignment gate
  • README.md (and project-required mirror docs such as README.mdj) reflect the current runtime architecture contract
  • resilience diagnostics fields (thresholds/churn/recommendation) are documented for operators
  • frontend shared-core governance updates are reflected in architecture notes
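The artifact and health-telemetry gates above lend themselves to a small scripted check. A minimal sketch, assuming the dist path and /health field names from the checklist; the helper names (`check_dist_artifacts`, `check_health_fields`) are illustrative, not repository scripts:

```shell
# Gate-check sketch: only the paths and field names come from the checklist
# above; pass criteria and helper names are assumptions.

check_dist_artifacts() {
  # Gate 1: the dist directory exists and is non-empty.
  local dist_dir="$1"
  [ -d "$dist_dir" ] && [ -n "$(ls -A "$dist_dir" 2>/dev/null)" ]
}

check_health_fields() {
  # Gate 4: the /health payload carries the expected telemetry fields.
  # Plain grep keeps the sketch dependency-free; jq would be stricter.
  local payload="$1"
  echo "$payload" | grep -q '"degraded_reason"' &&
    echo "$payload" | grep -q '"database_pool"'
}
```

A real gate runner would fetch the payload with something like `curl -fsS "$BASE_URL/health"` (BASE_URL being the deployed origin) and fail the cutover on the first check that returns non-zero.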

Rollout Procedure

  1. Prepare environment
  • activate the conda env (mes-dashboard)
  • install Python deps: pip install -r requirements.txt
  • install frontend deps: npm --prefix frontend install
  2. Build frontend artifacts
  • npm --prefix frontend run build
  3. Run migration gate tests
  • execute the focused pytest set covering templates/cache/contracts/health
  4. Deploy in single-port mode
  • start the app with the root scripts/start_server.sh
  • verify portal and module pages render on the same origin/port
  5. Conda + systemd rehearsal (recommended before production cutover)
  • sudo cp deploy/mes-dashboard.service /etc/systemd/system/
  • sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/
  • sudo mkdir -p /etc/mes-dashboard && sudo cp .env /etc/mes-dashboard/mes-dashboard.env
  • sudo systemctl daemon-reload
  • sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog
  • call /admin/api/worker/status and verify the runtime contract paths exist
  6. Post-deploy checks
  • call /health and /health/deep
  • confirm route cache mode, degraded flags, and pool/runtime diagnostics align with the environment (Redis on/off)
  • trigger one controlled worker restart from the admin API and verify single-port continuity
  • verify the README architecture section matches the deployed runtime contract
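The post-deploy probes can be folded into a tiny pass/fail helper. A minimal sketch, assuming a check passes only on HTTP 200 with an empty (or absent) degraded_reason; the name `smoke_check` and the exact pass criteria are illustrative, not a project contract:

```shell
# Classify one post-deploy probe of /health: HTTP status code plus JSON
# payload in, pass/fail exit status out. A non-empty degraded_reason
# string fails the check.
smoke_check() {
  local status="$1" payload="$2"
  [ "$status" = "200" ] || return 1
  ! echo "$payload" | grep -Eq '"degraded_reason": *"[^"]'
}
```

Wired up with curl it might look like `smoke_check "$(curl -s -o /tmp/h.json -w '%{http_code}' "$BASE_URL/health")" "$(cat /tmp/h.json)"`, where BASE_URL is the deployed origin.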

Rollback Procedure

  1. Rollback trigger criteria
  • any critical gate failure after deployment (page unusable, export mismatch, health degradation beyond acceptable limits)
  2. Operational rollback steps
  • stop the service: scripts/start_server.sh stop
  • restore the previously known-good build artifacts (or the prior release package)
  • restart the service: scripts/start_server.sh start
  • if using systemd: sudo systemctl restart mes-dashboard mes-dashboard-watchdog
  3. Validation after rollback
  • verify /health status is at least the expected baseline
  • re-run focused smoke tests for the portal + key pages
  • confirm CSV export downloads and headers
  • verify degraded_reason is cleared or matches an expected dependency outage only
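"At least the expected baseline" can be made concrete with a small ranking helper. This is a sketch under an assumed three-level status model (ok > degraded > down); the actual /health status vocabulary should be taken from the service contract:

```shell
# Rank health statuses so "at least baseline" becomes a numeric comparison.
# The ok/degraded/down vocabulary is an assumption for illustration.
status_rank() {
  case "$1" in
    ok)       echo 2 ;;
    degraded) echo 1 ;;
    *)        echo 0 ;;
  esac
}

meets_baseline() {
  # True when the observed status is at least as healthy as the expected one.
  [ "$(status_rank "$1")" -ge "$(status_rank "$2")" ]
}
```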

Rollback Rehearsal Checklist

  1. Simulate failure condition (e.g. invalid dist artifact deployment)
  2. Execute stop/restore/start sequence
  3. Verify health and page smoke checks
  4. Capture timings and any manual intervention points
  5. Update this runbook if any step was unclear or missing
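Step 4 (capture timings) is easy to automate around the rehearsal commands. A minimal wrapper sketch; `timed_step` is a hypothetical helper, not part of the repository scripts:

```shell
# Run one rehearsal step and report wall-clock seconds, so slow or
# manual-intervention-heavy steps stand out in the rehearsal log.
timed_step() {
  local name="$1"; shift
  local start end rc
  start=$(date +%s)
  "$@"; rc=$?
  end=$(date +%s)
  echo "$name: $((end - start))s (exit $rc)"
  return "$rc"
}
```

Used as, e.g., `timed_step stop scripts/start_server.sh stop`.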

Alert Thresholds (Operational Contract)

Use these initial thresholds for alerting/escalation:

  1. Sustained degraded state
  • degraded_reason non-empty for >= 5 minutes
  2. Worker restart churn
  • >= 3 watchdog-triggered restarts within 10 minutes
  3. Pool saturation pressure
  • database_pool.state.saturation >= 0.90 for >= 3 consecutive health probes
  4. Frontend/API retry pressure
  • significant increase in client retries for DB_POOL_EXHAUSTED or CIRCUIT_BREAKER_OPEN responses over baseline
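The saturation rule above (>= 0.90 for >= 3 consecutive probes) can be sketched as a streak counter over probe samples. The threshold and streak length mirror the contract above; awk is used only for the floating-point comparison:

```shell
# Alert when saturation stays at or above the threshold for three
# consecutive probes; any sub-threshold sample resets the streak.
saturation_alert() {
  local count=0 sample
  for sample in "$@"; do
    if awk -v s="$sample" 'BEGIN { exit !(s + 0 >= 0.90) }'; then
      count=$((count + 1))
      if [ "$count" -ge 3 ]; then return 0; fi
    else
      count=0
    fi
  done
  return 1
}
```

In practice the samples would come from periodic scrapes of database_pool.state.saturation in /health, one per health probe interval.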