
Migration Gates and Runbook

Gate Checklist (Cutover Readiness)

A release is cutover-ready only when all gates pass:

  1. Frontend build gate
  • npm --prefix frontend run build succeeds
  • expected artifacts exist in src/mes_dashboard/static/dist/
  2. Root execution gate
  • startup and deploy scripts run from the repository root only
  • no runtime dependency on any legacy subtree path
  3. Functional parity gate
  • resource-history frontend compute parity checks pass
  • job-query/resource-history export headers match shared field contracts
  4. Cache observability gate
  • /health returns route cache telemetry and degraded flags
  • /health/deep returns route cache telemetry for diagnostics
  • /health includes database_pool.runtime/state and degraded_reason
  • resource/wip derived index telemetry is visible (resource_cache.derived_index, cache.derived_search_index)
  5. Runtime resilience gate
  • pool exhaustion path returns 503 + DB_POOL_EXHAUSTED with a Retry-After header
  • circuit-open path returns 503 + CIRCUIT_BREAKER_OPEN with fail-fast semantics
  • frontend client does not aggressively retry on degraded pool-exhaustion responses
  6. Conda-systemd contract gate
  • deploy/mes-dashboard.service and deploy/mes-dashboard-watchdog.service both run under the same conda runtime contract
  • WATCHDOG_RESTART_FLAG, WATCHDOG_PID_FILE, and WATCHDOG_STATE_FILE paths are consistent across app/admin/watchdog
  • single-port bind (GUNICORN_BIND) remains stable during the restart workflow
  7. Regression gate
  • focused unit/integration test subset passes (see validation evidence)
  8. Documentation alignment gate
  • README.md (and project-required mirror docs such as README.mdj) reflect the current runtime architecture contract
  • resilience diagnostics fields (thresholds/churn/recommendation) are documented for operators
  • frontend shared-core governance updates are reflected in architecture notes
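The artifact and health-telemetry gates above lend themselves to a small scripted check. A minimal sketch, assuming the dist path and /health field names from the checklist; the helper names (`check_dist_artifacts`, `check_health_fields`) are illustrative, not repository scripts:

```shell
# Gate-check sketch: only the paths and field names come from the checklist
# above; pass criteria and helper names are assumptions.

check_dist_artifacts() {
  # Gate 1: the dist directory exists and is non-empty.
  local dist_dir="$1"
  [ -d "$dist_dir" ] && [ -n "$(ls -A "$dist_dir" 2>/dev/null)" ]
}

check_health_fields() {
  # Gate 4: the /health payload carries the expected telemetry fields.
  # Plain grep keeps the sketch dependency-free; jq would be stricter.
  local payload="$1"
  echo "$payload" | grep -q '"degraded_reason"' &&
    echo "$payload" | grep -q '"database_pool"'
}
```

A real gate runner would fetch the payload with something like `curl -fsS "$BASE_URL/health"` (BASE_URL being the deployed origin) and fail the cutover on the first check that returns non-zero.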

Rollout Procedure

  1. Prepare environment
  • activate the conda env (mes-dashboard)
  • install Python deps: pip install -r requirements.txt
  • install frontend deps: npm --prefix frontend install
  2. Build frontend artifacts
  • npm --prefix frontend run build
  3. Run migration gate tests
  • execute the focused pytest set covering templates/cache/contracts/health
  4. Deploy in single-port mode
  • start the app with the root scripts/start_server.sh
  • verify portal and module pages render on the same origin/port
  5. Conda + systemd rehearsal (recommended before production cutover)
  • sudo cp deploy/mes-dashboard.service /etc/systemd/system/
  • sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/
  • sudo mkdir -p /etc/mes-dashboard && sudo cp .env /etc/mes-dashboard/mes-dashboard.env
  • sudo systemctl daemon-reload
  • sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog
  • call /admin/api/worker/status and verify the runtime contract paths exist
  6. Post-deploy checks
  • call /health and /health/deep
  • confirm route cache mode, degraded flags, and pool/runtime diagnostics align with the environment (Redis on/off)
  • trigger one controlled worker restart from the admin API and verify single-port continuity
  • verify the README architecture section matches the deployed runtime contract
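The post-deploy probes can be folded into a tiny pass/fail helper. A minimal sketch, assuming a check passes only on HTTP 200 with an empty (or absent) degraded_reason; the name `smoke_check` and the exact pass criteria are illustrative, not a project contract:

```shell
# Classify one post-deploy probe of /health: HTTP status code plus JSON
# payload in, pass/fail exit status out. A non-empty degraded_reason
# string fails the check.
smoke_check() {
  local status="$1" payload="$2"
  [ "$status" = "200" ] || return 1
  ! echo "$payload" | grep -Eq '"degraded_reason": *"[^"]'
}
```

Wired up with curl it might look like `smoke_check "$(curl -s -o /tmp/h.json -w '%{http_code}' "$BASE_URL/health")" "$(cat /tmp/h.json)"`, where BASE_URL is the deployed origin.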

Rollback Procedure

  1. Rollback trigger criteria
  • any critical gate failure after deployment (page unusable, export mismatch, health degradation beyond acceptable limits)
  2. Operational rollback steps
  • stop the service: scripts/start_server.sh stop
  • restore the previously known-good build artifacts (or the prior release package)
  • restart the service: scripts/start_server.sh start
  • if using systemd: sudo systemctl restart mes-dashboard mes-dashboard-watchdog
  3. Validation after rollback
  • verify /health status is at least the expected baseline
  • re-run focused smoke tests for the portal + key pages
  • confirm CSV export downloads and headers
  • verify degraded_reason is cleared or matches an expected dependency outage only
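"At least the expected baseline" can be made concrete with a small ranking helper. This is a sketch under an assumed three-level status model (ok > degraded > down); the actual /health status vocabulary should be taken from the service contract:

```shell
# Rank health statuses so "at least baseline" becomes a numeric comparison.
# The ok/degraded/down vocabulary is an assumption for illustration.
status_rank() {
  case "$1" in
    ok)       echo 2 ;;
    degraded) echo 1 ;;
    *)        echo 0 ;;
  esac
}

meets_baseline() {
  # True when the observed status is at least as healthy as the expected one.
  [ "$(status_rank "$1")" -ge "$(status_rank "$2")" ]
}
```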

Rollback Rehearsal Checklist

  1. Simulate failure condition (e.g. invalid dist artifact deployment)
  2. Execute stop/restore/start sequence
  3. Verify health and page smoke checks
  4. Capture timings and any manual intervention points
  5. Update this runbook if any step was unclear or missing
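Step 4 (capture timings) is easy to automate around the rehearsal commands. A minimal wrapper sketch; `timed_step` is a hypothetical helper, not part of the repository scripts:

```shell
# Run one rehearsal step and report wall-clock seconds, so slow or
# manual-intervention-heavy steps stand out in the rehearsal log.
timed_step() {
  local name="$1"; shift
  local start end rc
  start=$(date +%s)
  "$@"; rc=$?
  end=$(date +%s)
  echo "$name: $((end - start))s (exit $rc)"
  return "$rc"
}
```

Used as, e.g., `timed_step stop scripts/start_server.sh stop`.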

Alert Thresholds (Operational Contract)

Use these initial thresholds for alerting/escalation:

  1. Sustained degraded state
  • degraded_reason non-empty for >= 5 minutes
  2. Worker restart churn
  • >= 3 watchdog-triggered restarts within 10 minutes
  3. Pool saturation pressure
  • database_pool.state.saturation >= 0.90 for >= 3 consecutive health probes
  4. Frontend/API retry pressure
  • significant increase in client retries for DB_POOL_EXHAUSTED or CIRCUIT_BREAKER_OPEN responses over baseline
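The saturation rule above (>= 0.90 for >= 3 consecutive probes) can be sketched as a streak counter over probe samples. The threshold and streak length mirror the contract above; awk is used only for the floating-point comparison:

```shell
# Alert when saturation stays at or above the threshold for three
# consecutive probes; any sub-threshold sample resets the streak.
saturation_alert() {
  local count=0 sample
  for sample in "$@"; do
    if awk -v s="$sample" 'BEGIN { exit !(s + 0 >= 0.90) }'; then
      count=$((count + 1))
      if [ "$count" -ge 3 ]; then return 0; fi
    else
      count=0
    fi
  done
  return 1
}
```

In practice the samples would come from periodic scrapes of database_pool.state.saturation in /health, one per health probe interval.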