Migration Gates and Runbook
Gate Checklist (Cutover Readiness)
A release is cutover-ready only when all gates pass:
- Frontend build gate
  - `npm --prefix frontend run build` succeeds
  - expected artifacts exist in `src/mes_dashboard/static/dist/`
- Root execution gate
  - startup and deploy scripts run from repository root only
  - no runtime dependency on any legacy subtree path
- Functional parity gate
  - resource-history frontend compute parity checks pass
  - job-query/resource-history export headers match shared field contracts
- Cache observability gate
  - `/health` returns route cache telemetry and degraded flags
  - `/health/deep` returns route cache telemetry for diagnostics
  - `/health` includes `database_pool.runtime/state` and `degraded_reason`
  - resource/wip derived index telemetry is visible (`resource_cache.derived_index`, `cache.derived_search_index`)
- Runtime resilience gate
  - pool exhaustion path returns `503` + `DB_POOL_EXHAUSTED` and `Retry-After`
  - circuit-open path returns `503` + `CIRCUIT_BREAKER_OPEN` with fail-fast semantics
  - frontend client does not aggressively retry on degraded pool exhaustion responses
- Conda-systemd contract gate
  - `deploy/mes-dashboard.service` and `deploy/mes-dashboard-watchdog.service` both run in the same conda runtime contract
  - `WATCHDOG_RESTART_FLAG`, `WATCHDOG_PID_FILE`, `WATCHDOG_STATE_FILE` paths are consistent across app/admin/watchdog
  - single-port bind (`GUNICORN_BIND`) remains stable during restart workflow
- Regression gate
  - focused unit/integration test subset passes (see validation evidence)
- Documentation alignment gate
  - `README.md` (and project-required mirror docs such as `README.mdj`) reflects the current runtime architecture contract
  - resilience diagnostics fields (thresholds/churn/recommendation) are documented for operators
  - frontend shared-core governance updates are reflected in architecture notes
Rollout Procedure
- Prepare environment
  - activate conda env (`mes-dashboard`)
  - install Python deps: `pip install -r requirements.txt`
  - install frontend deps: `npm --prefix frontend install`
- Build frontend artifacts
  - `npm --prefix frontend run build`
- Run migration gate tests
  - execute focused pytest set covering templates/cache/contracts/health
- Deploy with single-port mode
  - start app with root `scripts/start_server.sh`
  - verify portal and module pages render on same origin/port
- Conda + systemd rehearsal (recommended before production cutover)
  - `sudo cp deploy/mes-dashboard.service /etc/systemd/system/`
  - `sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/`
  - `sudo mkdir -p /etc/mes-dashboard && sudo cp .env /etc/mes-dashboard/mes-dashboard.env`
  - `sudo systemctl daemon-reload`
  - `sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog`
  - call `/admin/api/worker/status` and verify runtime contract paths exist
- Post-deploy checks
  - call `/health` and `/health/deep`
  - confirm route cache mode, degraded flags, and pool/runtime diagnostics align with environment (Redis on/off)
  - trigger one controlled worker restart from admin API and verify single-port continuity
  - verify README architecture section matches deployed runtime contract
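The "expected artifacts exist" half of the frontend build gate can be made mechanical with a small pre-deploy check. A sketch, assuming hypothetical artifact names (substitute this project's real build outputs):

```python
from pathlib import Path

# Hypothetical artifact names -- substitute the project's real build outputs.
EXPECTED_ARTIFACTS = ("app.js", "app.css")

def missing_artifacts(dist_dir: str, expected=EXPECTED_ARTIFACTS) -> list[str]:
    """Return the expected frontend build artifacts absent from dist_dir."""
    dist = Path(dist_dir)
    return [name for name in expected if not (dist / name).is_file()]

# Gate usage: missing_artifacts("src/mes_dashboard/static/dist") == [] means pass.
```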
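The rehearsal's runtime contract check (consistent `WATCHDOG_*` paths across app/admin/watchdog) can likewise be scripted. A sketch that compares plain dicts; how each service's environment is collected (parsing the unit files versus the shared env file) is left out and is an assumption of the caller:

```python
# The three path variables the conda-systemd contract gate requires to agree.
CONTRACT_KEYS = ("WATCHDOG_RESTART_FLAG", "WATCHDOG_PID_FILE", "WATCHDOG_STATE_FILE")

def contract_mismatches(envs: dict[str, dict[str, str]],
                        keys=CONTRACT_KEYS) -> list[str]:
    """Return a mismatch description per key whose value differs across environments."""
    problems = []
    for key in keys:
        values = {name: env.get(key) for name, env in envs.items()}
        if len(set(values.values())) > 1:
            problems.append(f"{key} differs: {values}")
    return problems
```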
Rollback Procedure
- Trigger rollback criteria
  - any critical gate failure after deployment (page unusable, export mismatch, health degradation beyond acceptable limits)
- Operational rollback steps
  - stop service: `scripts/start_server.sh stop`
  - restore previously known-good build artifacts (or prior release package)
  - restart service: `scripts/start_server.sh start`
  - if using systemd: `sudo systemctl restart mes-dashboard mes-dashboard-watchdog`
- Validation after rollback
  - verify `/health` status is at least expected baseline
  - re-run focused smoke tests for portal + key pages
  - confirm CSV export downloads and headers
  - verify degraded reason is cleared or matches expected dependency outage only
Rollback Rehearsal Checklist
- Simulate a failure condition (e.g. deploying an invalid dist artifact)
- Execute stop/restore/start sequence
- Verify health and page smoke checks
- Capture timings and any manual intervention points
- Update this runbook if any step was unclear or missing
Alert Thresholds (Operational Contract)
Use these initial thresholds for alerting/escalation:
- Sustained degraded state
degraded_reasonnon-empty for >= 5 minutes
- Worker restart churn
-
= 3 watchdog-triggered restarts within 10 minutes
- Pool saturation pressure
  - `database_pool.state.saturation >= 0.90` for >= 3 consecutive health probes
- Frontend/API retry pressure
  - significant increase of client retries for `DB_POOL_EXHAUSTED` or `CIRCUIT_BREAKER_OPEN` responses over baseline
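The first three thresholds can be evaluated directly from recent health-probe observations. A sketch, assuming the observations are already collected as plain values (timestamps in seconds, saturation as a 0-1 float); the sampling cadence and collection mechanism are left to the alerting pipeline:

```python
def degraded_alert(degraded_seconds: float) -> bool:
    """degraded_reason non-empty for >= 5 minutes."""
    return degraded_seconds >= 300

def restart_churn_alert(restart_timestamps: list[float], now: float) -> bool:
    """>= 3 watchdog-triggered restarts within the last 10 minutes."""
    return sum(1 for t in restart_timestamps if now - t <= 600) >= 3

def saturation_alert(recent_saturation: list[float]) -> bool:
    """database_pool.state.saturation >= 0.90 for >= 3 consecutive probes."""
    return len(recent_saturation) >= 3 and all(s >= 0.90 for s in recent_saturation[-3:])
```

The retry-pressure threshold is deliberately relative ("over baseline"), so it belongs in the metrics backend's anomaly rules rather than a fixed cutoff here.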