From e25503941947ffd79ea1eac8b20757b20c8e7a24 Mon Sep 17 00:00:00 2001
From: egg
Date: Sun, 14 Dec 2025 15:08:33 +0800
Subject: [PATCH] chore: remove AI dev files from repo and clean up env config
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Remove .claude/, openspec/, AGENTS.md, CLAUDE.md from git tracking
- Simplify .env.example: remove unused path configs (use config.py defaults)
- Clean up .env for production: remove hardcoded secrets, use env var substitution
- Path configs now use sensible defaults from backend/app/core/config.py:
  - uploads -> backend/uploads/
  - storage -> backend/storage/
  - results -> backend/storage/results/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 .claude/commands/openspec/apply.md | 23 -
 .claude/commands/openspec/archive.md | 27 -
 .claude/commands/openspec/proposal.md | 28 -
 .env | 85 +-
 .env.example | 55 +-
 AGENTS.md | 18 -
 CLAUDE.md | 18 -
 openspec/AGENTS.md | 456 ------
 .../proposal.md | 24 -
 .../specs/result-export/spec.md | 46 -
 .../specs/task-management/spec.md | 51 -
 .../2025-11-17-fix-v2-api-ui-issues/tasks.md | 40 -
 .../proposal.md | 84 --
 .../specs/environment-setup/spec.md | 77 -
 .../specs/ocr-processing/spec.md | 89 --
 .../tasks.md | 103 --
 .../OFFICE_INTEGRATION.md | 186 ---
 .../SESSION_SUMMARY.md | 294 ----
 .../STATUS.md | 616 --------
 .../design.md | 313 ----
 .../proposal.md | 48 -
 .../specs/export-results/spec.md | 175 ---
 .../specs/file-management/spec.md | 96 --
 .../specs/ocr-processing/spec.md | 125 --
 .../tasks.md | 230 ---
 .../IMPLEMENTATION.md | 122 --
 .../design.md | 176 ---
 .../proposal.md | 52 -
 .../specs/file-processing/spec.md | 54 -
 .../tasks.md | 70 -
 .../ARCHITECTURE-REFACTOR-PLAN.md | 817 -----------
 .../PP-STRUCTURE-ENHANCEMENT-PLAN.md | 691 ---------
 .../paddleocr_layout_recovery_research.md | 1279 -----------------
 .../proposal.md | 148 --
 .../specs/result-export/spec.md | 57 -
 .../specs/task-management/spec.md | 63 -
 .../tasks.md | 106 --
 .../FRONTEND_IMPLEMENTATION.md | 519 -------
 .../IMPLEMENTATION_COMPLETE.md | 556 -------
 .../PROGRESS_UPDATE.md | 304 ----
 .../database_schema.sql | 183 ---
 .../proposal.md | 294 ----
 .../tasks.md | 276 ----
 .../ARCHIVE.md | 427 ------
 .../design.md | 392 -----
 .../proposal.md | 35 -
 .../specs/document-processing/spec.md | 108 --
 .../specs/result-export/spec.md | 74 -
 .../specs/task-management/spec.md | 105 --
 .../tasks.md | 207 ---
 .../design.md | 361 -----
 .../proposal.md | 57 -
 .../specs/result-export/spec.md | 88 --
 .../tasks.md | 234 ---
 .../proposal.md | 50 -
 .../specs/result-export/spec.md | 38 -
 .../tasks.md | 54 -
 .../IMPLEMENTATION_SUMMARY.md | 362 -----
 .../proposal.md | 207 ---
 .../specs/ocr-processing/spec.md | 100 --
 .../tasks.md | 178 ---
 .../delta-ocr-processing.md | 146 --
 .../delta-task-management.md | 225 ---
 .../design.md | 587 --------
 .../proposal.md | 77 -
 .../specs/memory-management/spec.md | 104 --
 .../tasks.md | 176 ---
 .../proposal.md | 28 -
 .../specs/ocr-processing/spec.md | 61 -
 .../tasks.md | 43 -
 .../design.md | 173 ---
 .../proposal.md | 45 -
 .../specs/ocr-processing/spec.md | 51 -
 .../tasks.md | 43 -
 .../design.md | 192 ---
 .../proposal.md | 74 -
 .../specs/ocr-processing/spec.md | 128 --
 .../tasks.md | 141 --
 .../design.md | 183 ---
 .../proposal.md | 30 -
 .../specs/ocr-processing/spec.md | 111 --
 .../tasks.md | 44 -
 .../proposal.md | 108 --
 .../specs/pdf-generation/spec.md | 52 -
 .../tasks.md | 55 -
 .../proposal.md | 40 -
 .../specs/ocr-processing/spec.md | 86 --
 .../tasks.md | 56 -
 .../MODEL_CLEANUP.md | 141 --
 .../proposal.md | 134 --
 .../specs/ocr-processing/spec.md | 56 -
 .../tasks.md | 77 -
 .../proposal.md | 72 -
 .../specs/ocr-processing/spec.md | 42 -
 .../2025-11-28-unify-image-scaling/tasks.md | 113 --
 .../proposal.md | 134 --
 .../specs/ocr-processing/spec.md | 132 --
 .../tasks.md | 273 ----
 .../design.md | 265 ----
 .../proposal.md | 54 -
 .../specs/result-export/spec.md | 55 -
 .../specs/translation/spec.md | 184 ---
 .../tasks.md | 121 --
 .../design.md | 91 --
 .../proposal.md | 29 -
 .../specs/result-export/spec.md | 55 -
 .../specs/translation/spec.md | 72 -
 .../tasks.md | 40 -
 .../design.md | 96 --
 .../proposal.md | 25 -
 .../specs/development-environment/spec.md | 73 -
 .../tasks.md | 31 -
 .../design.md | 68 -
 .../proposal.md | 18 -
 .../specs/result-export/spec.md | 41 -
 .../tasks.md | 20 -
 .../design.md | 167 ---
 .../proposal.md | 41 -
 .../specs/result-export/spec.md | 137 --
 .../tasks.md | 30 -
 .../design.md | 458 ------
 .../proposal.md | 44 -
 .../tasks.md | 93 --
 .../proposal.md | 73 -
 .../specs/ocr-processing/spec.md | 64 -
 .../tasks.md | 124 --
 .../design.md | 240 ----
 .../proposal.md | 68 -
 .../specs/document-processing/spec.md | 153 --
 .../tasks.md | 110 --
 .../design.md | 227 ---
 .../proposal.md | 116 --
 .../specs/ocr-processing/spec.md | 96 --
 .../tasks.md | 75 -
 .../test-notes.md | 14 -
 .../2025-12-11-cleanup-dead-code/proposal.md | 175 ---
 .../specs/document-processing/spec.md | 42 -
 .../2025-12-11-cleanup-dead-code/tasks.md | 92 --
 .../proposal.md | 52 -
 .../specs/ocr-processing/spec.md | 80 --
 .../tasks.md | 71 -
 .../design.md | 88 --
 .../proposal.md | 17 -
 .../specs/ocr-processing/spec.md | 91 --
 .../tasks.md | 34 -
 .../design.md | 227 ---
 .../proposal.md | 56 -
 .../specs/document-processing/spec.md | 59 -
 .../tasks.md | 59 -
 .../proposal.md | 49 -
 .../specs/ocr-processing/spec.md | 142 --
 .../tasks.md | 54 -
 .../2025-12-11-remove-unused-code/proposal.md | 55 -
 .../specs/document-processing/spec.md | 61 -
 .../2025-12-11-remove-unused-code/tasks.md | 52 -
 .../design.md | 141 --
 .../proposal.md | 42 -
 .../tasks.md | 57 -
 .../proposal.md | 59 -
 .../specs/result-export/spec.md | 24 -
 .../tasks.md | 57 -
 .../proposal.md | 25 -
 .../specs/ocr-processing/spec.md | 127 --
 .../tasks.md | 51 -
 .../design.md | 130 --
 .../proposal.md | 54 -
 .../specs/document-processing/spec.md | 43 -
 .../specs/result-export/spec.md | 36 -
 .../specs/translation/spec.md | 46 -
 .../tasks.md | 78 -
 .../design.md | 234 ---
 .../proposal.md | 75 -
 .../specs/document-processing/spec.md | 36 -
 .../tasks.md | 48 -
 .../proposal.md | 43 -
 .../specs/frontend-ui/spec.md | 100 --
 .../2025-12-12-add-batch-processing/tasks.md | 42 -
 .../proposal.md | 51 -
 .../specs/result-export/spec.md | 23 -
 .../tasks.md | 51 -
 .../proposal.md | 70 -
 .../specs/translation/spec.md | 56 -
 .../tasks.md | 76 -
 .../proposal.md | 201 ---
 .../tasks.md | 68 -
 .../proposal.md | 38 -
 .../specs/frontend-ui/spec.md | 89 --
 .../tasks.md | 29 -
 .../proposal.md | 60 -
 .../specs/task-management/spec.md | 116 --
 .../2025-12-14-add-storage-cleanup/tasks.md | 49 -
 .../proposal.md | 52 -
 .../2025-12-14-enable-audit-logging/tasks.md | 33 -
 .../proposal.md | 62 -
 .../specs/backend-api/spec.md | 22 -
 .../specs/frontend-ui/spec.md | 31 -
 .../tasks.md | 39 -
 openspec/project.md | 341 -----
 openspec/specs/document-processing/spec.md | 183 ---
 openspec/specs/frontend-ui/spec.md | 188 ---
 openspec/specs/ocr-processing/spec.md | 311 ----
 openspec/specs/result-export/spec.md | 207 ---
 openspec/specs/task-management/spec.md | 199 ---
 openspec/specs/translation/spec.md | 304 ----
 204 files changed, 35 insertions(+), 26070 deletions(-)
 delete mode 100644 .claude/commands/openspec/apply.md
 delete mode 100644 .claude/commands/openspec/archive.md
 delete mode 100644 .claude/commands/openspec/proposal.md
 delete mode 100644 AGENTS.md
 delete mode 100644 CLAUDE.md
 delete mode 100644 openspec/AGENTS.md
 delete mode 100644 openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/specs/task-management/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/specs/environment-setup/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-ocr-batch-processing/OFFICE_INTEGRATION.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-ocr-batch-processing/SESSION_SUMMARY.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-ocr-batch-processing/STATUS.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-ocr-batch-processing/design.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-ocr-batch-processing/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/export-results/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/file-management/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-ocr-batch-processing/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-office-document-support/IMPLEMENTATION.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-office-document-support/design.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-office-document-support/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-office-document-support/specs/file-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-18-add-office-document-support/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/ARCHITECTURE-REFACTOR-PLAN.md
 delete mode 100644 openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/PP-STRUCTURE-ENHANCEMENT-PLAN.md
 delete mode 100644 openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/paddleocr_layout_recovery_research.md
 delete mode 100644 openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/specs/task-management/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/FRONTEND_IMPLEMENTATION.md
 delete mode 100644 openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/IMPLEMENTATION_COMPLETE.md
 delete mode 100644 openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/PROGRESS_UPDATE.md
 delete mode 100644 openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/database_schema.sql
 delete mode 100644 openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-20-dual-track-document-processing/ARCHIVE.md
 delete mode 100644 openspec/changes/archive/2025-11-20-dual-track-document-processing/design.md
 delete mode 100644 openspec/changes/archive/2025-11-20-dual-track-document-processing/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/document-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/task-management/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-20-dual-track-document-processing/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-24-pdf-layout-restoration/design.md
 delete mode 100644 openspec/changes/archive/2025-11-24-pdf-layout-restoration/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-24-pdf-layout-restoration/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-24-pdf-layout-restoration/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/IMPLEMENTATION_SUMMARY.md
 delete mode 100644 openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-26-enhance-memory-management/delta-ocr-processing.md
 delete mode 100644 openspec/changes/archive/2025-11-26-enhance-memory-management/delta-task-management.md
 delete mode 100644 openspec/changes/archive/2025-11-26-enhance-memory-management/design.md
 delete mode 100644 openspec/changes/archive/2025-11-26-enhance-memory-management/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-26-enhance-memory-management/specs/memory-management/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-26-enhance-memory-management/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/design.md
 delete mode 100644 openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-27-add-layout-preprocessing/design.md
 delete mode 100644 openspec/changes/archive/2025-11-27-add-layout-preprocessing/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-27-add-layout-preprocessing/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-27-add-layout-preprocessing/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/design.md
 delete mode 100644 openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/specs/pdf-generation/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/MODEL_CLEANUP.md
 delete mode 100644 openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-28-unify-image-scaling/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-28-unify-image-scaling/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-28-unify-image-scaling/tasks.md
 delete mode 100644 openspec/changes/archive/2025-11-30-extract-table-cell-boxes/proposal.md
 delete mode 100644 openspec/changes/archive/2025-11-30-extract-table-cell-boxes/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-11-30-extract-table-cell-boxes/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-document-translation/design.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-document-translation/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-document-translation/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-document-translation/specs/translation/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-document-translation/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-translated-pdf-export/design.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-translated-pdf-export/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-translated-pdf-export/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-translated-pdf-export/specs/translation/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-02-add-translated-pdf-export/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-02-unify-environment-scripts/design.md
 delete mode 100644 openspec/changes/archive/2025-12-02-unify-environment-scripts/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-02-unify-environment-scripts/specs/development-environment/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-02-unify-environment-scripts/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/design.md
 delete mode 100644 openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md
 delete mode 100644 openspec/changes/archive/2025-12-04-improve-translated-text-fitting/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-04-improve-translated-text-fitting/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-04-improve-translated-text-fitting/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/design.md
 delete mode 100644 openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/design.md
 delete mode 100644 openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/specs/document-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-10-add-ocr-processing-presets/design.md
 delete mode 100644 openspec/changes/archive/2025-12-10-add-ocr-processing-presets/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-10-add-ocr-processing-presets/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-10-add-ocr-processing-presets/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-10-add-ocr-processing-presets/test-notes.md
 delete mode 100644 openspec/changes/archive/2025-12-11-cleanup-dead-code/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-cleanup-dead-code/specs/document-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-cleanup-dead-code/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/design.md
 delete mode 100644 openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-fix-table-column-alignment/design.md
 delete mode 100644 openspec/changes/archive/2025-12-11-fix-table-column-alignment/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-fix-table-column-alignment/specs/document-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-fix-table-column-alignment/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-remove-unused-code/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-remove-unused-code/specs/document-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-remove-unused-code/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-simple-text-positioning/design.md
 delete mode 100644 openspec/changes/archive/2025-12-11-simple-text-positioning/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-simple-text-positioning/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-simplify-frontend-export-options/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-simplify-frontend-export-options/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-simplify-frontend-export-options/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/specs/ocr-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/design.md
 delete mode 100644 openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/document-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/translation/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/design.md
 delete mode 100644 openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/specs/document-processing/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-12-add-batch-processing/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-12-add-batch-processing/specs/frontend-ui/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-12-add-batch-processing/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/specs/result-export/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-12-fix-ocr-track-translation/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-12-fix-ocr-track-translation/specs/translation/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-12-fix-ocr-track-translation/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-12-optimize-task-files-and-visualization/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-12-optimize-task-files-and-visualization/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/specs/frontend-ui/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-14-add-storage-cleanup/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-14-add-storage-cleanup/specs/task-management/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-14-add-storage-cleanup/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-14-enable-audit-logging/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-14-enable-audit-logging/tasks.md
 delete mode 100644 openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/proposal.md
 delete mode 100644 openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/specs/backend-api/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/specs/frontend-ui/spec.md
 delete mode 100644 openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/tasks.md
 delete mode 100644 openspec/project.md
 delete mode 100644 openspec/specs/document-processing/spec.md
 delete mode 100644 openspec/specs/frontend-ui/spec.md
 delete mode 100644 openspec/specs/ocr-processing/spec.md
 delete mode 100644 openspec/specs/result-export/spec.md
 delete mode 100644 openspec/specs/task-management/spec.md
 delete mode 100644 openspec/specs/translation/spec.md

diff --git a/.claude/commands/openspec/apply.md b/.claude/commands/openspec/apply.md
deleted file mode 100644
index a36fd96..0000000
--- a/.claude/commands/openspec/apply.md
+++ /dev/null
@@ -1,23 +0,0 @@
----
-name: OpenSpec: Apply
-description: Implement an approved OpenSpec change and keep tasks in sync.
-category: OpenSpec
-tags: [openspec, apply]
----
-
-**Guardrails**
-- Favor straightforward, minimal implementations first and add complexity only when it is requested or clearly required.
-- Keep changes tightly scoped to the requested outcome.
-- Refer to `openspec/AGENTS.md` (located inside the `openspec/` directory—run `ls openspec` or `openspec update` if you don't see it) if you need additional OpenSpec conventions or clarifications.
-
-**Steps**
-Track these steps as TODOs and complete them one by one.
-1. Read `changes/<change-id>/proposal.md`, `design.md` (if present), and `tasks.md` to confirm scope and acceptance criteria.
-2. Work through tasks sequentially, keeping edits minimal and focused on the requested change.
-3. Confirm completion before updating statuses—make sure every item in `tasks.md` is finished.
-4. Update the checklist after all work is done so each task is marked `- [x]` and reflects reality.
-5. Reference `openspec list` or `openspec show <change-id>` when additional context is required.
-
-**Reference**
-- Use `openspec show <change-id> --json --deltas-only` if you need additional context from the proposal while implementing.
-
diff --git a/.claude/commands/openspec/archive.md b/.claude/commands/openspec/archive.md
deleted file mode 100644
index dbc7695..0000000
--- a/.claude/commands/openspec/archive.md
+++ /dev/null
@@ -1,27 +0,0 @@
----
-name: OpenSpec: Archive
-description: Archive a deployed OpenSpec change and update specs.
-category: OpenSpec
-tags: [openspec, archive]
----
-
-**Guardrails**
-- Favor straightforward, minimal implementations first and add complexity only when it is requested or clearly required.
-- Keep changes tightly scoped to the requested outcome.
-- Refer to `openspec/AGENTS.md` (located inside the `openspec/` directory—run `ls openspec` or `openspec update` if you don't see it) if you need additional OpenSpec conventions or clarifications.
-
-**Steps**
-1. Determine the change ID to archive:
-   - If this prompt already includes a specific change ID (for example inside a `` block populated by slash-command arguments), use that value after trimming whitespace.
-   - If the conversation references a change loosely (for example by title or summary), run `openspec list` to surface likely IDs, share the relevant candidates, and confirm which one the user intends.
-   - Otherwise, review the conversation, run `openspec list`, and ask the user which change to archive; wait for a confirmed change ID before proceeding.
-   - If you still cannot identify a single change ID, stop and tell the user you cannot archive anything yet.
-2. Validate the change ID by running `openspec list` (or `openspec show <change-id>`) and stop if the change is missing, already archived, or otherwise not ready to archive.
-3. Run `openspec archive <change-id> --yes` so the CLI moves the change and applies spec updates without prompts (use `--skip-specs` only for tooling-only work).
-4. Review the command output to confirm the target specs were updated and the change landed in `changes/archive/`.
-5. Validate with `openspec validate --strict` and inspect with `openspec show <change-id>` if anything looks off.
-
-**Reference**
-- Use `openspec list` to confirm change IDs before archiving.
-- Inspect refreshed specs with `openspec list --specs` and address any validation issues before handing off.
-
diff --git a/.claude/commands/openspec/proposal.md b/.claude/commands/openspec/proposal.md
deleted file mode 100644
index cbb75ce..0000000
--- a/.claude/commands/openspec/proposal.md
+++ /dev/null
@@ -1,28 +0,0 @@
----
-name: OpenSpec: Proposal
-description: Scaffold a new OpenSpec change and validate strictly.
-category: OpenSpec
-tags: [openspec, change]
----
-
-**Guardrails**
-- Favor straightforward, minimal implementations first and add complexity only when it is requested or clearly required.
-- Keep changes tightly scoped to the requested outcome.
-- Refer to `openspec/AGENTS.md` (located inside the `openspec/` directory—run `ls openspec` or `openspec update` if you don't see it) if you need additional OpenSpec conventions or clarifications.
-- Identify any vague or ambiguous details and ask the necessary follow-up questions before editing files.
-- Do not write any code during the proposal stage. Only create design documents (proposal.md, tasks.md, design.md, and spec deltas). Implementation happens in the apply stage after approval.
-
-**Steps**
-1. Review `openspec/project.md`, run `openspec list` and `openspec list --specs`, and inspect related code or docs (e.g., via `rg`/`ls`) to ground the proposal in current behaviour; note any gaps that require clarification.
-2. Choose a unique verb-led `change-id` and scaffold `proposal.md`, `tasks.md`, and `design.md` (when needed) under `openspec/changes/<change-id>/`.
-3. Map the change into concrete capabilities or requirements, breaking multi-scope efforts into distinct spec deltas with clear relationships and sequencing.
-4. Capture architectural reasoning in `design.md` when the solution spans multiple systems, introduces new patterns, or demands trade-off discussion before committing to specs.
-5. Draft spec deltas in `changes/<change-id>/specs/<capability>/spec.md` (one folder per capability) using `## ADDED|MODIFIED|REMOVED Requirements` with at least one `#### Scenario:` per requirement and cross-reference related capabilities when relevant.
-6. Draft `tasks.md` as an ordered list of small, verifiable work items that deliver user-visible progress, include validation (tests, tooling), and highlight dependencies or parallelizable work.
-7. Validate with `openspec validate <change-id> --strict` and resolve every issue before sharing the proposal.
-
-**Reference**
-- Use `openspec show <change-id> --json --deltas-only` or `openspec show <spec-id> --type spec` to inspect details when validation fails.
-- Search existing requirements with `rg -n "Requirement:|Scenario:" openspec/specs` before writing new ones.
-- Explore the codebase with `rg <pattern>`, `ls`, or direct file reads so proposals align with current implementation realities.
-
diff --git a/.env b/.env
index 398acb0..deafdfb 100644
--- a/.env
+++ b/.env
@@ -1,82 +1,49 @@
-# Tool_OCR - Docker Environment Configuration
-# Copy this file to .env when deploying with Docker
+# Tool_OCR - Production/Docker Environment Configuration
+# For local development, copy .env.example to .env.local and configure there
+#
+# This file is for Docker deployment or production use.
+# Sensitive values should be set via environment variables or secrets management.
 
 # ===== Database Configuration =====
-MYSQL_HOST=mysql.theaken.com
-MYSQL_PORT=33306
-MYSQL_USER=A060
-MYSQL_PASSWORD=WLeSCi0yhtc7
-MYSQL_DATABASE=db_A060
+# Set these via Docker secrets or environment variables in production
+MYSQL_HOST=${MYSQL_HOST:-localhost}
+MYSQL_PORT=${MYSQL_PORT:-3306}
+MYSQL_USER=${MYSQL_USER:-}
+MYSQL_PASSWORD=${MYSQL_PASSWORD:-}
+MYSQL_DATABASE=${MYSQL_DATABASE:-}
 
 # ===== Application Configuration =====
-# External port (exposed to host)
+# Production port (different from development)
 FRONTEND_PORT=12010
+BACKEND_PORT=8000
 
-# Security (IMPORTANT: Change SECRET_KEY in production!)
-SECRET_KEY=your-secret-key-here-please-change-this-to-random-string
+# Security - MUST be set via environment variable in production
+SECRET_KEY=${SECRET_KEY:-change-this-in-production}
 ALGORITHM=HS256
 ACCESS_TOKEN_EXPIRE_MINUTES=1440
 
+# ===== External Authentication Configuration =====
+EXTERNAL_AUTH_API_URL=${EXTERNAL_AUTH_API_URL:-https://your-auth-api.example.com}
+EXTERNAL_AUTH_ENDPOINT=/api/auth/login
+EXTERNAL_AUTH_TIMEOUT=30
+
 # ===== OCR Configuration =====
-# PaddleOCR model directory (inside container)
-PADDLEOCR_MODEL_DIR=/app/backend/models/paddleocr
-# Supported languages (comma-separated)
 OCR_LANGUAGES=ch,en,japan,korean
-# Default confidence threshold
 OCR_CONFIDENCE_THRESHOLD=0.5
-# Maximum concurrent OCR workers
 MAX_OCR_WORKERS=4
 
-# ===== File Upload Configuration =====
-# Maximum file size in bytes (50MB default)
+# ===== File Configuration =====
 MAX_UPLOAD_SIZE=52428800
-# Allowed file extensions (comma-separated)
 ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx
-# Upload directories (inside container)
-UPLOAD_DIR=/app/backend/uploads
-TEMP_DIR=/app/backend/uploads/temp
-PROCESSED_DIR=/app/backend/uploads/processed
-IMAGES_DIR=/app/backend/uploads/images
 
-# ===== Export Configuration =====
-# Storage directories (inside container)
-STORAGE_DIR=/app/backend/storage
-MARKDOWN_DIR=/app/backend/storage/markdown
-JSON_DIR=/app/backend/storage/json
-EXPORTS_DIR=/app/backend/storage/exports
-
-# ===== PDF Generation Configuration =====
-# Pandoc path (inside container)
-PANDOC_PATH=/usr/bin/pandoc
-# Font directory (inside container)
-FONT_DIR=/usr/share/fonts
-# Default PDF page size
-PDF_PAGE_SIZE=A4
-# Default PDF margins (mm)
-PDF_MARGIN_TOP=20
-PDF_MARGIN_BOTTOM=20
-PDF_MARGIN_LEFT=20
-PDF_MARGIN_RIGHT=20
-
-# ===== Translation Configuration (Reserved) =====
-# Enable translation feature (reserved for future)
-ENABLE_TRANSLATION=false
-# Translation engine: offline (argostranslate) or api (future)
-TRANSLATION_ENGINE=offline
-# Argostranslate models directory (inside container)
-ARGOSTRANSLATE_MODELS_DIR=/app/backend/models/argostranslate
-
-# ===== Background Tasks Configuration =====
-# Task queue type: memory (default) or redis (future)
-TASK_QUEUE_TYPE=memory
-# Redis URL (if using redis)
-# REDIS_URL=redis://localhost:6379/0
+# ===== Translation Configuration (DIFY API) =====
+ENABLE_TRANSLATION=${ENABLE_TRANSLATION:-false}
+DIFY_BASE_URL=${DIFY_BASE_URL:-}
+DIFY_API_KEY=${DIFY_API_KEY:-}
+DIFY_TIMEOUT=120.0
 
 # ===== CORS Configuration =====
-# Allowed origins (comma-separated, * for all)
-# For Docker, use the external URL
 CORS_ORIGINS=http://localhost:12010,http://127.0.0.1:12010
 
 # ===== Logging Configuration =====
 LOG_LEVEL=INFO
-LOG_FILE=/app/backend/logs/app.log

diff --git a/.env.example b/.env.example
index 7786765..b198128 100644
--- a/.env.example
+++ b/.env.example
@@ -1,7 +1,10 @@
 # Tool_OCR - Environment Configuration Template
 # Copy this file to .env.local and fill in your actual values
+#
+# Note: Most path configurations have sensible defaults in config.py
+# Only override if you need custom paths
 
-# ===== Database Configuration =====
+# ===== Database Configuration (Required) =====
 MYSQL_HOST=your-mysql-host
 MYSQL_PORT=3306
 MYSQL_USER=your-username
@@ -15,12 +18,12 @@ BACKEND_PORT=8000
 FRONTEND_HOST=0.0.0.0
 FRONTEND_PORT=5173
 
-# Security (generate a random string for production)
+# Security (generate a random string for production: openssl rand -hex 32)
 SECRET_KEY=your-secret-key-here-please-change-this-to-random-string
 ALGORITHM=HS256
 ACCESS_TOKEN_EXPIRE_MINUTES=1440
 
-# ===== External Authentication Configuration =====
+# ===== External Authentication Configuration (Required) =====
 EXTERNAL_AUTH_API_URL=https://your-auth-api.example.com
 EXTERNAL_AUTH_ENDPOINT=/api/auth/login
 EXTERNAL_AUTH_TIMEOUT=30
@@ -46,61 +49,21 @@ GPU_DEVICE_ID=0
 # ===== File Upload Configuration =====
 MAX_UPLOAD_SIZE=52428800
 ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx
-UPLOAD_DIR=./uploads
-TEMP_DIR=./uploads/temp
-PROCESSED_DIR=./uploads/processed
-IMAGES_DIR=./uploads/images
-
-# ===== Export Configuration =====
-STORAGE_DIR=./storage
-MARKDOWN_DIR=./storage/markdown
-JSON_DIR=./storage/json
-EXPORTS_DIR=./storage/exports
-
-# ===== PDF Generation Configuration =====
-# Linux: /usr/bin/pandoc, macOS: /opt/homebrew/bin/pandoc
-PANDOC_PATH=/usr/bin/pandoc
-# Linux: /usr/share/fonts, macOS: /System/Library/Fonts
-FONT_DIR=/usr/share/fonts
-PDF_PAGE_SIZE=A4
-PDF_MARGIN_TOP=20
-PDF_MARGIN_BOTTOM=20
-PDF_MARGIN_LEFT=20
-PDF_MARGIN_RIGHT=20
+# Path defaults to backend/uploads - only override if needed
+# UPLOAD_DIR=./uploads
 
 # ===== Translation Configuration (DIFY API) =====
-# Enable translation feature
 ENABLE_TRANSLATION=true
-# DIFY API base URL
 DIFY_BASE_URL=https://your-dify-instance.example.com/v1
-# DIFY API key (get from DIFY dashboard)
 DIFY_API_KEY=your-dify-api-key
-# API request timeout in seconds
 DIFY_TIMEOUT=120.0
-# Maximum retry attempts
 DIFY_MAX_RETRIES=3
-# Batch translation limits
 DIFY_MAX_BATCH_CHARS=5000
 DIFY_MAX_BATCH_ITEMS=20
 
-# ===== Background Tasks Configuration =====
-TASK_QUEUE_TYPE=memory
-# REDIS_URL=redis://localhost:6379/0
-
 # ===== CORS Configuration =====
 CORS_ORIGINS=http://localhost:5173,http://127.0.0.1:5173
 
 # ===== Logging Configuration =====
 LOG_LEVEL=INFO
-LOG_FILE=./logs/app.log
-
-# ===== Development & Testing Configuration =====
-# Debug font path for visualization scripts
-DEBUG_FONT_PATH=/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf
-# Demo documents directory for testing
-DEMO_DOCS_DIR=./demo_docs
-# E2E test API base URL
-E2E_API_BASE_URL=http://localhost:8000/api/v2
-# E2E test credentials (set in .env.local for security)
-# E2E_TEST_USER_EMAIL=test@example.com
-# E2E_TEST_USER_PASSWORD=testpassword
+# LOG_FILE defaults to backend/logs/app.log

diff --git a/AGENTS.md b/AGENTS.md
deleted file mode 100644
index 0669699..0000000
--- a/AGENTS.md
+++ /dev/null
@@ -1,18 +0,0 @@
-
-# OpenSpec Instructions
-
-These instructions are for AI assistants working in this project.
-
-Always open `@/openspec/AGENTS.md` when the request:
-- Mentions planning or proposals (words like proposal, spec, change, plan)
-- Introduces new capabilities, breaking changes, architecture shifts, or big performance/security work
-- Sounds ambiguous and you need the authoritative spec before coding
-
-Use `@/openspec/AGENTS.md` to learn:
-- How to create and apply change proposals
-- Spec format and conventions
-- Project structure and guidelines
-
-Keep this managed block so 'openspec update' can refresh the instructions.
-
-
\ No newline at end of file
diff --git a/CLAUDE.md b/CLAUDE.md
deleted file mode 100644
index 0669699..0000000
--- a/CLAUDE.md
+++ /dev/null
@@ -1,18 +0,0 @@
-
-# OpenSpec Instructions
-
-These instructions are for AI assistants working in this project.
-
-Always open `@/openspec/AGENTS.md` when the request:
-- Mentions planning or proposals (words like proposal, spec, change, plan)
-- Introduces new capabilities, breaking changes, architecture shifts, or big performance/security work
-- Sounds ambiguous and you need the authoritative spec before coding
-
-Use `@/openspec/AGENTS.md` to learn:
-- How to create and apply change proposals
-- Spec format and conventions
-- Project structure and guidelines
-
-Keep this managed block so 'openspec update' can refresh the instructions.
-
-
\ No newline at end of file
diff --git a/openspec/AGENTS.md b/openspec/AGENTS.md
deleted file mode 100644
index 96ab0bb..0000000
--- a/openspec/AGENTS.md
+++ /dev/null
@@ -1,456 +0,0 @@
-# OpenSpec Instructions
-
-Instructions for AI coding assistants using OpenSpec for spec-driven development.
-
-## TL;DR Quick Checklist
-
-- Search existing work: `openspec spec list --long`, `openspec list` (use `rg` only for full-text search)
-- Decide scope: new capability vs modify existing capability
-- Pick a unique `change-id`: kebab-case, verb-led (`add-`, `update-`, `remove-`, `refactor-`)
-- Scaffold: `proposal.md`, `tasks.md`, `design.md` (only if needed), and delta specs per affected capability
-- Write deltas: use `## ADDED|MODIFIED|REMOVED|RENAMED Requirements`; include at least one `#### Scenario:` per requirement
-- Validate: `openspec validate [change-id] --strict` and fix issues
-- Request approval: Do not start implementation until proposal is approved
-
-## Three-Stage Workflow
-
-### Stage 1: Creating Changes
-Create proposal when you need to:
-- Add features or functionality
-- Make breaking changes (API, schema)
-- Change architecture or patterns
-- Optimize performance (changes behavior)
-- Update security patterns
-
-Triggers (examples):
-- "Help me create a change proposal"
-- "Help me plan a change"
-- "Help me create a proposal"
-- "I want to create a spec proposal"
-- "I want to create a spec"
-
-Loose matching guidance:
-- Contains one of: `proposal`, `change`, `spec`
-- With one of: `create`, `plan`, `make`, `start`, `help`
-
-Skip proposal for:
-- Bug fixes (restore intended behavior)
-- Typos, formatting, comments
-- Dependency updates (non-breaking)
-- Configuration changes
-- Tests for existing behavior
-
-**Workflow**
-1. Review `openspec/project.md`, `openspec list`, and `openspec list --specs` to understand current context.
-2. Choose a unique verb-led `change-id` and scaffold `proposal.md`, `tasks.md`, optional `design.md`, and spec deltas under `openspec/changes/<change-id>/`.
-3. Draft spec deltas using `## ADDED|MODIFIED|REMOVED Requirements` with at least one `#### Scenario:` per requirement.
-4. Run `openspec validate <change-id> --strict` and resolve any issues before sharing the proposal.
-
-### Stage 2: Implementing Changes
-Track these steps as TODOs and complete them one by one.
-1. **Read proposal.md** - Understand what's being built
-2. **Read design.md** (if exists) - Review technical decisions
-3. **Read tasks.md** - Get implementation checklist
-4. **Implement tasks sequentially** - Complete in order
-5. **Confirm completion** - Ensure every item in `tasks.md` is finished before updating statuses
-6. **Update checklist** - After all work is done, set every task to `- [x]` so the list reflects reality
-7. **Approval gate** - Do not start implementation until the proposal is reviewed and approved
-
-### Stage 3: Archiving Changes
-After deployment, create separate PR to:
-- Move `changes/[name]/` → `changes/archive/YYYY-MM-DD-[name]/`
-- Update `specs/` if capabilities changed
-- Use `openspec archive <change-id> --skip-specs --yes` for tooling-only changes (always pass the change ID explicitly)
-- Run `openspec validate --strict` to confirm the archived change passes checks
-
-## Before Any Task
-
-**Context Checklist:**
-- [ ] Read relevant specs in `specs/[capability]/spec.md`
-- [ ] Check pending changes in `changes/` for conflicts
-- [ ] Read `openspec/project.md` for conventions
-- [ ] Run `openspec list` to see active changes
-- [ ] Run `openspec list --specs` to see existing capabilities
-
-**Before Creating Specs:**
-- Always check if capability already exists
-- Prefer modifying existing specs over creating duplicates
-- Use `openspec show [spec]` to review current state
-- If request is ambiguous, ask 1–2 clarifying questions before scaffolding
-
-### Search Guidance
-- Enumerate specs: `openspec spec list --long` (or `--json` for scripts)
-- Enumerate changes: `openspec list` (or `openspec change list --json` - deprecated but available)
-- Show details:
-  - Spec: `openspec show <spec-id> --type spec` (use `--json` for filters)
-  - Change: `openspec show <change-id> --json --deltas-only`
-- Full-text search (use ripgrep): `rg -n "Requirement:|Scenario:" openspec/specs`
-
-## Quick Start
-
-### CLI Commands
-
-```bash
-# Essential commands
-openspec list                # List active changes
-openspec list --specs        # List specifications
-openspec show [item]         # Display change or spec
-openspec validate [item]     # Validate changes or specs
-openspec archive [--yes|-y]  # Archive after deployment (add --yes for non-interactive runs)
-
-# Project management
-openspec init [path]         # Initialize OpenSpec
-openspec update [path]       # Update instruction files
-
-# Interactive mode
-openspec show                # Prompts for selection
-openspec validate            # Bulk validation mode
-
-# Debugging
-openspec show [change] --json --deltas-only
-openspec validate [change] --strict
-```
-
-### Command Flags
-
-- `--json` - Machine-readable output
-- `--type change|spec` - Disambiguate items
-- `--strict` - Comprehensive validation
-- `--no-interactive` - Disable prompts
-- `--skip-specs` - Archive without spec updates
-- `--yes`/`-y` - Skip confirmation prompts (non-interactive archive)
-
-## Directory Structure
-
-```
-openspec/
-├── project.md              # Project conventions
-├── specs/                  # Current truth - what IS built
-│   └── [capability]/       # Single focused capability
-│       ├── spec.md         # Requirements and scenarios
-│       └── design.md       # Technical patterns
-├── changes/                # Proposals - what SHOULD change
-│   ├── [change-name]/
-│   │   ├── proposal.md     # Why, what, impact
-│   │   ├── tasks.md        # Implementation checklist
-│   │   ├── design.md       # Technical decisions (optional; see criteria)
-│   │   └── specs/          # Delta changes
-│   │       └── [capability]/
-│   │           └── spec.md # ADDED/MODIFIED/REMOVED
-│   └── archive/            # Completed changes
-```
-
-## Creating Change Proposals
-
-### Decision Tree
-
-```
-New request?
-├─ Bug fix restoring spec behavior? → Fix directly
-├─ Typo/format/comment? → Fix directly
-├─ New feature/capability? → Create proposal
-├─ Breaking change? → Create proposal
-├─ Architecture change? → Create proposal
-└─ Unclear? → Create proposal (safer)
-```
-
-### Proposal Structure
-
-1. **Create directory:** `changes/[change-id]/` (kebab-case, verb-led, unique)
-
-2. **Write proposal.md:**
-```markdown
-# Change: [Brief description of change]
-
-## Why
-[1-2 sentences on problem/opportunity]
-
-## What Changes
-- [Bullet list of changes]
-- [Mark breaking changes with **BREAKING**]
-
-## Impact
-- Affected specs: [list capabilities]
-- Affected code: [key files/systems]
-```
-
-3. **Create spec deltas:** `specs/[capability]/spec.md`
-```markdown
-## ADDED Requirements
-### Requirement: New Feature
-The system SHALL provide...
-
-#### Scenario: Success case
-- **WHEN** user performs action
-- **THEN** expected result
-
-## MODIFIED Requirements
-### Requirement: Existing Feature
-[Complete modified requirement]
-
-## REMOVED Requirements
-### Requirement: Old Feature
-**Reason**: [Why removing]
-**Migration**: [How to handle]
-```
-If multiple capabilities are affected, create multiple delta files under `changes/[change-id]/specs/<capability>/spec.md`—one per capability.
-
-4. **Create tasks.md:**
-```markdown
-## 1. Implementation
-- [ ] 1.1 Create database schema
-- [ ] 1.2 Implement API endpoint
-- [ ] 1.3 Add frontend component
-- [ ] 1.4 Write tests
-```
-
-5. **Create design.md when needed:**
-Create `design.md` if any of the following apply; otherwise omit it:
-- Cross-cutting change (multiple services/modules) or a new architectural pattern
-- New external dependency or significant data model changes
-- Security, performance, or migration complexity
-- Ambiguity that benefits from technical decisions before coding
-
-Minimal `design.md` skeleton:
-```markdown
-## Context
-[Background, constraints, stakeholders]
-
-## Goals / Non-Goals
-- Goals: [...]
-- Non-Goals: [...]
-
-## Decisions
-- Decision: [What and why]
-- Alternatives considered: [Options + rationale]
-
-## Risks / Trade-offs
-- [Risk] → Mitigation
-
-## Migration Plan
-[Steps, rollback]
-
-## Open Questions
-- [...]
-```
-
-## Spec File Format
-
-### Critical: Scenario Formatting
-
-**CORRECT** (use #### headers):
-```markdown
-#### Scenario: User login success
-- **WHEN** valid credentials provided
-- **THEN** return JWT token
-```
-
-**WRONG** (don't use bullets or bold):
-```markdown
-- **Scenario: User login** ❌
-**Scenario**: User login ❌
-### Scenario: User login ❌
-```
-
-Every requirement MUST have at least one scenario.
-
-### Requirement Wording
-- Use SHALL/MUST for normative requirements (avoid should/may unless intentionally non-normative)
-
-### Delta Operations
-
-- `## ADDED Requirements` - New capabilities
-- `## MODIFIED Requirements` - Changed behavior
-- `## REMOVED Requirements` - Deprecated features
-- `## RENAMED Requirements` - Name changes
-
-Headers matched with `trim(header)` - whitespace ignored.
-
-#### When to use ADDED vs MODIFIED
-- ADDED: Introduces a new capability or sub-capability that can stand alone as a requirement. Prefer ADDED when the change is orthogonal (e.g., adding "Slash Command Configuration") rather than altering the semantics of an existing requirement.
-- MODIFIED: Changes the behavior, scope, or acceptance criteria of an existing requirement. Always paste the full, updated requirement content (header + all scenarios). The archiver will replace the entire requirement with what you provide here; partial deltas will drop previous details.
-- RENAMED: Use when only the name changes. If you also change behavior, use RENAMED (name) plus MODIFIED (content) referencing the new name.
-
-Common pitfall: Using MODIFIED to add a new concern without including the previous text. This causes loss of detail at archive time. If you aren’t explicitly changing the existing requirement, add a new requirement under ADDED instead.
-
-Authoring a MODIFIED requirement correctly:
-1) Locate the existing requirement in `openspec/specs/<capability>/spec.md`.
-2) Copy the entire requirement block (from `### Requirement: ...` through its scenarios).
-3) Paste it under `## MODIFIED Requirements` and edit to reflect the new behavior.
-4) Ensure the header text matches exactly (whitespace-insensitive) and keep at least one `#### Scenario:`.
-
-Example for RENAMED:
-```markdown
-## RENAMED Requirements
-- FROM: `### Requirement: Login`
-- TO: `### Requirement: User Authentication`
-```
-
-## Troubleshooting
-
-### Common Errors
-
-**"Change must have at least one delta"**
-- Check `changes/[name]/specs/` exists with .md files
-- Verify files have operation prefixes (## ADDED Requirements)
-
-**"Requirement must have at least one scenario"**
-- Check scenarios use `#### Scenario:` format (4 hashtags)
-- Don't use bullet points or bold for scenario headers
-
-**Silent scenario parsing failures**
-- Exact format required: `#### Scenario: Name`
-- Debug with: `openspec show [change] --json --deltas-only`
-
-### Validation Tips
-
-```bash
-# Always use strict mode for comprehensive checks
-openspec validate [change] --strict
-
-# Debug delta parsing
-openspec show [change] --json | jq '.deltas'
-
-# Check specific requirement
-openspec show [spec] --json -r 1
-```
-
-## Happy Path Script
-
-```bash
-# 1) Explore current state
-openspec spec list --long
-openspec list
-# Optional full-text search:
-# rg -n "Requirement:|Scenario:" openspec/specs
-# rg -n "^#|Requirement:" openspec/changes
-
-# 2) Choose change id and scaffold
-CHANGE=add-two-factor-auth
-mkdir -p openspec/changes/$CHANGE/{specs/auth}
-printf "## Why\n...\n\n## What Changes\n- ...\n\n## Impact\n- ...\n" > openspec/changes/$CHANGE/proposal.md
-printf "## 1. Implementation\n- [ ] 1.1 ...\n" > openspec/changes/$CHANGE/tasks.md
-
-# 3) Add deltas (example)
-cat > openspec/changes/$CHANGE/specs/auth/spec.md << 'EOF'
-## ADDED Requirements
-### Requirement: Two-Factor Authentication
-Users MUST provide a second factor during login.
-
-#### Scenario: OTP required
-- **WHEN** valid credentials are provided
-- **THEN** an OTP challenge is required
-EOF
-
-# 4) Validate
-openspec validate $CHANGE --strict
-```
-
-## Multi-Capability Example
-
-```
-openspec/changes/add-2fa-notify/
-├── proposal.md
-├── tasks.md
-└── specs/
-    ├── auth/
-    │   └── spec.md          # ADDED: Two-Factor Authentication
-    └── notifications/
-        └── spec.md          # ADDED: OTP email notification
-```
-
-auth/spec.md
-```markdown
-## ADDED Requirements
-### Requirement: Two-Factor Authentication
-...
-```
-
-notifications/spec.md
-```markdown
-## ADDED Requirements
-### Requirement: OTP Email Notification
-...
-```
-
-## Best Practices
-
-### Simplicity First
-- Default to <100 lines of new code
-- Single-file implementations until proven insufficient
-- Avoid frameworks without clear justification
-- Choose boring, proven patterns
-
-### Complexity Triggers
-Only add complexity with:
-- Performance data showing current solution too slow
-- Concrete scale requirements (>1000 users, >100MB data)
-- Multiple proven use cases requiring abstraction
-
-### Clear References
-- Use `file.ts:42` format for code locations
-- Reference specs as `specs/auth/spec.md`
-- Link related changes and PRs
-
-### Capability Naming
-- Use verb-noun: `user-auth`, `payment-capture`
-- Single purpose per capability
-- 10-minute understandability rule
-- Split if description needs "AND"
-
-### Change ID Naming
-- Use kebab-case, short and descriptive: `add-two-factor-auth`
-- Prefer verb-led prefixes: `add-`, `update-`, `remove-`, `refactor-`
-- Ensure uniqueness; if taken, append `-2`, `-3`, etc.
-
-## Tool Selection Guide
-
-| Task | Tool | Why |
-|------|------|-----|
-| Find files by pattern | Glob | Fast pattern matching |
-| Search code content | Grep | Optimized regex search |
-| Read specific files | Read | Direct file access |
-| Explore unknown scope | Task | Multi-step investigation |
-
-## Error Recovery
-
-### Change Conflicts
-1. Run `openspec list` to see active changes
-2. Check for overlapping specs
-3. Coordinate with change owners
-4. Consider combining proposals
-
-### Validation Failures
-1. Run with `--strict` flag
-2. Check JSON output for details
-3. Verify spec file format
-4. Ensure scenarios properly formatted
-
-### Missing Context
-1. Read project.md first
-2. Check related specs
-3. Review recent archives
-4. Ask for clarification
-
-## Quick Reference
-
-### Stage Indicators
-- `changes/` - Proposed, not yet built
-- `specs/` - Built and deployed
-- `archive/` - Completed changes
-
-### File Purposes
-- `proposal.md` - Why and what
-- `tasks.md` - Implementation steps
-- `design.md` - Technical decisions
-- `spec.md` - Requirements and behavior
-
-### CLI Essentials
-```bash
-openspec list                # What's in progress?
-openspec show [item]         # View details
-openspec validate --strict   # Is it correct?
-openspec archive [--yes|-y]  # Mark complete (add --yes for automation)
-```
-
-Remember: Specs are truth. Changes are proposals. Keep them in sync.
diff --git a/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/proposal.md b/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/proposal.md
deleted file mode 100644
index 0be756d..0000000
--- a/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/proposal.md
+++ /dev/null
@@ -1,24 +0,0 @@
-# Change: Fix V2 API UI Integration Issues
-
-## Why
-After migrating from V1 batch-based architecture to V2 task-based architecture, several UI pages still reference V1 APIs or have incomplete implementations:
-1. Results page (http://127.0.0.1:5173/results) doesn't display task details - uses non-existent V1 `getBatchStatus` API
-2. Task History page markdown downloads produce empty files (0 bytes) - OCR service not generating markdown content
-3. Task History page "View Details" button navigates to `/tasks/{taskId}` route that doesn't exist
-4. Export page (http://127.0.0.1:5173/export) uses non-existent V1 `/api/v2/export` endpoint (404) and lacks multi-task selection
-5. Admin Dashboard page loads but may have permission or API issues
Admin Dashboard page loads but may have permission or API issues - -These issues were discovered during testing with task ID: `88c6c2d2-37e1-48fd-a50f-406142987bdf` using file `Henkel-84-1LMISR4 (漢高).pdf`. - -## What Changes -- Migrate ResultsPage from V1 batch API to V2 task API -- Fix OCR service markdown generation to produce non-empty .md files -- Add task detail page route and component at `/tasks/:taskId` -- Update ExportPage to use V2 download endpoints and support multi-task selection -- Verify and fix Admin Dashboard API integration and permissions - -## Impact -- Affected specs: task-management, result-export -- Affected code: - - Frontend: `src/pages/ResultsPage.tsx`, `src/pages/ExportPage.tsx`, `src/App.tsx` (routes), new `src/pages/TaskDetailPage.tsx` - - Backend: `app/services/ocr_service.py` (markdown generation), `app/routers/tasks.py` (download endpoints) diff --git a/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/specs/result-export/spec.md b/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/specs/result-export/spec.md deleted file mode 100644 index 6df518a..0000000 --- a/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/specs/result-export/spec.md +++ /dev/null @@ -1,46 +0,0 @@ -# Result Export - Delta Changes - -## ADDED Requirements - -### Requirement: Export Interface -The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs. - -#### Scenario: Export page uses V2 download endpoints -- **WHEN** user selects a format and clicks export button -- **THEN** frontend SHALL call V2 endpoint `/api/v2/tasks/{task_id}/download/{format}` -- **AND** frontend SHALL NOT call V1 `/api/v2/export` endpoint (which returns 404) -- **AND** file SHALL download successfully - -#### Scenario: Export supports multiple formats -- **WHEN** user exports a completed task -- **THEN** system SHALL support downloading as TXT, JSON, Excel, Markdown, and PDF -- **AND** each format SHALL use correct V2 download endpoint -- **AND** downloaded files SHALL contain task OCR results - -### Requirement: Multi-Task Export Selection -The Export page SHALL allow users to select and export multiple tasks. - -#### Scenario: Select multiple tasks for export -- **WHEN** Export page loads -- **THEN** page SHALL display list of user's completed tasks -- **AND** page SHALL provide checkboxes to select multiple tasks -- **AND** page SHALL NOT require batch ID from upload store (legacy V1 behavior) - -#### Scenario: Export selected tasks -- **WHEN** user selects multiple tasks and clicks export -- **THEN** system SHALL download each selected task's results in chosen format -- **AND** downloaded files SHALL be named distinctly (e.g., `{task_id}_result.{ext}`) -- **AND** system MAY provide option to download as ZIP archive for multiple files - -### Requirement: Export Configuration Persistence -Export settings (format, thresholds, templates) SHALL apply consistently to V2 task downloads. 
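Before the scenarios, a minimal sketch of how the threshold filter they describe might look server-side. This is an illustration only: the `regions`/`confidence` field names are assumptions, not the project's actual result schema.

```python
from typing import Any


def filter_by_confidence(result: dict[str, Any], threshold: float = 0.7) -> dict[str, Any]:
    """Drop OCR regions whose confidence falls below the export threshold."""
    kept = [r for r in result.get("regions", []) if r.get("confidence", 0.0) >= threshold]
    return {**result, "regions": kept}
```

Per the first scenario below, the threshold would arrive as a query parameter on the V2 download endpoint and be applied before the file is rendered.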
- -#### Scenario: Apply confidence threshold to export -- **WHEN** user sets confidence threshold to 0.7 and exports -- **THEN** downloaded results SHALL only include OCR text with confidence >= 0.7 -- **AND** threshold SHALL apply via V2 download endpoint query parameters - -#### Scenario: Apply CSS template to PDF export -- **WHEN** user selects CSS template for PDF format -- **THEN** downloaded PDF SHALL use selected styling -- **AND** template SHALL be passed to V2 `/tasks/{id}/download/pdf` endpoint diff --git a/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/specs/task-management/spec.md b/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/specs/task-management/spec.md deleted file mode 100644 index 7dffa3d..0000000 --- a/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/specs/task-management/spec.md +++ /dev/null @@ -1,51 +0,0 @@ -# Task Management - Delta Changes - -## ADDED Requirements - -### Requirement: Task Result Generation -The OCR service SHALL generate both JSON and Markdown result files for completed tasks with actual content. - -#### Scenario: Markdown file contains OCR results -- **WHEN** a task completes OCR processing successfully -- **THEN** the generated `.md` file SHALL contain the extracted text in markdown format -- **AND** the file size SHALL be greater than 0 bytes -- **AND** the markdown SHALL include headings, paragraphs, and formatting based on OCR layout detection - -#### Scenario: Result files stored in task directory -- **WHEN** OCR processing completes for task ID `88c6c2d2-37e1-48fd-a50f-406142987bdf` -- **THEN** result files SHALL be stored in `storage/results/88c6c2d2-37e1-48fd-a50f-406142987bdf/` -- **AND** both `_result.json` and `_result.md` SHALL exist -- **AND** both files SHALL contain valid OCR output data - -### Requirement: Task Detail View -The frontend SHALL provide a dedicated page for viewing individual task details. - -#### Scenario: Navigate to task detail page -- **WHEN** user clicks "View Details" button on task in Task History page -- **THEN** browser SHALL navigate to `/tasks/{task_id}` -- **AND** TaskDetailPage component SHALL render - -#### Scenario: Display task information -- **WHEN** TaskDetailPage loads for a valid task ID -- **THEN** page SHALL display task metadata (filename, status, processing time, confidence) -- **AND** page SHALL show markdown preview of OCR results -- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats - -#### Scenario: Download from task detail page -- **WHEN** user clicks download button for a specific format -- **THEN** browser SHALL download the file using `/api/v2/tasks/{task_id}/download/{format}` endpoint -- **AND** downloaded file SHALL contain the task's OCR results in requested format - -### Requirement: Results Page V2 Migration -The Results page SHALL use V2 task-based APIs instead of V1 batch APIs. 
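(Returning briefly to the Task Result Generation requirement above: the 0-byte symptom suggests the markdown writer needs an explicit guard. A minimal sketch under assumed names (`lines` for the extracted text blocks, `result_dir` for the task's storage directory); this is not the service's actual code.)

```python
from pathlib import Path


def write_markdown(result_dir: Path, task_id: str, lines: list[str]) -> Path:
    """Write {task_id}_result.md, refusing to emit an empty file."""
    body = "\n\n".join(line for line in lines if line.strip())
    if not body:
        raise ValueError(f"no OCR text extracted for task {task_id}")
    out = result_dir / f"{task_id}_result.md"
    out.write_text(body, encoding="utf-8")
    return out
```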
- -#### Scenario: Load task results instead of batch -- **WHEN** Results page loads with a task ID in upload store -- **THEN** page SHALL call `apiClientV2.getTask(taskId)` to fetch task details -- **AND** page SHALL NOT call any V1 batch status endpoints -- **AND** task information SHALL display correctly - -#### Scenario: Handle missing task gracefully -- **WHEN** Results page loads without a task ID -- **THEN** page SHALL display helpful message directing user to upload page -- **AND** page SHALL provide button to navigate to `/upload` diff --git a/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/tasks.md b/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/tasks.md deleted file mode 100644 index ec75548..0000000 --- a/openspec/changes/archive/2025-11-17-fix-v2-api-ui-issues/tasks.md +++ /dev/null @@ -1,40 +0,0 @@ -# Implementation Tasks - -## 1. Backend Fixes -- [x] 1.1 Fix markdown generation in OCR service to produce non-empty content -- [x] 1.2 Verify download endpoints (/tasks/{id}/download/json, markdown, pdf) work correctly -- [x] 1.3 Verify admin API endpoints (/admin/stats, /admin/users, /admin/users/top) exist and work -- [x] 1.4 Test markdown file generation with sample task - -## 2. Frontend - Results Page Migration -- [x] 2.1 Remove V1 API imports from ResultsPage.tsx -- [x] 2.2 Replace `getBatchStatus(batchId)` with V2 task API calls -- [x] 2.3 Update component to work with task data structure instead of batch -- [x] 2.4 Test Results page displays task information correctly - -## 3. Frontend - Task Detail Page -- [x] 3.1 Create TaskDetailPage.tsx component -- [x] 3.2 Add route `/tasks/:taskId` in App.tsx -- [x] 3.3 Implement task detail view with markdown preview -- [x] 3.4 Add download buttons for JSON, Markdown, PDF -- [x] 3.5 Test navigation from Task History page - -## 4. Frontend - Export Page Refactor -- [x] 4.1 Replace V1 `apiClient.exportResults` with V2 download endpoints -- [x] 4.2 Add task selection UI (replace single batch ID input) -- [x] 4.3 Implement multi-task download functionality -- [x] 4.4 Update export button handlers to use V2 APIs -- [x] 4.5 Test all export formats (TXT, JSON, Excel, Markdown, PDF) - -## 5. Admin Dashboard Verification -- [x] 5.1 Test admin page with admin user credentials -- [x] 5.2 Verify API calls return data successfully -- [x] 5.3 Check permission requirements in ProtectedRoute component -- [x] 5.4 Fix any permission or API issues discovered - -## 6. Testing -- [ ] 6.1 Test complete workflow: Upload → Process → View Results → Download -- [ ] 6.2 Verify markdown files contain actual OCR content -- [ ] 6.3 Test task detail navigation and display -- [ ] 6.4 Test multi-format exports from Export page -- [ ] 6.5 Test Admin Dashboard with admin account diff --git a/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/proposal.md b/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/proposal.md deleted file mode 100644 index cd8bc10..0000000 --- a/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/proposal.md +++ /dev/null @@ -1,84 +0,0 @@ -# Change: Add GPU Acceleration Support for OCR Processing - -## Why -PaddleOCR supports CUDA GPU acceleration which can significantly improve OCR processing speed for batch operations. Currently, the system always uses CPU processing, which is slower and less efficient for large document batches. 
By adding GPU detection and automatic CUDA support, the system will: -- Automatically utilize available GPU hardware when present -- Fall back gracefully to CPU processing when GPU is unavailable -- Reduce processing time for large batches by leveraging parallel GPU computation -- Improve overall system throughput and user experience - -## What Changes -- Add GPU detection logic to environment setup script (`setup_dev_env.sh`) -- Automatically install CUDA-enabled PaddlePaddle when compatible GPU is detected -- Install CPU-only PaddlePaddle when no compatible GPU is found -- Add GPU availability detection in OCR processing code -- Automatically enable GPU acceleration in PaddleOCR when GPU is available -- Add configuration option to force CPU mode (for testing or troubleshooting) -- Add GPU status reporting in API health check endpoint -- Update documentation with GPU requirements and setup instructions - -## Impact -- **Affected capabilities**: - - `ocr-processing`: Add GPU acceleration support with automatic detection - - `environment-setup`: Add GPU detection and CUDA installation logic - -- **Affected code**: - - `setup_dev_env.sh`: GPU detection and conditional CUDA package installation - - `backend/app/services/ocr_service.py`: GPU availability detection and configuration - - `backend/app/api/v1/endpoints/health.py`: GPU status reporting - - `backend/app/core/config.py`: GPU configuration settings - - `.env.local`: GPU-related environment variables - -- **Dependencies**: - - When GPU available: `paddlepaddle-gpu` (with matching CUDA version) - - When GPU unavailable: `paddlepaddle` (CPU-only, current default) - - Detection tools: `nvidia-smi` (NVIDIA GPUs), `lspci` (hardware detection) - -- **Configuration**: - - New env var: `FORCE_CPU_MODE` (default: false) - Override GPU detection - - New env var: `CUDA_VERSION` (auto-detected or manual override) - - GPU memory allocation settings for PaddleOCR - - Batch size adjustment based on GPU memory availability - -- **Performance Impact**: - - Expected 3-10x speedup for OCR processing on GPU-enabled systems - - No performance degradation on CPU-only systems (same as current behavior) - - Automatic memory management to prevent GPU OOM errors - -- **Backward Compatibility**: - - Fully backward compatible - existing CPU-only installations continue to work - - No breaking changes to API or configuration - - Existing installations can opt-in by re-running setup script on GPU-enabled hardware - -## Known Issues and Limitations - -### ~~Chart Recognition Feature Disabled~~ ✅ **RESOLVED** (2025-11-16) - -**Previous Issue**: Chart recognition feature in PP-StructureV3 was disabled due to API incompatibility with PaddlePaddle 3.0.0. 
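A quick way to check which side of this issue a given installation sits on (a hypothetical probe, not project code; it assumes only that the module path quoted below is importable):

```python
import paddle.incubate.nn.functional as F

# True on PaddlePaddle 3.2.1+, False on the 3.0.0 builds this issue describes
print(hasattr(F, "fused_rms_norm_ext"))
```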
- -**Resolution**: -- **Fixed in**: PaddlePaddle 3.2.1 (released 2025-10-30) -- **Current Status**: ✅ Chart recognition **FULLY ENABLED** -- **API Status**: `paddle.incubate.nn.functional.fused_rms_norm_ext` now available -- **Documentation**: See [CHART_RECOGNITION.md](../../../CHART_RECOGNITION.md) for details - -**Root Cause** (Historical): -- PaddleOCR-VL chart recognition model requires `paddle.incubate.nn.functional.fused_rms_norm_ext` API -- PaddlePaddle 3.0.0 stable only provided `fused_rms_norm` (base version) -- The extended version `fused_rms_norm_ext` was not available in 3.0.0 - -**Current Capabilities** (✅ All Enabled): -- ✅ Layout analysis detects and extracts chart/figure regions as images -- ✅ Tables, formulas, and text recognition function normally -- ✅ **Deep chart understanding** (chart type detection, data extraction, axis/legend parsing) -- ✅ **Converting chart content to structured data** (JSON, tables) - -**Actions Taken**: -- Upgraded system to PaddlePaddle 3.2.1+ -- Enabled chart recognition in PP-StructureV3 initialization -- Configured WSL CUDA library paths for GPU support -- Updated all documentation to reflect enabled status - -**Code Location**: [backend/app/services/ocr_service.py:217](../../backend/app/services/ocr_service.py#L217) - -**Status**: ✅ **RESOLVED** - Chart recognition fully operational diff --git a/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/specs/environment-setup/spec.md b/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/specs/environment-setup/spec.md deleted file mode 100644 index 150b19b..0000000 --- a/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/specs/environment-setup/spec.md +++ /dev/null @@ -1,77 +0,0 @@ -# Environment Setup Specification - -## ADDED Requirements - -### Requirement: GPU Detection and CUDA Installation -The system SHALL automatically detect compatible GPU hardware during environment setup and install appropriate PaddlePaddle packages (GPU-enabled or CPU-only) based on hardware availability. 
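A minimal sketch of the detection step these scenarios exercise, assuming only that `nvidia-smi` is on PATH; the real logic lives in `setup_dev_env.sh` and may also consult `lspci`:

```python
import shutil
import subprocess


def detect_gpu_name() -> str | None:
    """Return the first GPU name nvidia-smi reports, or None on CPU-only hosts."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver stack installed
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None  # GPU present but driver unhealthy: fall back to CPU
    names = out.stdout.strip().splitlines()
    return names[0] if names else None
```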
- -#### Scenario: GPU detected with CUDA support -- **WHEN** setup script runs on system with NVIDIA GPU and CUDA drivers -- **THEN** the script detects GPU using `nvidia-smi` command -- **AND** determines CUDA version from driver -- **AND** installs `paddlepaddle-gpu` with matching CUDA version -- **AND** verifies GPU availability through Python -- **AND** displays GPU information (device name, CUDA version, memory) - -#### Scenario: No GPU detected -- **WHEN** setup script runs on system without compatible GPU -- **THEN** the script detects absence of GPU hardware -- **AND** installs CPU-only `paddlepaddle` package -- **AND** displays message that CPU mode will be used -- **AND** continues setup without errors - -#### Scenario: GPU detected but no CUDA drivers -- **WHEN** setup script detects NVIDIA GPU but CUDA drivers are missing -- **THEN** the script displays warning about missing drivers -- **AND** provides installation instructions for CUDA drivers -- **AND** falls back to CPU-only installation -- **AND** suggests re-running setup after driver installation - -#### Scenario: CUDA version mismatch -- **WHEN** detected CUDA version is not compatible with available PaddlePaddle packages -- **THEN** the script displays available CUDA versions -- **AND** installs closest compatible PaddlePaddle GPU package -- **AND** warns user about potential compatibility issues -- **AND** provides instructions to upgrade/downgrade CUDA if needed - -#### Scenario: Manual CUDA version override -- **WHEN** user sets CUDA_VERSION environment variable before running setup -- **THEN** the script uses specified CUDA version instead of auto-detection -- **AND** installs corresponding PaddlePaddle GPU package -- **AND** skips automatic CUDA detection -- **AND** displays warning if specified version differs from detected version - -### Requirement: GPU Verification -The system SHALL verify GPU functionality after installation and provide clear status reporting. - -#### Scenario: Successful GPU setup verification -- **WHEN** PaddlePaddle GPU installation completes -- **THEN** the script runs GPU availability test using Python -- **AND** confirms CUDA devices are accessible -- **AND** displays GPU count, device names, and memory capacity -- **AND** marks GPU setup as successful - -#### Scenario: GPU verification fails -- **WHEN** GPU verification test fails after installation -- **THEN** the script displays detailed error message -- **AND** provides troubleshooting steps -- **AND** suggests fallback to CPU mode -- **AND** does not fail entire setup process - -### Requirement: Environment Configuration for GPU -The system SHALL create appropriate configuration settings for GPU usage in environment files. 
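Illustratively, the two configuration shapes the scenarios below call for; the key names come from this proposal, while the writer function itself is hypothetical:

```python
def render_gpu_env(gpu_found: bool, cuda_version: str = "") -> str:
    """Render the GPU block written into .env.local in either mode."""
    if gpu_found:
        return (
            "# GPU detected and verified during setup\n"
            "FORCE_CPU_MODE=false\n"
            f"CUDA_VERSION={cuda_version}\n"
            "GPU_MEMORY_FRACTION=0.8\n"
        )
    return "# No compatible GPU detected; see README for GPU requirements\nFORCE_CPU_MODE=true\n"
```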
- -#### Scenario: GPU-enabled configuration -- **WHEN** GPU is successfully detected and verified -- **THEN** the setup script adds GPU settings to `.env.local` -- **AND** sets `FORCE_CPU_MODE=false` -- **AND** sets detected `CUDA_VERSION` -- **AND** sets recommended `GPU_MEMORY_FRACTION` (e.g., 0.8) -- **AND** adds GPU-related comments and documentation - -#### Scenario: CPU-only configuration -- **WHEN** no GPU is detected or verification fails -- **THEN** the setup script creates CPU-only configuration -- **AND** sets `FORCE_CPU_MODE=true` -- **AND** omits or comments out GPU-specific settings -- **AND** adds note about GPU requirements diff --git a/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/specs/ocr-processing/spec.md deleted file mode 100644 index 264797c..0000000 --- a/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/specs/ocr-processing/spec.md +++ /dev/null @@ -1,89 +0,0 @@ -# OCR Processing Specification - -## ADDED Requirements - -### Requirement: GPU Acceleration -The system SHALL automatically detect and utilize GPU hardware for OCR processing when available, with graceful fallback to CPU mode when GPU is unavailable or disabled. - -#### Scenario: GPU available and enabled -- **WHEN** PaddleOCR service initializes on system with compatible GPU -- **THEN** the system detects GPU availability using CUDA runtime -- **AND** initializes PaddleOCR with `use_gpu=True` parameter -- **AND** sets appropriate GPU memory fraction to prevent OOM errors -- **AND** logs GPU device information (name, memory, CUDA version) -- **AND** processes OCR tasks using GPU acceleration - -#### Scenario: CPU fallback when GPU unavailable -- **WHEN** PaddleOCR service initializes on system without GPU -- **THEN** the system detects absence of GPU -- **AND** initializes PaddleOCR with `use_gpu=False` parameter -- **AND** logs CPU mode status -- **AND** processes OCR tasks using CPU without errors - -#### Scenario: Force CPU mode override -- **WHEN** FORCE_CPU_MODE environment variable is set to true -- **THEN** the system ignores GPU availability -- **AND** initializes PaddleOCR in CPU mode -- **AND** logs that CPU mode is forced by configuration -- **AND** processes OCR tasks using CPU - -#### Scenario: GPU out-of-memory error handling -- **WHEN** GPU runs out of memory during OCR processing -- **THEN** the system catches CUDA OOM exception -- **AND** logs error with GPU memory information -- **AND** attempts to process the task using CPU mode -- **AND** continues batch processing without failure -- **AND** records GPU failure in task metadata - -#### Scenario: Multiple GPU devices available -- **WHEN** system has multiple CUDA devices -- **THEN** the system detects all available GPUs -- **AND** uses primary GPU (device 0) by default -- **AND** allows GPU device selection via configuration -- **AND** logs selected GPU device information - -### Requirement: GPU Performance Optimization -The system SHALL optimize GPU memory usage and batch processing for efficient OCR performance. 
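As a companion to the GPU Acceleration requirement above, a sketch of the mode selection. One hedge worth noting: PaddleOCR 3.x removed the `use_gpu` argument (see task 3.2 in this change), so in practice the device is set globally via `paddle` rather than passed to the pipeline:

```python
import os

import paddle


def select_device() -> str:
    """Pick gpu:0 when CUDA is usable, unless FORCE_CPU_MODE overrides detection."""
    force_cpu = os.getenv("FORCE_CPU_MODE", "false").lower() == "true"
    has_gpu = paddle.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0
    device = "gpu:0" if has_gpu and not force_cpu else "cpu"
    paddle.set_device(device)
    return device
```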
- -#### Scenario: Automatic batch size adjustment -- **WHEN** GPU mode is enabled -- **THEN** the system queries available GPU memory -- **AND** calculates optimal batch size based on memory capacity -- **AND** adjusts concurrent processing threads accordingly -- **AND** monitors memory usage during processing -- **AND** prevents memory allocation beyond safe threshold - -#### Scenario: GPU memory management -- **WHEN** GPU memory fraction is configured -- **THEN** the system allocates specified fraction of total GPU memory -- **AND** reserves memory for PaddleOCR model -- **AND** prevents other processes from causing OOM -- **AND** releases memory after batch completion - -### Requirement: GPU Status Reporting -The system SHALL provide GPU status information through health check API and logging. - -#### Scenario: Health check with GPU available -- **WHEN** client requests `/health` endpoint on GPU-enabled system -- **THEN** the system returns health status including: - - `gpu_available`: true - - `gpu_device_name`: detected GPU name - - `cuda_version`: CUDA runtime version - - `gpu_memory_total`: total GPU memory in MB - - `gpu_memory_used`: currently used GPU memory in MB - - `gpu_utilization`: current GPU utilization percentage - -#### Scenario: Health check without GPU -- **WHEN** client requests `/health` endpoint on CPU-only system -- **THEN** the system returns health status including: - - `gpu_available`: false - - `processing_mode`: "CPU" - - `reason`: explanation for CPU mode (e.g., "No GPU detected", "CPU mode forced") - -#### Scenario: Startup GPU status logging -- **WHEN** OCR service starts -- **THEN** the system logs GPU detection results -- **AND** logs selected processing mode (GPU/CPU) -- **AND** logs GPU device details if available -- **AND** logs any GPU-related warnings or errors -- **AND** continues startup successfully regardless of GPU status diff --git a/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/tasks.md b/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/tasks.md deleted file mode 100644 index faeddbc..0000000 --- a/openspec/changes/archive/2025-11-18-add-gpu-acceleration-support/tasks.md +++ /dev/null @@ -1,103 +0,0 @@ -# Implementation Tasks - -## 1. Environment Setup Enhancement -- [x] 1.1 Add GPU detection function in `setup_dev_env.sh` - - Detect NVIDIA GPU using `nvidia-smi` or `lspci` - - Detect CUDA version if GPU is available - - Output GPU detection results to user -- [x] 1.2 Add conditional CUDA package installation - - Install `paddlepaddle-gpu` with matching CUDA version when GPU detected - - Install `paddlepaddle` (CPU-only) when no GPU detected - - Handle different CUDA versions (11.x, 12.x, 13.x) -- [x] 1.3 Add GPU verification step after installation - - Test PaddlePaddle GPU availability - - Report GPU status and CUDA version to user - - Provide fallback instructions if GPU setup fails - -## 2. Configuration Updates -- [x] 2.1 Add GPU configuration to `.env.local` - - Add `FORCE_CPU_MODE` option (default: false) - - Add `GPU_DEVICE_ID` for device selection - - Add `GPU_MEMORY_FRACTION` for memory allocation control -- [x] 2.2 Update backend configuration - - Add GPU settings to `backend/app/core/config.py` - - Load GPU-related environment variables - - Add validation for GPU configuration values - -## 3. 
OCR Service GPU Integration -- [x] 3.1 Add GPU detection in OCR service initialization - - Create GPU availability check function - - Detect available GPU devices - - Log GPU status (available/unavailable, device name, memory) -- [x] 3.2 Implement automatic GPU/CPU mode selection - - Enable GPU mode in PaddleOCR when GPU is available - - Fall back to CPU mode when GPU is unavailable or forced - - Use global device setting via `paddle.set_device()` for PaddleOCR 3.x -- [x] 3.3 Add GPU memory management - - Set GPU memory fraction to prevent OOM errors - - Detect GPU memory and compute capability - - Handle GPU memory allocation failures gracefully -- [x] 3.4 Update `backend/app/services/ocr_service.py` - - Modify PaddleOCR initialization for PaddleOCR 3.x API - - Add GPU status logging - - Add error handling for GPU-related issues - -## 4. Health Check and Monitoring -- [x] 4.1 Add GPU status to health check endpoint - - Report GPU availability (true/false) - - Report GPU device name and compute capability - - Report CUDA version - - Report current GPU memory usage -- [x] 4.2 Update `backend/app/main.py` - - Add GPU status fields to health check response - - Handle cases where GPU detection fails - -## 5. Documentation Updates -- [x] 5.1 Update README.md - - Add GPU requirements section - - Document GPU detection and setup process - - Add troubleshooting for GPU issues -- [ ] 5.2 Update openspec/project.md - - Add GPU hardware recommendations - - Document CUDA version compatibility - - Add GPU-specific configuration options -- [ ] 5.3 Create GPU setup guide - - Document NVIDIA driver installation for WSL - - Document CUDA toolkit installation - - Provide GPU verification steps -- [x] 5.4 Document known limitations - - ~~Chart recognition feature disabled (PaddlePaddle 3.0.0 API limitation)~~ **RESOLVED** - - ~~Document `fused_rms_norm_ext` API incompatibility~~ **RESOLVED in PaddlePaddle 3.2.1+** - - Updated README to reflect chart recognition is now enabled - - Created CHART_RECOGNITION.md with detailed status and history - -## 6. Testing -- [ ] 6.1 Test GPU detection on GPU-enabled system - - Verify correct CUDA version detection - - Verify correct PaddlePaddle GPU installation - - Verify OCR processing uses GPU -- [ ] 6.2 Test CPU fallback on non-GPU system - - Verify CPU-only installation - - Verify OCR processing works without GPU - - Verify no errors or warnings about missing GPU -- [ ] 6.3 Test FORCE_CPU_MODE override - - Verify GPU is ignored when FORCE_CPU_MODE=true - - Verify CPU processing works on GPU-enabled system -- [ ] 6.4 Performance benchmarking - - Measure OCR processing time with GPU - - Measure OCR processing time with CPU - - Document performance improvements - -## 7. 
Error Handling and Edge Cases -- [ ] 7.1 Handle GPU out-of-memory errors - - Catch CUDA OOM exceptions - - Automatically fall back to CPU mode - - Log warning message to user -- [ ] 7.2 Handle CUDA version mismatch - - Detect PaddlePaddle/CUDA compatibility issues - - Provide clear error messages - - Suggest correct CUDA version installation -- [ ] 7.3 Handle missing NVIDIA drivers - - Detect when GPU hardware exists but drivers are missing - - Provide installation instructions - - Fall back to CPU mode gracefully diff --git a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/OFFICE_INTEGRATION.md b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/OFFICE_INTEGRATION.md deleted file mode 100644 index 77031c5..0000000 --- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/OFFICE_INTEGRATION.md +++ /dev/null @@ -1,186 +0,0 @@ -# Office Document Support Integration - -**Date**: 2025-11-12 -**Status**: ✅ INTEGRATED & TESTED -**Sub-Proposal**: [add-office-document-support](../add-office-document-support/PROPOSAL.md) - ---- - -## Overview - -This document tracks the integration of Office document support (DOC, DOCX, PPT, PPTX) into the main OCR batch processing system. The integration was completed as a sub-proposal under the OpenSpec framework. - -## Integration Summary - -### Components Integrated - -1. **Office Converter Service** ([backend/app/services/office_converter.py](../../../backend/app/services/office_converter.py)) - - LibreOffice headless mode for Office to PDF conversion - - Support for DOC, DOCX, PPT, PPTX formats - - Automatic cleanup of temporary conversion files - -2. **Document Preprocessor Enhancement** ([backend/app/services/preprocessor.py](../../../backend/app/services/preprocessor.py)) - - Added Office MIME type mappings (application/msword, application/vnd.openxmlformats-officedocument.*) - - ZIP-based integrity validation for modern Office formats - - Office format detection and validation - -3. **OCR Service Integration** ([backend/app/services/ocr_service.py](../../../backend/app/services/ocr_service.py)) - - Office document detection in `process_image()` method - - Automatic conversion pipeline: Office → PDF → Images → OCR - -4. **File Manager Updates** ([backend/app/services/file_manager.py](../../../backend/app/services/file_manager.py)) - - Extended allowed extensions to include Office formats - -5. 
**Configuration Updates** - - `.env`: Added Office formats to ALLOWED_EXTENSIONS - - `app/core/config.py`: Extended default allowed extensions list - -### Processing Pipeline - -``` -Office Document (DOC/DOCX/PPT/PPTX) - ↓ -LibreOffice Headless Conversion - ↓ -PDF Document - ↓ -PDF to Images (existing) - ↓ -PaddleOCR Processing (existing) - ↓ -Markdown/JSON Output (existing) -``` - -## Test Results - -### Test Document -- **File**: test_document.docx (1,521 bytes) -- **Content**: Mixed Chinese/English text with structured formatting -- **Batch ID**: 24 - -### Results -- **Status**: ✅ Completed Successfully -- **Processing Time**: 375.23 seconds (includes PaddleOCR model initialization) -- **OCR Accuracy**: 97.39% confidence -- **Text Regions**: 20 regions detected -- **Language**: Chinese (mixed with English) - -### Verification -- ✅ DOCX upload and validation -- ✅ DOCX → PDF conversion (LibreOffice headless mode) -- ✅ PDF → Images conversion -- ✅ OCR processing (PaddleOCR with PP-LCNet_x1_0_doc_ori structure analysis) -- ✅ Markdown output generation with preserved structure - -### Output Sample -```markdown -Office Document OCR Test - -測試文件說明 - -這是一個用於測試 Tool_OCR 系統 Office 文件支援功能的測試文件。 - -本系統現已支援以下 Office格式: - -• Microsoft Word: DOC, DOCX -• Microsoft PowerPoint: PPT, PPTX - -處理流程 - -Office 文件的處理流程如下: - -1. 使用 LibreOffice 將 Office 文件轉換為 PDF -``` - -## Bugs Fixed During Integration - -1. **Database Column Error**: Fixed return value unpacking order in file_manager.py -2. **Missing Office MIME Types**: Added Office MIME type mappings to preprocessor.py -3. **Missing Integrity Validation**: Added Office format integrity validation -4. **Configuration Loading Issue**: Updated `.env` file with Office formats -5. **API Endpoint Mismatch**: Fixed test script to use correct API paths - -## Dependencies Added - -### System Dependencies (Homebrew) -```bash -brew install libreoffice -``` - -### Configuration -- LibreOffice path: `/Applications/LibreOffice.app/Contents/MacOS/soffice` -- Conversion mode: Headless (`--headless --convert-to pdf`) - -## API Changes - -**No breaking changes**. 
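As an aside before the endpoint list: the conversion step in the pipeline above reduces to a single headless LibreOffice call. A minimal sketch, assuming `soffice` is on PATH (on macOS, pass the app-bundle path noted earlier); the real service also handles cleanup of temporary conversion files:

```python
import subprocess
from pathlib import Path


def office_to_pdf(src: Path, out_dir: Path, soffice: str = "soffice") -> Path:
    """Convert an Office document to PDF via LibreOffice headless mode."""
    subprocess.run(
        [soffice, "--headless", "--convert-to", "pdf", "--outdir", str(out_dir), str(src)],
        check=True, timeout=300,
    )
    return out_dir / f"{src.stem}.pdf"
```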
Existing API endpoints remain unchanged: -- `POST /api/v1/upload` - Now accepts Office formats -- `POST /api/v1/ocr/process` - Automatically handles Office formats -- `GET /api/v1/batch/{batch_id}/status` - Unchanged -- `GET /api/v1/ocr/result/{file_id}` - Unchanged - -## Task Updates - -### Main Proposal: add-ocr-batch-processing - -**Updated Tasks**: -- Task 3: Document Preprocessing - **100% complete** (was 83%) -- Task 3.4: Implement Office document to PDF conversion - **✅ COMPLETED** - -**Updated Services**: -- Document Preprocessor: Now includes Office format support -- OCR Service: Now includes Office document conversion pipeline -- Added: Office Converter service - -**Updated Dependencies**: -- Added LibreOffice to system dependencies - -**Updated Phase 1 Progress**: **~87% complete** (was ~85%) - -## Documentation - -### Sub-Proposal Documentation -- [PROPOSAL.md](../add-office-document-support/PROPOSAL.md) - Feature proposal -- [tasks.md](../add-office-document-support/tasks.md) - Implementation tasks -- [IMPLEMENTATION.md](../add-office-document-support/IMPLEMENTATION.md) - Implementation summary - -### Test Resources -- Test script: [demo_docs/office_tests/test_office_upload.py](../../../demo_docs/office_tests/test_office_upload.py) -- Test document: [demo_docs/office_tests/test_document.docx](../../../demo_docs/office_tests/test_document.docx) -- Document creation: [demo_docs/office_tests/create_docx.py](../../../demo_docs/office_tests/create_docx.py) - -## Performance Impact - -- **First-time processing**: ~375 seconds (includes PaddleOCR model download/initialization) -- **Subsequent processing**: Expected to be faster (~10-30 seconds per document) -- **Memory usage**: No significant increase observed -- **Storage**: LibreOffice adds ~600MB to system requirements - -## Migration Notes - -**Backward Compatibility**: ✅ Fully backward compatible -- Existing image and PDF processing unchanged -- No database schema changes required -- No API contract changes - -**Upgrade Path**: -1. Install LibreOffice via Homebrew: `brew install libreoffice` -2. Update `.env` file with Office formats in ALLOWED_EXTENSIONS -3. Restart backend service -4. Verify with test script: `python demo_docs/office_tests/test_office_upload.py` - -## Next Steps - -Integration complete. The Office document support feature is now part of the main OCR batch processing system and ready for production use. - -### Future Enhancements (Optional) -- Add unit tests for office_converter.py -- Add support for Excel files (XLS, XLSX) -- Optimize LibreOffice conversion performance -- Add preview generation for Office documents - ---- - -**Integration Status**: ✅ COMPLETE -**Test Status**: ✅ PASSED -**Documentation Status**: ✅ COMPLETE diff --git a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/SESSION_SUMMARY.md b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/SESSION_SUMMARY.md deleted file mode 100644 index 5a54822..0000000 --- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/SESSION_SUMMARY.md +++ /dev/null @@ -1,294 +0,0 @@ -# Session Summary - 2025-11-12 - -## Completed Work - -### ✅ Task 10: Backend - Background Tasks (83% Complete - 5/6 tasks) - -This session successfully implemented comprehensive background task infrastructure for the Tool_OCR system. - ---- - -## 📋 What Was Implemented - -### 1. 
Background Tasks Service -**File**: [backend/app/services/background_tasks.py](../../../backend/app/services/background_tasks.py) - -Created `BackgroundTaskManager` class with: -- **Generic retry execution framework** (`execute_with_retry`) -- **File-level retry logic** (`process_single_file_with_retry`) -- **Automatic cleanup scheduler** (`cleanup_expired_files`, `start_cleanup_scheduler`) -- **PDF background generation** (`generate_pdf_background`) -- **Batch processing with retry** (`process_batch_files_with_retry`) - -**Configuration**: -- Max retries: 3 attempts -- Retry delay: 5 seconds -- Cleanup interval: 1 hour -- File retention: 24 hours - -### 2. Database Migration -**File**: [backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py](../../../backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py) - -- Added `retry_count` field to `paddle_ocr_files` table -- Tracks number of retry attempts per file -- Default value: 0 - -### 3. Model Updates -**File**: [backend/app/models/ocr.py](../../../backend/app/models/ocr.py#L76) - -- Added `retry_count` column to `OCRFile` model -- Integrated with retry logic in background tasks - -### 4. Router Updates -**File**: [backend/app/routers/ocr.py](../../../backend/app/routers/ocr.py#L240) - -- Replaced `process_batch_files` with `process_batch_files_with_retry` -- Now uses retry-enabled background processing -- Removed old function, added reference comment - -### 5. Application Lifecycle -**File**: [backend/app/main.py](../../../backend/app/main.py#L42) - -- Added cleanup scheduler to application startup -- Starts automatically as background task -- Graceful shutdown on application stop -- Logs startup/shutdown events - -### 6. Documentation Updates - -**Updated Files**: -- ✅ [openspec/changes/add-ocr-batch-processing/tasks.md](./tasks.md) - Marked Task 10 items as complete -- ✅ [openspec/changes/add-ocr-batch-processing/STATUS.md](./STATUS.md) - Comprehensive status document -- ✅ [SETUP.md](../../../SETUP.md) - Added Background Services section -- ✅ [SESSION_SUMMARY.md](./SESSION_SUMMARY.md) - This file - ---- - -## 🎯 Task 10 Breakdown - -| Task | Description | Status | -|------|-------------|--------| -| 10.1 | Implement FastAPI BackgroundTasks for async OCR processing | ✅ Complete | -| 10.2 | Add task queue system (optional: Redis-based queue) | ⏸️ Optional (not needed) | -| 10.3 | Implement progress updates (polling endpoint) | ✅ Complete | -| 10.4 | Add error handling and retry logic | ✅ Complete | -| 10.5 | Implement cleanup scheduler for expired files | ✅ Complete | -| 10.6 | Add PDF generation to background tasks | ✅ Complete | - -**Overall**: 5/6 tasks complete (83%) - Only optional Redis queue not implemented - ---- - -## 🚀 Features Delivered - -### 1. Automatic Retry Logic -- ✅ Up to 3 retry attempts per file -- ✅ 5-second delay between retries -- ✅ Detailed error messages with retry count -- ✅ Database tracking of retry attempts -- ✅ Configurable retry parameters - -### 2. Cleanup Scheduler -- ✅ Runs every 1 hour automatically -- ✅ Deletes files older than 24 hours -- ✅ Cleans up database records -- ✅ Respects foreign key constraints -- ✅ Logs cleanup activity -- ✅ Configurable retention period - -### 3. Background Task Infrastructure -- ✅ Generic retry execution framework -- ✅ PDF generation with retry logic -- ✅ Proper error handling and logging -- ✅ Graceful startup/shutdown -- ✅ No blocking of main application - -### 4. 
Monitoring & Observability -- ✅ Detailed logging for all background tasks -- ✅ Startup confirmation messages -- ✅ Cleanup activity logs -- ✅ Retry attempt tracking -- ✅ Health check endpoint verification - ---- - -## ✅ Verification - -### Backend Status -```bash -$ curl http://localhost:12010/health -{"status":"healthy","service":"Tool_OCR","version":"0.1.0"} -``` - -### Cleanup Scheduler -```bash -$ grep "cleanup scheduler" /tmp/tool_ocr_startup.log -2025-11-12 01:52:09,359 - app.main - INFO - Started cleanup scheduler for expired files -2025-11-12 01:52:09,359 - app.services.background_tasks - INFO - Starting cleanup scheduler (interval: 3600s, retention: 24h) -``` - -### Translation API (Reserved) -```bash -$ curl http://localhost:12010/api/v1/translate/status -{"available":false,"status":"reserved","message":"Translation feature is reserved for future implementation",...} -``` - ---- - -## 📂 Files Created/Modified - -### Created -1. `backend/app/services/background_tasks.py` (430 lines) - Background task manager -2. `backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py` - Migration -3. `openspec/changes/add-ocr-batch-processing/STATUS.md` - Comprehensive status -4. `openspec/changes/add-ocr-batch-processing/SESSION_SUMMARY.md` - This file - -### Modified -1. `backend/app/models/ocr.py` - Added retry_count field -2. `backend/app/routers/ocr.py` - Updated to use retry-enabled processing -3. `backend/app/main.py` - Added cleanup scheduler startup -4. `openspec/changes/add-ocr-batch-processing/tasks.md` - Updated Task 10 status -5. `SETUP.md` - Added Background Services section - ---- - -## 🎉 Current Project Status - -### Phase 1: Backend Development (~85% Complete) -- ✅ Task 1: Environment Setup (100%) -- ✅ Task 2: Database Schema (100%) -- ✅ Task 3: Document Preprocessing (83%) -- ✅ Task 4: Core OCR Service (70%) -- ✅ Task 5: PDF Generation (89%) -- ✅ Task 6: File Management (86%) -- ✅ Task 7: Export Service (90%) -- ✅ Task 8: API Endpoints (93%) -- ✅ Task 9: Translation Architecture RESERVED (83%) -- ✅ **Task 10: Background Tasks (83%)** ⬅️ **Just Completed** - -### Backend Services Status -- ✅ **Backend API**: Running on http://localhost:12010 -- ✅ **Cleanup Scheduler**: Active (1-hour interval, 24-hour retention) -- ✅ **Retry Logic**: Enabled (3 attempts, 5-second delay) -- ✅ **Health Check**: Passing - ---- - -## 📝 Next Steps (From OpenSpec) - -### Immediate - Complete Phase 1 -According to OpenSpec [tasks.md](./tasks.md), the remaining Phase 1 tasks are: - -1. **Unit Tests** (Multiple tasks) - - Task 3.6: Preprocessor tests - - Task 4.10: OCR service tests - - Task 5.9: PDF generator tests - - Task 6.7: File manager tests - - Task 7.10: Export service tests - - Task 8.14: API integration tests - - Task 9.6: Translation service tests (optional) - -2. **Complete Task 4.8-4.9** (OCR Service) - - Implement batch processing with worker queue - - Add progress tracking for batch jobs - -### Future Phases -- **Phase 2**: Frontend Development (Tasks 11-14) -- **Phase 3**: Testing & Optimization (Tasks 15-16) -- **Phase 4**: Deployment (Tasks 17-18) -- **Phase 5**: Translation Implementation (Task 19) - ---- - -## 🔍 Technical Notes - -### Why No Redis Queue? 
-Task 10.2 was marked as optional because: -- FastAPI BackgroundTasks is sufficient for current scale -- No need for horizontal scaling yet -- Simpler deployment without additional dependencies -- Can be added later if needed - -### Retry Logic Design -The retry system was designed to be: -- **Generic**: `execute_with_retry` works with any function -- **Configurable**: Retry count and delay can be adjusted -- **Transparent**: Logs all retry attempts -- **Persistent**: Tracks retry count in database - -### Cleanup Strategy -The cleanup scheduler: -- Runs on a fixed interval (not cron-based) -- Only cleans completed/failed/partial batches -- Deletes files before database records -- Handles errors gracefully without stopping - ---- - -## 🔧 Configuration Options - -To modify background task behavior, edit [backend/app/services/background_tasks.py](../../../backend/app/services/background_tasks.py): - -```python -# Create custom task manager instance -custom_manager = BackgroundTaskManager( - max_retries=5, # Increase retry attempts - retry_delay=10, # Longer delay between retries - cleanup_interval=7200, # Run cleanup every 2 hours - file_retention_hours=48 # Keep files for 48 hours -) -``` - ---- - -## 📊 Code Statistics - -### Lines of Code Added -- background_tasks.py: **430 lines** -- Migration file: **32 lines** -- STATUS.md: **580 lines** -- SESSION_SUMMARY.md: **280 lines** - -**Total New Code**: ~1,300 lines - -### Files Modified -- 5 existing files updated -- 4 new files created - ---- - -## ✨ Key Achievements - -1. ✅ **Robust Error Handling**: Automatic retry logic ensures transient failures don't lose work -2. ✅ **Automatic Cleanup**: No manual intervention needed for old files -3. ✅ **Scalable Architecture**: Background tasks allow async processing -4. ✅ **Production Ready**: Graceful startup/shutdown, logging, monitoring -5. ✅ **Well Documented**: Comprehensive docs for all new features -6. ✅ **OpenSpec Compliant**: Followed specification exactly - ---- - -## 🎓 Lessons Learned - -1. **Async cleanup scheduler** requires `asyncio.create_task()` in lifespan context -2. **Retry logic** should track attempts in database for debugging -3. **Background tasks** need separate database sessions -4. **Graceful shutdown** requires catching `asyncio.CancelledError` -5. 
**Logging** is critical for monitoring background services - ---- - -## 🔗 Related Documentation - -- **OpenSpec**: [SPEC.md](./SPEC.md) -- **Tasks**: [tasks.md](./tasks.md) -- **Status**: [STATUS.md](./STATUS.md) -- **Setup**: [SETUP.md](../../../SETUP.md) -- **API Docs**: http://localhost:12010/docs - ---- - -**Session Completed**: 2025-11-12 -**Time Invested**: ~1 hour -**Tasks Completed**: Task 10 (5/6 subtasks) -**Next Session**: Begin unit test implementation (Tasks 3.6, 4.10, 5.9, 6.7, 7.10, 8.14) diff --git a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/STATUS.md b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/STATUS.md deleted file mode 100644 index ec6199a..0000000 --- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/STATUS.md +++ /dev/null @@ -1,616 +0,0 @@ -# Tool_OCR Development Status - -**Last Updated**: 2025-11-12 -**Phase**: Phase 2 - Frontend Development (In Progress) -**Current Task**: Frontend API Schema Alignment - Fixed 6 critical API mismatches - ---- - -## 📊 Overall Progress - -### Phase 1: Backend Development (Core OCR + Layout Preservation) -- ✅ Task 1: Environment Setup (100%) -- ✅ Task 2: Database Schema (100%) -- ✅ Task 3: Document Preprocessing (100%) - Office format support integrated -- ✅ Task 4: Core OCR Service (100%) -- ✅ Task 5: PDF Generation (100%) -- ✅ Task 6: File Management (100%) -- ✅ Task 7: Export Service (100%) -- ✅ Task 8: API Endpoints (100% - 14/14 tasks) ⬅️ **Updated: All endpoints aligned with frontend** -- ✅ Task 9: Translation Architecture RESERVED (83% - 5/6 tasks) -- ✅ Task 10: Background Tasks (83% - 5/6 tasks) - -**Phase 1 Status**: ~98% complete - -### Phase 2: Frontend Development (In Progress) -- ✅ Task 11: Frontend Project Structure (100%) -- ✅ Task 12: UI Components (70% - 7/10 tasks) ⬅️ **Updated** -- ✅ Task 13: Pages (100% - 8/8 tasks) ⬅️ **Updated: All pages functional** -- ✅ Task 14: API Integration (100% - 10/10 tasks) ⬅️ **Updated: API schemas aligned** - -**Phase 2 Status**: ~92% complete ⬅️ **Updated: Core functionality working** - -### Remaining Phases -- ⏳ Phase 3: Testing & Documentation (Partially complete - manual testing done) -- ⏳ Phase 4: Deployment (Not started) -- ⏳ Phase 5: Translation Implementation (Reserved for future) - ---- - -## 🎯 Task 10 Implementation Details - -### ✅ Completed (5/6) - -**10.1 FastAPI BackgroundTasks for Async OCR Processing** -- File: [backend/app/services/background_tasks.py](../../../backend/app/services/background_tasks.py) -- Implemented `BackgroundTaskManager` class -- OCR processing runs asynchronously via FastAPI BackgroundTasks -- Router updated: [backend/app/routers/ocr.py:240](../../../backend/app/routers/ocr.py#L240) - -**10.3 Progress Updates** -- Batch progress tracking already implemented in Task 8 -- Properties: `batch.completed_files`, `batch.failed_files`, `batch.progress_percentage` -- Endpoint: `GET /api/v1/batch/{batch_id}/status` - -**10.4 Error Handling with Retry Logic** -- File: [backend/app/services/background_tasks.py:63](../../../backend/app/services/background_tasks.py#L63) -- Implemented `execute_with_retry()` method for generic retry logic -- Implemented `process_single_file_with_retry()` for OCR processing with 3 retry attempts -- Added `retry_count` field to `OCRFile` model -- Migration: [backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py](../../../backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py) -- Configurable retry delay (default: 5 seconds) -- Error messages include retry 
attempt information - -**10.5 Cleanup Scheduler for Expired Files** -- File: [backend/app/services/background_tasks.py:189](../../../backend/app/services/background_tasks.py#L189) -- Implemented `cleanup_expired_files()` method -- Automatic cleanup of files older than 24 hours -- Runs every 1 hour (configurable via `cleanup_interval`) -- Deletes: - - Physical files and directories - - Database records (results, files, batches) -- Respects foreign key constraints -- Started automatically on application startup: [backend/app/main.py:42](../../../backend/app/main.py#L42) -- Gracefully stopped on shutdown - -**10.6 PDF Generation in Background Tasks** -- File: [backend/app/services/background_tasks.py:226](../../../backend/app/services/background_tasks.py#L226) -- Implemented `generate_pdf_background()` method -- PDF generation runs with retry logic (2 retries, 3-second delay) -- Ready to be integrated with export endpoints - -### ⏸️ Optional (1/6) - -**10.2 Redis-based Task Queue** -- Status: Not implemented (marked as optional in OpenSpec) -- Current approach: FastAPI BackgroundTasks (sufficient for current scale) -- Future consideration: Can add Redis queue if needed for horizontal scaling - ---- - -## 🗄️ Database Status - -### Current Schema -All tables use `paddle_ocr_` prefix for namespace isolation in shared database. - -**Tables Created**: -1. `paddle_ocr_users` - User authentication (JWT) -2. `paddle_ocr_batches` - Batch processing metadata -3. `paddle_ocr_files` - Individual file records (now includes `retry_count`) -4. `paddle_ocr_results` - OCR results (Markdown, JSON, images) -5. `paddle_ocr_export_rules` - User-defined export rules -6. `paddle_ocr_translation_configs` - RESERVED for Phase 5 - -**Migrations Applied**: -- ✅ a7802b126240: Initial migration with paddle_ocr prefix -- ✅ 271dc036ea80: Add retry_count to files - -### Test Data -**Test Users**: -- Username: `admin` / Password: `admin123` (Admin role) -- Username: `testuser` / Password: `test123` (Regular user) - ---- - -## 🔧 Services Implemented - -### Core Services - -1. **Document Preprocessor** ([backend/app/services/preprocessor.py](../../../backend/app/services/preprocessor.py)) - - File format validation (PNG, JPG, JPEG, PDF, DOC, DOCX, PPT, PPTX) - - Office document MIME type detection - - ZIP-based integrity validation for modern Office formats - - Corruption detection - - Format standardization - - Status: 100% complete (Office format support integrated via sub-proposal) - -2. **OCR Service** ([backend/app/services/ocr_service.py](../../../backend/app/services/ocr_service.py)) - - PaddleOCR 3.x integration (PPStructureV3) - - Layout detection and preservation - - Multi-language support (ch, en, japan, korean) - - Office document to PDF conversion pipeline (via LibreOffice) - - Markdown and JSON output - - Status: 100% complete ⬅️ **Updated: Unit tests complete (48 tests passing)** - -3. **PDF Generator** ([backend/app/services/pdf_generator.py](../../../backend/app/services/pdf_generator.py)) - - Pandoc (preferred) + WeasyPrint (fallback) - - Three CSS templates: default, academic, business - - Chinese font support (Noto Sans CJK) - - Layout preservation - - Status: 100% complete ⬅️ **Updated: Unit tests complete (27 tests passing)** - -4. 
**File Manager** ([backend/app/services/file_manager.py](../../../backend/app/services/file_manager.py)) - - Batch directory management - - File access control - - Temporary file cleanup (via cleanup scheduler) - - Status: 100% complete ⬅️ **Updated: Unit tests complete (38 tests passing)** - -5. **Export Service** ([backend/app/services/export_service.py](../../../backend/app/services/export_service.py)) - - Six formats: TXT, JSON, Excel, Markdown, PDF, ZIP - - Rule-based filtering and formatting - - CRUD for export rules - - Status: 100% complete ⬅️ **Updated: Unit tests complete (37 tests passing)** - -6. **Background Tasks** ([backend/app/services/background_tasks.py](../../../backend/app/services/background_tasks.py)) - - Retry logic for OCR processing - - Automatic file cleanup scheduler - - PDF generation with retry - - Generic retry execution framework - - Status: 83% complete - -7. **Office Converter** ([backend/app/services/office_converter.py](../../../backend/app/services/office_converter.py)) ⬅️ **Integrated via sub-proposal** - - LibreOffice headless mode for Office to PDF conversion - - Support for DOC, DOCX, PPT, PPTX formats - - Automatic cleanup of temporary conversion files - - Integration with OCR processing pipeline - - Status: 100% complete (tested with 97.39% OCR accuracy) - -8. **Translation Service** (RESERVED) ([backend/app/services/translation_service.py](../../../backend/app/services/translation_service.py)) - - Stub implementation for Phase 5 - - Interface defined for future engines: Argos, ERNIE, Google, DeepL - - Status: Reserved (not implemented) - ---- - -## 🔌 API Endpoints - -### Authentication -- ✅ `POST /api/v1/auth/login` - JWT authentication - -### File Upload -- ✅ `POST /api/v1/upload` - Batch file upload with validation - -### OCR Processing -- ✅ `POST /api/v1/ocr/process` - Trigger OCR (uses background tasks with retry) -- ✅ `GET /api/v1/batch/{batch_id}/status` - Get batch status with progress -- ✅ `GET /api/v1/ocr/result/{file_id}` - Get OCR results - -### Export -- ✅ `POST /api/v1/export` - Export results (TXT, JSON, Excel, Markdown, PDF, ZIP) -- ✅ `GET /api/v1/export/pdf/{file_id}` - Generate layout-preserved PDF -- ✅ `GET /api/v1/export/rules` - List export rules -- ✅ `POST /api/v1/export/rules` - Create export rule -- ✅ `PUT /api/v1/export/rules/{rule_id}` - Update export rule -- ✅ `DELETE /api/v1/export/rules/{rule_id}` - Delete export rule -- ✅ `GET /api/v1/export/css-templates` - List CSS templates - -### Translation (RESERVED) -- ✅ `GET /api/v1/translate/status` - Feature status (returns "reserved") -- ✅ `GET /api/v1/translate/languages` - Planned languages -- ✅ `POST /api/v1/translate/document` - Returns 501 Not Implemented -- ✅ `GET /api/v1/translate/task/{task_id}` - Returns 501 Not Implemented -- ✅ `DELETE /api/v1/translate/task/{task_id}` - Returns 501 Not Implemented - -**API Documentation**: http://localhost:12010/docs (FastAPI auto-generated) - ---- - -## 🖥️ Environment Setup - -### Conda Environment -- Name: `tool_ocr` -- Python: 3.10 -- Platform: macOS Apple Silicon (ARM64) - -### Key Dependencies -- **FastAPI**: Web framework -- **PaddleOCR 3.x**: OCR engine with PPStructureV3 -- **SQLAlchemy**: ORM for MySQL -- **Alembic**: Database migrations -- **WeasyPrint + Pandoc**: PDF generation -- **LibreOffice**: Office document to PDF conversion (headless mode) -- **python-magic**: File type detection -- **bcrypt 4.2.1**: Password hashing (pinned for compatibility) -- **email-validator**: Email validation for Pydantic - -### System 
Dependencies -- **Homebrew packages**: - - `libmagic` - File type detection - - `pango`, `gdk-pixbuf`, `libffi` - WeasyPrint dependencies - - `font-noto-sans-cjk` - Chinese font support - - `pandoc` - Document conversion (optional) - - `libreoffice` - Office document conversion (headless mode) - -### Environment Variables -```bash -MYSQL_HOST=mysql.theaken.com -MYSQL_PORT=33306 -MYSQL_DATABASE=db_A060 -BACKEND_PORT=12010 -SECRET_KEY= -DYLD_LIBRARY_PATH=/opt/homebrew/lib:$DYLD_LIBRARY_PATH -``` - -### Critical Configuration -- **Database Prefix**: All tables use `paddle_ocr_` prefix (shared database) -- **File Retention**: 24 hours (automatic cleanup) -- **Cleanup Interval**: 1 hour -- **Retry Attempts**: 3 (configurable) -- **Retry Delay**: 5 seconds (configurable) - ---- - -## 🔧 Service Status - -### Backend Service -- **Status**: ✅ Running -- **URL**: http://localhost:12010 -- **Log File**: `/tmp/tool_ocr_startup.log` -- **Process**: Running via Uvicorn with auto-reload - -### Background Services -- **Cleanup Scheduler**: ✅ Running (interval: 3600s, retention: 24h) -- **OCR Processing**: ✅ Background tasks with retry logic - -### Health Check -```bash -curl http://localhost:12010/health -# Response: {"status":"healthy","service":"Tool_OCR","version":"0.1.0"} -``` - ---- - -## 📝 Known Issues & Workarounds - -### 1. Shared Database Environment -- **Issue**: Database contains tables from other projects -- **Solution**: All tables use `paddle_ocr_` prefix for namespace isolation -- **Important**: NEVER drop tables in migrations (only create) - -### 2. PaddleOCR 3.x Compatibility -- **Issue**: Parameters `show_log` and `use_gpu` removed in PaddleOCR 3.x -- **Solution**: Updated service to remove obsolete parameters -- **Issue**: `PPStructure` renamed to `PPStructureV3` -- **Solution**: Updated imports - -### 3. Bcrypt Version -- **Issue**: Latest bcrypt incompatible with passlib -- **Solution**: Pinned to `bcrypt==4.2.1` - -### 4. WeasyPrint on macOS -- **Issue**: Missing shared libraries -- **Solution**: Install via Homebrew and set `DYLD_LIBRARY_PATH` - -### 5. First OCR Run -- **Issue**: First OCR test may fail as PaddleOCR downloads models (~900MB) -- **Solution**: Wait for download to complete, then retry -- **Model Location**: `~/.paddlex/` - ---- - -## 🧪 Test Coverage - -### Unit Tests Summary -**Total Tests**: 187 -**Passed**: 182 ✅ (97.3% pass rate) -**Skipped**: 5 (acceptable - technical limitations or covered elsewhere) -**Failed**: 0 ✅ - -### Test Breakdown by Module - -1. **test_preprocessor.py**: 32 tests ✅ - - Format validation (PNG, JPG, PDF, Office formats) - - MIME type mapping - - Integrity validation - - File information extraction - - Edge cases - -2. **test_ocr_service.py**: 48 tests ✅ - - PaddleOCR 3.x integration - - Layout detection and preservation - - Markdown generation - - JSON output - - Real image processing (demo_docs/basic/english.png) - - Structure engine initialization - -3. **test_pdf_generator.py**: 27 tests ✅ - - Pandoc integration - - WeasyPrint fallback - - CSS template management - - Unicode and table support - - Error handling - -4. **test_file_manager.py**: 38 tests ✅ - - File upload validation - - Batch management - - Access control - - Cleanup operations - -5. **test_export_service.py**: 37 tests ✅ - - Six export formats (TXT, JSON, Excel, Markdown, PDF, ZIP) - - Rule-based filtering and formatting - - Export rule CRUD operations - -6. 
**test_api_integration.py**: 5 tests ✅ - - API endpoint integration - - JWT authentication - - Upload and OCR workflow - -### Skipped Tests (Acceptable) -1. `test_export_txt_success` - FileResponse validation (covered in unit tests) -2. `test_generate_pdf_success` - FileResponse validation (covered in unit tests) -3. `test_create_export_rule` - SQLite session isolation (works with MySQL) -4. `test_update_export_rule` - SQLite session isolation (works with MySQL) -5. `test_validate_upload_file_too_large` - Complex UploadFile mock (covered in integration) - -### Test Coverage Achievements -- ✅ All service layers tested with comprehensive unit tests -- ✅ PaddleOCR 3.x format compatibility verified -- ✅ Real image processing with demo samples -- ✅ Edge cases and error handling covered -- ✅ Integration tests for critical workflows - ---- - -## 🌐 Phase 2: Frontend API Schema Alignment (2025-11-12) - -### Issue Summary -During frontend development, identified 6 critical API mismatches between frontend expectations and backend implementation that blocked upload, processing, and results preview functionality. - -### 🐛 API Mismatches Fixed - -**1. Upload Response Structure** ⬅️ **FIXED** -- **Problem**: Backend returned `OCRBatchResponse` with `id` field, frontend expected `{ batch_id, files }` -- **Solution**: Created `UploadBatchResponse` schema in [backend/app/schemas/ocr.py:91-115](../../../backend/app/schemas/ocr.py#L91-L115) -- **Impact**: Upload now returns correct structure, fixes "no response after upload" issue -- **Files Modified**: - - `backend/app/schemas/ocr.py` - Added UploadBatchResponse schema - - `backend/app/routers/ocr.py:38,72-75` - Updated response_model and return format - -**2. Error Field Naming** ⬅️ **FIXED** -- **Problem**: Frontend read `file.error`, backend had `error_message` field -- **Solution**: Added Pydantic validation_alias in [backend/app/schemas/ocr.py:21](../../../backend/app/schemas/ocr.py#L21) -- **Code**: `error: Optional[str] = Field(None, validation_alias='error_message')` -- **Impact**: Error messages now display correctly in ProcessingPage - -**3. Markdown Content Missing** ⬅️ **FIXED** -- **Problem**: Frontend needed `markdown_content` for preview, only path was provided -- **Solution**: Added field to OCRResultResponse in [backend/app/schemas/ocr.py:35](../../../backend/app/schemas/ocr.py#L35) -- **Code**: `markdown_content: Optional[str] = None # Added for frontend preview` -- **Impact**: Markdown preview now works in ResultsPage - -**4. Export Options Schema Missing** ⬅️ **FIXED** -- **Problem**: Frontend sent `options` object, backend didn't accept it -- **Solution**: Created ExportOptions schema in [backend/app/schemas/export.py:10-15](../../../backend/app/schemas/export.py#L10-L15) -- **Fields**: `confidence_threshold`, `include_metadata`, `filename_pattern`, `css_template` -- **Impact**: Advanced export options now supported - -**5. CSS Template Filename Field** ⬅️ **FIXED** -- **Problem**: Frontend needed `filename`, backend only had `name` and `description` -- **Solution**: Added filename field to CSSTemplateResponse in [backend/app/schemas/export.py:82](../../../backend/app/schemas/export.py#L82) -- **Code**: `filename: str = Field(..., description="Template filename")` -- **Impact**: CSS template selector now works correctly - -**6. 
OCR Result Detail Structure** ⬅️ **FIXED** (Critical) -- **Problem**: ResultsPage showed "檢視 Markdown - undefined" because: - - Backend returned nested `{ file: {...}, result: {...} }` structure - - Frontend expected flat structure with `filename`, `confidence`, `markdown_content` at root -- **Solution**: Created OCRResultDetailResponse schema in [backend/app/schemas/ocr.py:77-89](../../../backend/app/schemas/ocr.py#L77-L89) -- **Solution**: Updated endpoint in [backend/app/routers/ocr.py:181-240](../../../backend/app/routers/ocr.py#L181-L240) to: - - Read markdown content from filesystem - - Build flattened JSON data structure - - Return all fields frontend expects at root level -- **Impact**: - - MarkdownPreview now shows correct filename in title - - Confidence and processing time display correctly - - Markdown content loads and displays properly - -### ✅ Frontend Functionality Restored - -**Upload Flow**: -1. ✅ Files upload with progress indication -2. ✅ Toast notification on success -3. ✅ Automatic redirect to Processing page -4. ✅ Batch ID and files stored in Zustand state - -**Processing Flow**: -1. ✅ Batch status polling works -2. ✅ Progress percentage updates in real-time -3. ✅ File status badges display correctly (pending/processing/completed/failed) -4. ✅ Error messages show when files fail -5. ✅ Automatic redirect to Results when complete - -**Results Flow**: -1. ✅ Batch summary displays (batch ID, completed count) -2. ✅ Results table shows all files with actions -3. ✅ Click file to view markdown preview -4. ✅ Markdown title shows correct filename (not "undefined") -5. ✅ Confidence and processing time display correctly -6. ✅ PDF download works -7. ✅ Export button navigates to export page - -### 📝 Additional Frontend Fixes - -**1. ResultsPage.tsx** ([frontend/src/pages/ResultsPage.tsx:134-143](../../../frontend/src/pages/ResultsPage.tsx#L134-L143)) -- Added null checks for undefined values: - - `(ocrResult.confidence || 0)` - Prevents .toFixed() on undefined - - `(ocrResult.processing_time || 0)` - Prevents .toFixed() on undefined - - `ocrResult.json_data?.total_text_regions || 0` - Safe optional chaining - -**2. 
ProcessingPage.tsx** (Already functional) -- Batch ID validation working -- Status polling implemented correctly -- Error handling complete - -### 🔧 API Endpoints Updated - -**Upload Endpoint**: -```typescript -POST /api/v1/upload -Response: { batch_id: number, files: OCRFileResponse[] } -``` - -**Batch Status Endpoint**: -```typescript -GET /api/v1/batch/{batch_id}/status -Response: { batch: OCRBatchResponse, files: OCRFileResponse[] } -``` - -**OCR Result Endpoint** (New flattened structure): -```typescript -GET /api/v1/ocr/result/{file_id} -Response: { - file_id: number - filename: string - status: string - markdown_content: string - json_data: {...} - confidence: number - processing_time: number -} -``` - -### 🎯 Testing Verified -- ✅ File upload with toast notification -- ✅ Redirect to processing page -- ✅ Processing status polling -- ✅ Completed batch redirect to results -- ✅ Results table display -- ✅ Markdown preview with correct filename -- ✅ Confidence and processing time display -- ✅ PDF download functionality - -### 📊 Phase 2 Progress Update -- Task 12: UI Components - **70% complete** (MarkdownPreview working, missing Export/Rule editors) -- Task 13: Pages - **100% complete** (All core pages functional) -- Task 14: API Integration - **100% complete** (All API schemas aligned) - -**Phase 2 Overall**: ~92% complete (Core user journey working end-to-end) - ---- - -## 🎯 Next Steps - -### Immediate (Complete Phase 1) -1. ~~**Write Unit Tests** (Tasks 3.6, 4.10, 5.9, 6.7, 7.10)~~ ✅ **COMPLETE** - - ~~Preprocessor tests~~ ✅ - - ~~OCR service tests~~ ✅ - - ~~PDF generator tests~~ ✅ - - ~~File manager tests~~ ✅ - - ~~Export service tests~~ ✅ - -2. **API Integration Tests** (Task 8.14) - - End-to-end workflow tests - - Authentication tests - - Error handling tests - -3. 
**Final Phase 1 Documentation** - - API usage examples - - Deployment guide - - Performance benchmarks - -### Phase 2: Frontend Development (~92% complete; see Phase 2 update above) -- Task 11: Frontend project structure (Vite + React + TypeScript) -- Task 12: UI components (shadcn/ui) -- Task 13: Pages (Login, Upload, Processing, Results, Export) -- Task 14: API integration - -### Phase 3: Testing & Optimization -- Comprehensive testing -- Performance optimization -- Documentation completion - -### Phase 4: Deployment -- Production environment setup -- 1Panel deployment -- SSL configuration -- Monitoring setup - -### Phase 5: Translation Feature (Future) -- Choose translation engine (Argos/ERNIE/Google/DeepL) -- Implement translation service -- Update UI to enable translation features - ---- - -## 📚 Documentation - -### Setup Documentation -- [SETUP.md](../../../SETUP.md) - Environment setup and installation -- [README.md](../../../README.md) - Project overview - -### OpenSpec Documentation -- [SPEC.md](./SPEC.md) - Complete specification -- [tasks.md](./tasks.md) - Task breakdown and progress -- [STATUS.md](./STATUS.md) - This file -- [OFFICE_INTEGRATION.md](./OFFICE_INTEGRATION.md) - Office document support integration summary - -### Sub-Proposals -- [add-office-document-support](../add-office-document-support/PROPOSAL.md) - Office format support (✅ INTEGRATED) - -### API Documentation -- **Interactive Docs**: http://localhost:12010/docs -- **ReDoc**: http://localhost:12010/redoc - ---- - -## 🔍 Testing Commands - -### Start Backend -```bash -source ~/.zshrc -conda activate tool_ocr -export DYLD_LIBRARY_PATH=/opt/homebrew/lib:$DYLD_LIBRARY_PATH -python -m app.main -``` - -### Test Service Layer -```bash -cd backend -python test_services.py -``` - -### Test API (Login) -```bash -curl -X POST http://localhost:12010/api/v1/auth/login \ - -H "Content-Type: application/json" \ - -d '{"username": "admin", "password": "admin123"}' -``` - -### Check Cleanup Scheduler -```bash -tail -f /tmp/tool_ocr_startup.log | grep cleanup -``` - -### Check Batch Progress -```bash -curl http://localhost:12010/api/v1/batch/{batch_id}/status -``` - ---- - -## 📞 Support & Feedback - -- **Project**: Tool_OCR - OCR Batch Processing System -- **Development Approach**: OpenSpec-driven development -- **Current Status**: Phase 2 Frontend ~92% complete ⬅️ **Updated: Core user journey working end-to-end** -- **Backend Test Coverage**: 182/187 tests passing (97.3%) -- **Next Milestone**: Complete remaining UI components (Export/Rule editors), Phase 3 testing - ---- - -**Status Summary**: -- **Phase 1 (Backend)**: ~98% complete - All core functionality working with comprehensive test coverage -- **Phase 2 (Frontend)**: ~92% complete - Core user journey (Upload → Processing → Results) fully functional -- **Recent Work**: Fixed 6 critical API schema mismatches between frontend and backend, enabling end-to-end workflow -- **Verification**: Upload, OCR processing, and results preview all working correctly with proper error handling diff --git a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/design.md b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/design.md deleted file mode 100644 index 88dd984..0000000 --- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/design.md +++ /dev/null @@ -1,313 +0,0 @@ -# Technical Design Document - -## Context -Tool_OCR is a web-based batch OCR processing system with frontend-backend separation architecture.
The system needs to handle large file uploads, long-running OCR tasks, and multiple export formats while maintaining responsive UI and efficient resource usage. - -**Key stakeholders:** -- End users: Need simple, fast, reliable OCR processing -- Developers: Need maintainable, testable code architecture -- Operations: Need easy deployment via 1Panel, monitoring, and error tracking - -**Constraints:** -- Development on Windows with Conda (Python 3.10) -- Deployment on Linux server via 1Panel (no Docker) -- Port range: 12010-12019 -- External MySQL database (mysql.theaken.com:33306) -- PaddleOCR models (~100-200MB per language) -- Max file upload: 20MB per file, 100MB per batch - -## Goals / Non-Goals - -### Goals -- Process images and PDFs with multi-language OCR (Chinese, English, Japanese, Korean) -- Handle batch uploads with real-time progress tracking -- Provide flexible export formats (TXT, JSON, Excel) with custom rules -- Maintain responsive UI during long-running OCR tasks -- Enable easy deployment and maintenance via 1Panel - -### Non-Goals -- Real-time OCR streaming (batch processing only) -- Cloud-based OCR services (local processing only) -- Mobile app support (web UI only, desktop/tablet optimized) -- Advanced image editing or annotation features -- Multi-tenant SaaS architecture (single deployment per organization) - -## Decisions - -### Decision 1: FastAPI for Backend Framework -**Choice:** Use FastAPI instead of Flask or Django - -**Rationale:** -- Native async/await support for I/O-bound operations (file upload, database queries) -- Automatic OpenAPI documentation (Swagger UI) -- Built-in Pydantic validation for type safety -- Better performance for concurrent requests -- Modern Python 3.10+ features (type hints, async) - -**Alternatives considered:** -- Flask: Simpler but lacks native async, requires extensions -- Django: Too heavyweight for API-only backend, includes unnecessary ORM features - -### Decision 2: PaddleOCR as OCR Engine -**Choice:** Use PaddleOCR instead of Tesseract or cloud APIs - -**Rationale:** -- Excellent Chinese/multilingual support (key requirement) -- Higher accuracy with deep learning models -- Offline operation (no API costs or internet dependency) -- Active development and good documentation -- GPU acceleration support (optional) - -**Alternatives considered:** -- Tesseract: Lower accuracy for Chinese, older technology -- Google Cloud Vision / AWS Textract: Requires internet, ongoing costs, data privacy concerns - -### Decision 3: React Query for API State Management -**Choice:** Use React Query (TanStack Query) instead of Redux - -**Rationale:** -- Designed specifically for server state (API calls, caching, refetching) -- Built-in loading/error states -- Automatic background refetching and cache invalidation -- Reduces boilerplate compared to Redux -- Better for our API-heavy use case - -**Alternatives considered:** -- Redux: Overkill for server state, more boilerplate -- Plain Axios: Requires manual loading/error state management - -### Decision 4: Zustand for Client State -**Choice:** Use Zustand for global UI state (separate from React Query) - -**Rationale:** -- Lightweight (1KB) and simple API -- No providers or context required -- TypeScript-friendly -- Works well alongside React Query -- Only for UI state (selected files, filters, etc.) 
- -### Decision 5: Background Task Processing -**Choice:** FastAPI BackgroundTasks for OCR processing (no external queue initially) - -**Rationale:** -- Built-in FastAPI feature, no additional dependencies -- Sufficient for single-server deployment -- Simpler deployment and maintenance -- Can migrate to Redis/Celery later if needed - -**Migration path:** If scale requires, add Redis + Celery for distributed task queue - -**Alternatives considered:** -- Celery + Redis: More complex, overkill for initial deployment -- Threading: FastAPI BackgroundTasks already uses thread pool - -### Decision 6: File Storage Strategy -**Choice:** Local filesystem with automatic cleanup (24-hour retention) - -**Rationale:** -- Simple implementation, no S3/cloud storage costs -- OCR results stored in database (permanent) -- Original files temporary, only needed during processing -- Automatic cleanup prevents disk space issues - -**Storage structure:** -``` -uploads/ - {batch_id}/ - {file_id}_original.png - {file_id}_preprocessed.png (if preprocessing enabled) -``` - -**Cleanup:** Daily cron job or background task deletes files older than 24 hours - -### Decision 7: Real-time Progress Updates -**Choice:** HTTP polling instead of WebSocket - -**Rationale:** -- Simpler implementation and deployment -- Works better with Nginx reverse proxy and 1Panel -- Sufficient UX for batch processing (poll every 2 seconds) -- No need for persistent connections - -**API:** `GET /api/v1/batch/{batch_id}/status` returns progress percentage - -**Alternatives considered:** -- WebSocket: More complex, requires special Nginx config, overkill for this use case - -### Decision 8: Database Schema Design -**Choice:** Separate tables for tasks, files, and results (normalized) - -**Schema:** -```sql -users (id, username, password_hash, created_at) -ocr_batches (id, user_id, status, created_at, completed_at) -ocr_files (id, batch_id, filename, file_path, file_size, status) -ocr_results (id, file_id, text, bbox_json, confidence, language) -export_rules (id, user_id, rule_name, config_json) -``` - -**Rationale:** -- Normalized for data integrity -- Supports batch tracking and partial failures -- Easy to query individual file results or batch statistics -- Export rules reusable across users - -### Decision 9: Export Rule Configuration Format -**Choice:** JSON-based rule configuration stored in database - -**Example rule:** -```json -{ - "filters": { - "min_confidence": 0.8, - "filename_pattern": "^invoice_.*" - }, - "formatting": { - "add_line_numbers": true, - "sort_by_position": true, - "group_by_page": true - }, - "output": { - "format": "txt", - "encoding": "utf-8", - "line_separator": "\n" - } -} -``` - -**Rationale:** -- Flexible and extensible -- Easy to validate with JSON schema -- Can be edited via UI or API -- Supports complex rules without database schema changes - -### Decision 10: Deployment Architecture (1Panel) -**Choice:** Nginx (static files + reverse proxy) + Supervisor (backend process manager) - -**Architecture:** -``` -[Client Browser] - ↓ -[Nginx :80/443] (managed by 1Panel) - ↓ - ├─ / → Frontend static files (React build) - ├─ /assets → Static assets - └─ /api → Reverse proxy to backend :12010 - ↓ - [FastAPI Backend :12010] (managed by Supervisor) - ↓ - [MySQL :33306] (external) -``` - -**Rationale:** -- 1Panel provides GUI for Nginx management -- Supervisor ensures backend auto-restart on failure -- No Docker simplifies deployment on existing infrastructure -- Standard Nginx config works without special 1Panel 
requirements - -**Supervisor config:** -```ini -[program:tool_ocr_backend] -command=/home/user/.conda/envs/tool_ocr/bin/uvicorn app.main:app --host 127.0.0.1 --port 12010 -directory=/path/to/Tool_OCR/backend -user=www-data -autostart=true -autorestart=true -``` - -## Risks / Trade-offs - -### Risk 1: OCR Processing Time for Large Batches -**Risk:** Processing 50+ images may take 5-10 minutes, potential timeout - -**Mitigation:** -- Use FastAPI BackgroundTasks to avoid HTTP timeout -- Return batch_id immediately, client polls for status -- Display progress bar with estimated time remaining -- Limit max batch size to 50 files (configurable) -- Add worker concurrency limit to prevent resource exhaustion - -### Risk 2: PaddleOCR Model Download on First Run -**Risk:** Models are 100-200MB, first-time download may fail or be slow - -**Mitigation:** -- Pre-download models during deployment setup -- Provide manual download script for offline installation -- Cache models in shared directory for all users -- Include model version in deployment docs - -### Risk 3: File Upload Size Limits -**Risk:** Users may try to upload very large PDFs (>20MB) - -**Mitigation:** -- Enforce 20MB per file, 100MB per batch limits in frontend and backend -- Display clear error messages with limit information -- Provide guidance on compressing PDFs or splitting large files -- Consider adding image downsampling for huge images - -### Risk 4: Concurrent User Scaling -**Risk:** Multiple users uploading simultaneously may overwhelm CPU/memory - -**Mitigation:** -- Limit concurrent OCR workers (e.g., 4 workers max) -- Implement task queue with FastAPI BackgroundTasks -- Monitor resource usage and add throttling if needed -- Document recommended server specs (8GB RAM, 4 CPU cores) - -### Risk 5: Database Connection Pool Exhaustion -**Risk:** External MySQL may have connection limits - -**Mitigation:** -- Configure SQLAlchemy connection pool (max 20 connections) -- Use connection pooling with proper timeout settings -- Close connections properly in all API endpoints -- Add health check endpoint to monitor database connectivity - -## Migration Plan - -### Phase 1: Initial Deployment -1. Setup Conda environment on production server -2. Install Python dependencies and download OCR models -3. Configure MySQL database and create tables -4. Build frontend static files (`npm run build`) -5. Configure Nginx via 1Panel (upload nginx.conf) -6. Setup Supervisor for backend process -7. Test with sample images - -### Phase 2: Production Rollout -1. Create admin user account -2. Import sample export rules -3. Perform smoke tests (upload, OCR, export) -4. Monitor logs for errors -5. Setup daily cleanup cron job for old files -6. Enable HTTPS via 1Panel (Let's Encrypt) - -### Phase 3: Monitoring and Optimization -1. Add application logging (file + console) -2. Monitor resource usage (CPU, memory, disk) -3. Optimize slow queries if needed -4. Tune worker concurrency based on actual load -5. Collect user feedback and iterate - -### Rollback Plan -- Keep previous version in separate directory -- Use Supervisor to stop current version and start previous -- Database migrations should be backward compatible -- If major issues, restore database from backup - -## Open Questions - -1. **Should we add user registration, or use admin-created accounts only?** - - Recommendation: Start with admin-created accounts for security, add registration later if needed - -2. 
**Do we need audit logging for compliance?** - - Recommendation: Add basic audit trail (who uploaded what, when) in database - -3. **Should we support GPU acceleration for PaddleOCR?** - - Recommendation: Optional, detect GPU on startup, fallback to CPU if unavailable - -4. **What's the desired behavior for duplicate filenames in a batch?** - - Recommendation: Auto-rename with suffix (e.g., `file.png`, `file_1.png`) - -5. **Should export rules be shareable across users or private?** - - Recommendation: Private by default, add "public templates" feature later diff --git a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/proposal.md b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/proposal.md deleted file mode 100644 index a28e238..0000000 --- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/proposal.md +++ /dev/null @@ -1,48 +0,0 @@ -# Change: Add OCR Batch Processing System with Structure Extraction - -## Why -Users need a web-based solution to extract text, images, and structure from multiple document files efficiently. Current manual text extraction is time-consuming and error-prone. This system will automate the process with multi-language OCR support (Chinese, English, etc.), intelligent layout analysis to understand document structure, and provide flexible export options including searchable PDF with embedded images. The extracted content preserves logical structure and reading order (not pixel-perfect visual layout). The system also reserves architecture for future document translation capabilities. - -## What Changes -- Add core OCR processing capability using **PaddleOCR-VL** (vision-language model for document parsing) -- Implement **document structure analysis** with PP-StructureV3 to identify titles, paragraphs, tables, images, formulas -- Extract and **preserve document images** alongside text content -- Support unified input preprocessing (convert any format to images/PDF for OCR processing) -- Implement batch file upload and processing (images: PNG, JPG, PDF files) -- Support multi-language text recognition (Chinese traditional/simplified, English, Japanese, Korean) - 109 languages via PaddleOCR-VL -- Add **Markdown intermediate format** for structured document representation with embedded images -- Implement **searchable PDF generation** from Markdown with images (Pandoc + WeasyPrint) -- Generate PDFs that preserve logical structure and reading order (not exact visual layout) -- Add rule-based output formatting system for organizing extracted text -- Implement multiple export formats (TXT, JSON, Excel, **Markdown with images, searchable PDF**) -- Create web UI with drag-and-drop file upload -- Build RESTful API for OCR processing with progress tracking -- Add background task processing for long-running OCR jobs -- **Reserve translation module architecture** (UI placeholders + API endpoints for future implementation) - -## Impact -- **New capabilities**: - - `ocr-processing`: Core OCR text and image extraction with structure analysis (PaddleOCR-VL + PP-StructureV3) - - `file-management`: File upload, validation, and storage with format standardization - - `export-results`: Multi-format export with custom rules, including searchable PDF with embedded images - - `translation` (reserved): Architecture for future translation features - -- **Affected code**: - - New backend: `app/` (FastAPI application structure) - - New frontend: `frontend/` (React + Vite application) - - New database tables: `ocr_tasks`, `ocr_results`, `export_rules`, 
`translation_configs` (reserved) - -- **Dependencies**: - - Backend: fastapi, paddleocr (3.0+), paddlepaddle, pdf2image, pandas, pillow, weasyprint, markdown, pandoc (system) - - Frontend: react, vite, tailwindcss, shadcn/ui, axios, react-query - - Translation engines (reserved): argostranslate (offline) or API integration - -- **Configuration**: - - MySQL database connection (external server) - - PaddleOCR-VL model storage (~900MB) and language packs - - Pandoc installation for PDF generation - - Basic CSS template for readable PDF output (not for visual layout replication) - - Image storage directory for extracted images - - File upload size limits and supported formats - - Port configuration (12010 for backend, 12011 for frontend dev) - - Translation service config (reserved for future) diff --git a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/export-results/spec.md b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/export-results/spec.md deleted file mode 100644 index e57b72e..0000000 --- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/export-results/spec.md +++ /dev/null @@ -1,175 +0,0 @@ -# Export Results Specification - -## ADDED Requirements - -### Requirement: Plain Text Export -The system SHALL export OCR results as plain text files with configurable formatting. - -#### Scenario: Export single file result as TXT -- **WHEN** user selects a completed OCR task and chooses TXT export -- **THEN** the system generates a .txt file with extracted text -- **AND** preserves line breaks based on bounding box positions -- **AND** returns downloadable file - -#### Scenario: Export batch results as TXT -- **WHEN** user exports a batch with 5 files as TXT -- **THEN** the system creates a ZIP file containing 5 .txt files -- **AND** names each file as `{original_filename}_ocr.txt` -- **AND** returns the ZIP for download - -### Requirement: JSON Export -The system SHALL export OCR results as structured JSON with full metadata. - -#### Scenario: Export with metadata -- **WHEN** user selects JSON export format -- **THEN** the system generates JSON containing: - - File information (name, size, format) - - OCR results array with text, bounding boxes, confidence - - Processing metadata (timestamp, language, model version) - - Task status and statistics - -#### Scenario: JSON export example structure -- **WHEN** export is generated -- **THEN** JSON structure follows this format: -```json -{ - "file_name": "document.png", - "file_size": 1024000, - "upload_time": "2025-01-01T10:00:00Z", - "processing_time": 2.5, - "language": "zh-TW", - "results": [ - { - "text": "範例文字", - "bbox": [100, 50, 200, 80], - "confidence": 0.95 - } - ], - "status": "completed" -} -``` - -### Requirement: Excel Export -The system SHALL export OCR results as Excel spreadsheets with tabular format. - -#### Scenario: Single file Excel export -- **WHEN** user selects Excel export for one file -- **THEN** the system generates .xlsx file with columns: - - Row Number - - Recognized Text - - Confidence Score - - Bounding Box (X, Y, Width, Height) - - Language - -#### Scenario: Batch Excel export with multiple sheets -- **WHEN** user exports batch with 3 files as Excel -- **THEN** the system creates one .xlsx file with 3 sheets -- **AND** names each sheet as the original filename -- **AND** includes summary sheet with statistics - -### Requirement: Rule-Based Output Formatting -The system SHALL apply user-defined rules to format exported text. 
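A minimal sketch of how such a rule could be applied, assuming OCR output as a list of region dicts with `text`, `bbox`, and `confidence` keys and a rule object shaped like the JSON configuration in the design document (names are illustrative, not the actual service API):

```python
# Illustrative only: applies filter/format keys from an export rule config.
# Assumes regions = [{"text": ..., "bbox": [x, y, w, h], "confidence": ...}, ...]

def apply_export_rule(regions: list[dict], rule: dict) -> str:
    filters = rule.get("filters", {})
    formatting = rule.get("formatting", {})

    # Filter: drop regions below the configured confidence threshold
    min_conf = filters.get("min_confidence", 0.0)
    kept = [r for r in regions if r["confidence"] >= min_conf]

    # Sort: approximate natural reading order (top-to-bottom, then left-to-right)
    if formatting.get("sort_by_position"):
        kept.sort(key=lambda r: (r["bbox"][1], r["bbox"][0]))

    lines = [r["text"] for r in kept]

    # Format: optional line numbers, e.g. "1. first line"
    if formatting.get("add_line_numbers"):
        lines = [f"{i}. {line}" for i, line in enumerate(lines, start=1)]

    separator = rule.get("output", {}).get("line_separator", "\n")
    return separator.join(lines)
```

The filename grouping and page grouping scenarios below follow the same pattern: each rule key maps to one small, composable transformation.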
- -#### Scenario: Group by filename pattern -- **WHEN** user defines rule "group files with prefix 'invoice_'" -- **THEN** the system groups all matching files together -- **AND** exports them in a single combined file or folder - -#### Scenario: Filter by confidence threshold -- **WHEN** user sets export rule "minimum confidence 0.8" -- **THEN** the system excludes text with confidence < 0.8 from export -- **AND** includes only high-confidence results - -#### Scenario: Custom text formatting -- **WHEN** user defines rule "add line numbers" -- **THEN** the system prepends line numbers to each text line -- **AND** formats output as: `1. 第一行文字\n2. 第二行文字` - -#### Scenario: Sort by reading order -- **WHEN** user enables "sort by position" rule -- **THEN** the system orders text by vertical position (top to bottom) -- **AND** then by horizontal position (left to right) within each row -- **AND** exports text in natural reading order - -### Requirement: Export Rule Configuration -The system SHALL allow users to save and reuse export rules. - -#### Scenario: Save custom export rule -- **WHEN** user creates a rule with name "高品質發票輸出" -- **THEN** the system saves the rule to database -- **AND** associates it with the user account -- **AND** makes it available in rule selection dropdown - -#### Scenario: Apply saved rule -- **WHEN** user selects a saved rule for export -- **THEN** the system applies all configured filters and formatting -- **AND** generates output according to rule settings - -#### Scenario: Edit existing rule -- **WHEN** user modifies a saved rule -- **THEN** the system updates the rule configuration -- **AND** preserves the rule ID for continuity - -### Requirement: Markdown Export with Structure and Images -The system SHALL export OCR results as Markdown files preserving document logical structure with accompanying images. - -#### Scenario: Export as Markdown with structure and images -- **WHEN** user selects Markdown export format -- **THEN** the system generates .md file with logical structure -- **AND** includes headings, paragraphs, tables, lists in proper hierarchy -- **AND** embeds image references pointing to extracted images (![](./images/img1.jpg)) -- **AND** maintains reading order from OCR analysis -- **AND** includes extracted images in an images/ folder - -#### Scenario: Batch Markdown export with images -- **WHEN** user exports batch with 5 files as Markdown -- **THEN** the system creates 5 separate .md files -- **AND** creates corresponding images/ folders for each document -- **AND** optionally creates combined .md with page separators -- **AND** returns ZIP file containing all Markdown files and images - -### Requirement: Searchable PDF Export with Images -The system SHALL generate searchable PDF files that include extracted text and images, preserving logical document structure (not exact visual layout). 
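A minimal sketch of the conversion step, using the `markdown` and `weasyprint` packages from the proposal's dependency list (Pandoc remains the primary path in the design; the CSS values and font name here are placeholder assumptions):

```python
# Illustrative WeasyPrint fallback: Markdown -> HTML -> searchable PDF.
import markdown
from weasyprint import HTML, CSS

BASIC_CSS = """
@page { size: A4; margin: 2cm; }
body { font-family: "Noto Sans CJK TC", sans-serif; line-height: 1.5; }
img { max-width: 100%; }
"""

def markdown_to_pdf(md_text: str, output_path: str, base_dir: str) -> None:
    # The "tables" extension keeps Markdown tables as HTML tables in the PDF
    html_body = markdown.markdown(md_text, extensions=["tables", "fenced_code"])
    # base_url resolves relative image references such as ./images/img1.jpg
    HTML(string=html_body, base_url=base_dir).write_pdf(
        output_path, stylesheets=[CSS(string=BASIC_CSS)]
    )
```

Because WeasyPrint lays out real text rather than rasterising pages, the resulting PDF stays selectable and searchable, which is the property the scenarios below verify.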
- -#### Scenario: Single document PDF export with images -- **WHEN** user requests PDF export from OCR result -- **THEN** the system converts Markdown to HTML with basic CSS styling -- **AND** embeds extracted images from images/ folder -- **AND** generates PDF using Pandoc + WeasyPrint -- **AND** preserves document hierarchy, tables, and reading order -- **AND** images appear near their logical position in text flow -- **AND** uses appropriate Chinese font (Noto Sans CJK) -- **AND** produces searchable PDF with selectable text - -#### Scenario: Basic PDF formatting options -- **WHEN** user selects PDF export -- **THEN** the system applies basic readable formatting -- **AND** sets standard margins and page size (A4) -- **AND** uses consistent fonts and spacing -- **AND** ensures images fit within page width -- **NOTE** CSS templates are for basic readability, not for replicating original visual design - -#### Scenario: Batch PDF export with images -- **WHEN** user exports batch as PDF -- **THEN** the system generates individual PDF for each document with embedded images -- **OR** creates single merged PDF with page breaks -- **AND** maintains consistent formatting across all pages -- **AND** returns ZIP of PDFs or single merged PDF - -### Requirement: Export Format Selection -The system SHALL provide UI for selecting export format and options. - -#### Scenario: Format selection with preview -- **WHEN** user opens export dialog -- **THEN** the system displays format options (TXT, JSON, Excel, **Markdown with images, Searchable PDF**) -- **AND** shows preview of output structure for selected format -- **AND** allows applying custom rules for text filtering -- **AND** provides basic formatting option for PDF (standard readable format) - -#### Scenario: Batch export with format choice -- **WHEN** user selects multiple completed tasks -- **THEN** the system enables batch export button -- **AND** prompts for format selection -- **AND** generates combined export file -- **AND** shows progress bar for PDF generation (slower due to image processing) -- **AND** includes all extracted images when exporting Markdown or PDF diff --git a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/file-management/spec.md b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/file-management/spec.md deleted file mode 100644 index 0ccee30..0000000 --- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/file-management/spec.md +++ /dev/null @@ -1,96 +0,0 @@ -# File Management Specification - -## ADDED Requirements - -### Requirement: File Upload Validation -The system SHALL validate uploaded files for type, size, and content before processing. 
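The scenarios below pin down the expected behaviour; a minimal validation sketch, assuming Pillow for the image integrity check (function and constant names are illustrative):

```python
# Illustrative upload validation: type, size, then content integrity.
from pathlib import Path
from PIL import Image

ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}
MAX_FILE_SIZE = 20 * 1024 * 1024  # 20 MB per file, per spec

def validate_upload(path: Path) -> tuple[bool, str | None]:
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        return False, "不支援的文件類型,僅支援 PNG, JPG, JPEG, PDF"
    if path.stat().st_size > MAX_FILE_SIZE:
        return False, "文件大小超過限制 (最大 20MB)"
    if path.suffix.lower() != ".pdf":
        try:
            with Image.open(path) as img:
                img.verify()  # raises on truncated or corrupted image data
        except Exception:
            return False, "文件損壞,無法處理"
    return True, None
```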
- -#### Scenario: Valid image upload -- **WHEN** user uploads a PNG file of 5MB -- **THEN** the system accepts the file -- **AND** stores it in temporary upload directory -- **AND** returns upload success with file ID - -#### Scenario: Oversized file rejection -- **WHEN** user uploads a file larger than 20MB -- **THEN** the system rejects the file -- **AND** returns error message "文件大小超過限制 (最大 20MB)" -- **AND** does not store the file - -#### Scenario: Invalid file type rejection -- **WHEN** user uploads a .exe or .zip file -- **THEN** the system rejects the file -- **AND** returns error message "不支援的文件類型,僅支援 PNG, JPG, JPEG, PDF" - -#### Scenario: Corrupted image detection -- **WHEN** user uploads a corrupted image file -- **THEN** the system attempts to open the file -- **AND** detects corruption during validation -- **AND** returns error message "文件損壞,無法處理" - -### Requirement: Supported File Formats -The system SHALL support PNG, JPG, JPEG, and PDF file formats for OCR processing. - -#### Scenario: PNG image processing -- **WHEN** user uploads a .png file -- **THEN** the system processes it directly with PaddleOCR - -#### Scenario: JPG/JPEG image processing -- **WHEN** user uploads a .jpg or .jpeg file -- **THEN** the system processes it directly with PaddleOCR - -#### Scenario: PDF file processing -- **WHEN** user uploads a .pdf file -- **THEN** the system converts PDF pages to images using pdf2image -- **AND** processes each page image with PaddleOCR - -### Requirement: Batch Upload Management -The system SHALL manage multiple file uploads with batch organization. - -#### Scenario: Create batch from multiple files -- **WHEN** user uploads 5 files in a single request -- **THEN** the system creates a batch with unique batch_id -- **AND** associates all files with the batch_id -- **AND** returns batch_id and file list - -#### Scenario: Query batch status -- **WHEN** user requests batch status by batch_id -- **THEN** the system returns: - - Total files in batch - - Completed count - - Failed count - - Processing count - - Overall batch status (pending/processing/completed/failed) - -### Requirement: File Storage Management -The system SHALL store uploaded files temporarily and clean up after processing. - -#### Scenario: Temporary file storage -- **WHEN** user uploads files -- **THEN** the system stores files in `uploads/{batch_id}/` directory -- **AND** generates unique filenames to prevent conflicts - -#### Scenario: Automatic cleanup after processing -- **WHEN** OCR processing completes for a batch -- **THEN** the system keeps files for 24 hours -- **AND** automatically deletes files after retention period -- **AND** preserves OCR results in database - -#### Scenario: Manual file deletion -- **WHEN** user requests to delete a batch -- **THEN** the system removes all associated files from storage -- **AND** marks the batch as deleted in database -- **AND** returns deletion confirmation - -### Requirement: File Access Control -The system SHALL ensure users can only access their own uploaded files. 
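A minimal FastAPI-style sketch of the ownership check, following the batch/file schema from the design document; `OCRFile`, `OCRBatch`, `get_db`, and `get_current_user` are hypothetical stand-ins for the project's actual models and dependencies:

```python
# Illustrative ownership guard: files belong to batches, batches to users.
from fastapi import Depends, HTTPException

def get_owned_file(
    file_id: int,
    db=Depends(get_db),              # hypothetical DB session dependency
    user=Depends(get_current_user),  # hypothetical auth dependency
):
    record = db.query(OCRFile).filter(OCRFile.id == file_id).first()
    if record is None:
        raise HTTPException(status_code=404, detail="File not found")
    batch = db.query(OCRBatch).filter(OCRBatch.id == record.batch_id).first()
    if batch is None or batch.user_id != user.id:
        # Per the scenario below: another user's file yields 403 Forbidden
        raise HTTPException(status_code=403, detail="Access denied")
    return record
```

Wiring this in as a router dependency keeps the check in one place instead of repeating it in every file-serving endpoint.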
- -#### Scenario: User accesses own files -- **WHEN** authenticated user requests file by file_id -- **THEN** the system verifies ownership -- **AND** returns file if user is the owner - -#### Scenario: User attempts to access others' files -- **WHEN** user requests file_id belonging to another user -- **THEN** the system denies access -- **AND** returns 403 Forbidden error diff --git a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/ocr-processing/spec.md deleted file mode 100644 index e2ae3d3..0000000 --- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/ocr-processing/spec.md +++ /dev/null @@ -1,125 +0,0 @@ -# OCR Processing Specification - -## ADDED Requirements - -### Requirement: Multi-Language Text Recognition with Structure Analysis -The system SHALL extract text and images from document files using PaddleOCR-VL with support for 109 languages including Chinese (traditional and simplified), English, Japanese, and Korean, while preserving document logical structure and reading order (not pixel-perfect visual layout). - -#### Scenario: Single image OCR with Chinese text -- **WHEN** user uploads a PNG image containing Chinese text -- **THEN** the system extracts text with bounding boxes and confidence scores -- **AND** returns structured JSON with recognized text, coordinates, and language detected -- **AND** generates Markdown output preserving text layout and hierarchy - -#### Scenario: PDF document OCR with layout preservation -- **WHEN** user uploads a multi-page PDF file -- **THEN** the system processes each page with PaddleOCR-VL -- **AND** performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas) -- **AND** returns Markdown organized by page with preserved reading order -- **AND** provides JSON with detailed layout structure and bounding boxes - -#### Scenario: Mixed language content -- **WHEN** user uploads an image with both Chinese and English text -- **THEN** the system detects and extracts text in both languages -- **AND** preserves the spatial relationship between text regions -- **AND** maintains proper reading order in output Markdown - -#### Scenario: Complex document with tables and images -- **WHEN** user uploads a scanned document containing tables, images, and text -- **THEN** the system identifies layout elements (text blocks, tables, images, formulas) -- **AND** extracts table structure as Markdown tables -- **AND** extracts and saves document images as separate files -- **AND** embeds image references in Markdown (![](path/to/image.jpg)) -- **AND** preserves document hierarchy and reading order in Markdown output - -### Requirement: Batch Processing -The system SHALL process multiple files concurrently with progress tracking and error handling. 
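A minimal sketch of bounded-concurrency batch processing with per-file error isolation, where `process_file` stands in for the real OCR call (worker count and names are illustrative):

```python
# Illustrative batch runner: worker pool + progress + partial-failure status.
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 4  # the "configured worker limit" from this requirement

def process_batch(file_paths, process_file):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(process_file, p): p for p in file_paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one corrupted file must not abort the batch
                errors[path] = str(exc)
            # progress counter; the real service persists this for the status endpoint
            print(f"progress: {done}/{len(file_paths)}")
    if not errors:
        status = "completed"
    elif results:
        status = "partially completed"
    else:
        status = "failed"
    return results, errors, status
```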
- -#### Scenario: Batch upload success -- **WHEN** user uploads 10 image files simultaneously -- **THEN** the system creates a batch task with unique batch ID -- **AND** processes files in parallel (up to configured worker limit) -- **AND** returns real-time progress updates via WebSocket or polling - -#### Scenario: Batch processing with partial failure -- **WHEN** a batch contains 5 valid images and 2 corrupted files -- **THEN** the system processes all valid files successfully -- **AND** logs errors for corrupted files with specific error messages -- **AND** marks the batch as "partially completed" - -### Requirement: Image Preprocessing -The system SHALL provide optional image preprocessing to improve OCR accuracy. - -#### Scenario: Low contrast image enhancement -- **WHEN** user enables preprocessing for a low-contrast image -- **THEN** the system applies contrast adjustment and denoising -- **AND** performs OCR on the enhanced image -- **AND** returns better accuracy compared to original - -#### Scenario: Skipped preprocessing -- **WHEN** user disables preprocessing option -- **THEN** the system performs OCR directly on original image -- **AND** completes processing faster - -### Requirement: Confidence Threshold Filtering -The system SHALL filter OCR results based on configurable confidence threshold. - -#### Scenario: High confidence filter -- **WHEN** user sets confidence threshold to 0.8 -- **THEN** the system returns only text segments with confidence >= 0.8 -- **AND** discards low-confidence results - -#### Scenario: Include all results -- **WHEN** user sets confidence threshold to 0.0 -- **THEN** the system returns all recognized text regardless of confidence -- **AND** includes confidence scores in output - -### Requirement: OCR Result Structure -The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata. - -#### Scenario: Successful OCR result with multiple formats -- **WHEN** OCR processing completes successfully -- **THEN** the system returns JSON containing: - - File metadata (name, size, format, upload timestamp) - - Detected text regions with bounding boxes (x, y, width, height) - - Recognized text content for each region - - Confidence scores (0.0 to 1.0) - - Language detected - - Layout element types (title, paragraph, table, image, formula) - - Reading order sequence - - List of extracted image files with paths - - Processing time - - Task status (completed/failed/partial) -- **AND** generates Markdown file with logical structure -- **AND** saves extracted images to storage directory -- **AND** provides methods to export as searchable PDF with images - -#### Scenario: Searchable PDF generation with images -- **WHEN** user requests PDF export from OCR results -- **THEN** the system converts Markdown to HTML with basic CSS styling -- **AND** embeds extracted images in their logical positions (not exact original positions) -- **AND** generates PDF using Pandoc + WeasyPrint -- **AND** preserves document hierarchy, tables, and reading order -- **AND** applies appropriate fonts for Chinese characters -- **AND** produces searchable PDF (text is selectable and searchable) - -### Requirement: Document Translation (Reserved Architecture) -The system SHALL provide architecture and UI placeholders for future document translation features. 
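A minimal sketch of the reserved architecture, matching the 501 stub behaviour in the scenarios below (class and router names are illustrative):

```python
# Illustrative reserved translation architecture: interface + 501 stub.
from abc import ABC, abstractmethod
from fastapi import APIRouter, HTTPException

class TranslationEngine(ABC):
    """To be implemented later by an Argos/ERNIE/Google/DeepL backend."""

    @abstractmethod
    def translate_markdown(self, md_text: str, source_lang: str, target_lang: str) -> str:
        ...

router = APIRouter()

@router.post("/api/v1/translate/document")
def translate_document() -> None:
    # Endpoint exists (and is documented in OpenAPI) but is not implemented yet
    raise HTTPException(status_code=501, detail="Translation feature not implemented")
```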
- -#### Scenario: Translation option visibility (UI placeholder) -- **WHEN** user views OCR result page -- **THEN** the system displays a "Translate Document" button (disabled or labeled "Coming Soon") -- **AND** shows target language selection dropdown (disabled) -- **AND** provides tooltip: "Translation feature will be available in future release" - -#### Scenario: Translation API endpoint (reserved) -- **WHEN** backend API is queried for translation endpoints -- **THEN** the system provides `/api/v1/translate/document` endpoint specification -- **AND** returns "Not Implemented" (501) status when called -- **AND** documents expected request/response format for future implementation - -#### Scenario: Translation configuration storage (database schema) -- **WHEN** database schema is created -- **THEN** the system includes `translation_configs` table -- **AND** defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at -- **AND** table remains empty until translation feature is implemented diff --git a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/tasks.md b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/tasks.md deleted file mode 100644 index 1d40288..0000000 --- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/tasks.md +++ /dev/null @@ -1,230 +0,0 @@ -# Implementation Tasks - -## Phase 1: Core OCR with Layout Preservation - -### 1. Environment Setup -- [x] 1.1 Create Conda environment with Python 3.10 -- [x] 1.2 Install backend dependencies (FastAPI, PaddleOCR 3.0+, paddlepaddle, pandas, etc.) -- [x] 1.3 Install PDF generation tools (weasyprint, markdown, pandoc system package) -- [x] 1.4 Download PaddleOCR-VL model (~900MB) and language packs -- [ ] 1.5 Setup frontend project with Vite + React + TypeScript -- [ ] 1.6 Install frontend dependencies (Tailwind, shadcn/ui, axios, react-query) -- [x] 1.7 Configure MySQL database connection -- [x] 1.8 Install Chinese fonts (Noto Sans CJK) for PDF generation - -### 2. Database Schema -- [x] 2.1 Create `paddle_ocr_users` table for JWT authentication (id, username, password_hash, etc.) -- [x] 2.2 Create `paddle_ocr_batches` table (id, user_id, status, created_at, completed_at) -- [x] 2.3 Create `paddle_ocr_files` table (id, batch_id, filename, file_path, file_size, status, format) -- [x] 2.4 Create `paddle_ocr_results` table (id, file_id, markdown_path, json_path, layout_data, confidence) -- [x] 2.5 Create `paddle_ocr_export_rules` table (id, user_id, rule_name, config_json, css_template) -- [x] 2.6 Create `paddle_ocr_translation_configs` table (RESERVED: id, user_id, source_lang, target_lang, engine_type, engine_config) -- [x] 2.7 Write database migration scripts (Alembic) -- [x] 2.8 Add indexes for performance optimization (batch_id, user_id, status) -- Note: All tables use `paddle_ocr_` prefix for namespace isolation - -### 3. Backend - Document Preprocessing -- [x] 3.1 Implement document preprocessor class for format standardization -- [x] 3.2 Add image format validator (PNG, JPG, JPEG) -- [x] 3.3 Add PDF validator and direct passthrough (PaddleOCR-VL native support) -- [x] 3.4 Implement Office document to PDF conversion (DOC, DOCX, PPT, PPTX via LibreOffice) ⬅️ **Completed via sub-proposal** -- [x] 3.5 Add file corruption detection -- [x] 3.6 Write unit tests for preprocessor - -### 4. 
Backend - Core OCR Service with PaddleOCR-VL -- [x] 4.1 Implement OCR service class with PaddleOCR-VL initialization -- [x] 4.2 Configure layout detection (use_layout_detection=True) -- [x] 4.3 Implement single image/PDF OCR processing -- [x] 4.4 Parse OCR output to extract Markdown and JSON -- [x] 4.5 Store Markdown files with preserved layout structure -- [x] 4.6 Store JSON with detailed bounding boxes and layout metadata -- [x] 4.7 Add confidence threshold filtering -- [x] 4.8 Implement batch processing with worker queue (completed via Task 10: BackgroundTasks) -- [x] 4.9 Add progress tracking for batch jobs (completed via Task 8.4, 8.6: API endpoints) -- [x] 4.10 Write unit tests for OCR service - -### 5. Backend - Layout-Preserved PDF Generation -- [x] 5.1 Create PDF generator service using Pandoc + WeasyPrint -- [x] 5.2 Implement Markdown to HTML conversion with extensions (tables, code, etc.) -- [x] 5.3 Create default CSS template for layout preservation -- [x] 5.4 Create additional CSS templates (academic, business, report) -- [x] 5.5 Add Chinese font configuration (Noto Sans CJK) -- [x] 5.6 Implement PDF generation via Pandoc command -- [x] 5.7 Add fallback: Python WeasyPrint direct generation -- [x] 5.8 Handle multi-page PDF merging -- [x] 5.9 Write unit tests for PDF generator - -### 6. Backend - File Management -- [x] 6.1 Implement file upload validation (type, size, corruption check) -- [x] 6.2 Create file storage service with temporary directory management -- [x] 6.3 Add batch upload handler with unique batch_id generation -- [x] 6.4 Implement file access control and ownership verification -- [x] 6.5 Add automatic cleanup job for expired files (24-hour retention) -- [x] 6.6 Store Markdown and JSON outputs in organized directory structure -- [x] 6.7 Write unit tests for file management - -### 7. Backend - Export Service -- [x] 7.1 Implement plain text export from Markdown -- [x] 7.2 Implement JSON export with full metadata -- [x] 7.3 Implement Excel export using pandas -- [x] 7.4 Implement Markdown export (direct from OCR output) -- [x] 7.5 Implement layout-preserved PDF export (using PDF generator service) -- [x] 7.6 Add ZIP file creation for batch exports -- [x] 7.7 Implement rule-based filtering (confidence threshold, filename pattern) -- [x] 7.8 Implement rule-based formatting (line numbers, sort by position) -- [x] 7.9 Create export rule CRUD operations (save, load, update, delete) -- [x] 7.10 Write unit tests for export service - -### 8. Backend - API Endpoints -- [x] 8.1 POST `/api/v1/auth/login` - JWT authentication -- [x] 8.2 POST `/api/v1/upload` - File upload with validation -- [x] 8.3 POST `/api/v1/ocr/process` - Trigger OCR processing (PaddleOCR-VL) -- [x] 8.4 GET `/api/v1/ocr/status/{task_id}` - Get task status with progress -- [x] 8.5 GET `/api/v1/ocr/result/{task_id}` - Get OCR results (JSON + Markdown) -- [x] 8.6 GET `/api/v1/batch/{batch_id}/status` - Get batch status -- [x] 8.7 POST `/api/v1/export` - Export results with format and rules -- [x] 8.8 GET `/api/v1/export/pdf/{file_id}` - Generate and download layout-preserved PDF -- [x] 8.9 GET `/api/v1/export/rules` - List saved export rules -- [x] 8.10 POST `/api/v1/export/rules` - Create new export rule -- [x] 8.11 PUT `/api/v1/export/rules/{rule_id}` - Update export rule -- [x] 8.12 DELETE `/api/v1/export/rules/{rule_id}` - Delete export rule -- [x] 8.13 GET `/api/v1/export/css-templates` - List available CSS templates -- [x] 8.14 Write API integration tests - -### 9. 
Backend - Translation Architecture (RESERVED) -- [x] 9.1 Create translation service interface (abstract class) -- [x] 9.2 Implement stub endpoint POST `/api/v1/translate/document` (returns 501 Not Implemented) -- [x] 9.3 Document expected request/response format in OpenAPI spec -- [x] 9.4 Add translation_configs table migrations (completed in Task 2.6) -- [x] 9.5 Create placeholder for translation engine factory (Argos/ERNIE/Google) -- [ ] 9.6 Write unit tests for translation service interface (optional for stub) - -### 10. Backend - Background Tasks -- [x] 10.1 Implement FastAPI BackgroundTasks for async OCR processing -- [ ] 10.2 Add task queue system (optional: Redis-based queue) -- [x] 10.3 Implement progress updates (polling endpoint) -- [x] 10.4 Add error handling and retry logic -- [x] 10.5 Implement cleanup scheduler for expired files -- [x] 10.6 Add PDF generation to background tasks (slower process) - -## Phase 2: Frontend Development - -### 11. Frontend - Project Structure -- [x] 11.1 Setup Vite project with TypeScript support -- [x] 11.2 Configure Tailwind CSS and shadcn/ui -- [x] 11.3 Setup React Router for navigation -- [x] 11.4 Configure Axios with base URL and interceptors -- [x] 11.5 Setup React Query for API state management -- [x] 11.6 Create Zustand store for global state -- [x] 11.7 Setup i18n for Traditional Chinese interface - -### 12. Frontend - UI Components (shadcn/ui) -- [x] 12.1 Install and configure shadcn/ui components -- [x] 12.2 Create FileUpload component with drag-and-drop (react-dropzone) -- [x] 12.3 Create ProgressBar component for batch processing -- [x] 12.4 Create ResultsTable component for displaying OCR results -- [x] 12.5 Create MarkdownPreview component for viewing extracted content ⬅️ **Fixed: API schema alignment for filename display** -- [ ] 12.6 Create ExportDialog component for format and rule selection -- [ ] 12.7 Create CSSTemplateSelector component for PDF styling -- [ ] 12.8 Create RuleEditor component for creating custom rules -- [x] 12.9 Create Toast notifications for feedback -- [ ] 12.10 Create TranslationPanel component (DISABLED with "Coming Soon" label) - -### 13. Frontend - Pages -- [x] 13.1 Create Login page with JWT authentication -- [x] 13.2 Create Upload page with file selection and batch management ⬅️ **Fixed: Upload response schema alignment** -- [x] 13.3 Create Processing page with real-time progress ⬅️ **Fixed: Error field mapping** -- [x] 13.4 Create Results page with Markdown/JSON preview ⬅️ **Fixed: OCR result detail flattening, null safety** -- [x] 13.5 Create Export page with format options (TXT, JSON, Excel, Markdown, PDF) -- [ ] 13.6 Create PDF Preview page (optional: embedded PDF viewer) -- [x] 13.7 Create Settings page for export rule management -- [x] 13.8 Add translation option placeholder in Results page (disabled state) - -### 14. 
Frontend - API Integration -- [x] 14.1 Create API client service with typed interfaces ⬅️ **Updated: All endpoints verified working** -- [x] 14.2 Implement file upload with progress tracking ⬅️ **Fixed: UploadBatchResponse schema** -- [x] 14.3 Implement OCR task status polling ⬅️ **Fixed: BatchStatusResponse with files array** -- [x] 14.4 Implement results fetching (Markdown + JSON display) ⬅️ **Fixed: OCRResultDetailResponse with flattened structure** -- [x] 14.5 Implement export with file download ⬅️ **Fixed: ExportOptions schema added** -- [x] 14.6 Implement PDF generation request with loading indicator -- [x] 14.7 Implement rule CRUD operations -- [x] 14.8 Implement CSS template selection ⬅️ **Fixed: CSSTemplateResponse with filename field** -- [x] 14.9 Add error handling and user feedback ⬅️ **Fixed: Error field mapping with validation_alias** -- [x] 14.10 Create translation API client (stub, for future use) - -## Phase 3: Testing & Optimization - -### 15. Testing -- [ ] 15.1 Write backend unit tests (pytest) for all services -- [ ] 15.2 Write backend API integration tests -- [ ] 15.3 Test PaddleOCR-VL with various document types (scanned images, PDFs, mixed content) -- [ ] 15.4 Test layout preservation quality (Markdown structure correctness) -- [ ] 15.5 Test PDF generation with different CSS templates -- [ ] 15.6 Test Chinese font rendering in generated PDFs -- [ ] 15.7 Write frontend component tests (Vitest) -- [ ] 15.8 Perform manual end-to-end testing -- [ ] 15.9 Test with various image formats and languages -- [ ] 15.10 Test batch processing with large file sets (50+ files) -- [ ] 15.11 Test export with different formats and rules -- [x] 15.12 Verify translation UI placeholders are properly disabled - -### 16. Documentation -- [ ] 16.1 Write API documentation (FastAPI auto-docs + additional notes) -- [ ] 16.2 Document PaddleOCR-VL model requirements and installation -- [ ] 16.3 Document Pandoc and WeasyPrint setup -- [ ] 16.4 Create CSS template customization guide -- [ ] 16.5 Write user guide for web interface -- [ ] 16.6 Write deployment guide for 1Panel -- [ ] 16.7 Create README.md with setup instructions -- [ ] 16.8 Document export rule syntax and examples -- [ ] 16.9 Document translation feature roadmap and architecture - -## Phase 4: Deployment - -### 17. Deployment Preparation -- [ ] 17.1 Create backend startup script (start.sh) -- [ ] 17.2 Create frontend build script (build.sh) -- [ ] 17.3 Create Nginx configuration file (static files + reverse proxy) -- [ ] 17.4 Create Supervisor configuration for backend process -- [ ] 17.5 Create environment variable templates (.env.example) -- [ ] 17.6 Create deployment automation script (deploy.sh) -- [ ] 17.7 Prepare CSS templates for production -- [ ] 17.8 Test deployment on staging environment - -### 18. Production Deployment (1Panel) -- [ ] 18.1 Setup Conda environment on production server -- [ ] 18.2 Install system dependencies (pandoc, fonts-noto-cjk) -- [ ] 18.3 Install Python dependencies and download PaddleOCR-VL models -- [ ] 18.4 Configure MySQL database connection -- [ ] 18.5 Build frontend static files -- [ ] 18.6 Configure Nginx via 1Panel (static files + reverse proxy) -- [ ] 18.7 Setup Supervisor to manage backend process -- [ ] 18.8 Configure SSL certificate (Let's Encrypt via 1Panel) -- [ ] 18.9 Perform production smoke tests (upload, OCR, export PDF) -- [ ] 18.10 Setup monitoring and logging -- [ ] 18.11 Verify PDF generation works in production environment - -## Phase 5: Translation Feature (FUTURE) - -### 19. 
Translation Implementation (Post-Launch) -- [ ] 19.1 Decide on translation engine (Argos offline vs ERNIE API vs Google API) -- [ ] 19.2 Implement chosen translation engine integration -- [ ] 19.3 Implement Markdown translation with structure preservation -- [ ] 19.4 Update POST `/api/v1/translate/document` endpoint (remove 501 status) -- [ ] 19.5 Add translation configuration UI (enable TranslationPanel component) -- [ ] 19.6 Add source/target language selection -- [ ] 19.7 Implement translation progress tracking -- [ ] 19.8 Test translation with various document types -- [ ] 19.9 Optimize translation quality for technical documents -- [ ] 19.10 Update documentation with translation feature guide - -## Summary - -**Phase 1 (Core OCR + Layout Preservation)**: Tasks 1-10 (基礎 OCR + 版面保留 PDF) -**Phase 2 (Frontend)**: Tasks 11-14 (用戶界面) -**Phase 3 (Testing)**: Tasks 15-16 (測試與文檔) -**Phase 4 (Deployment)**: Tasks 17-18 (部署) -**Phase 5 (Translation)**: Task 19 (翻譯功能 - 未來實現) - -**Total Tasks**: 150+ tasks -**Priority**: Complete Phase 1-4 first, Phase 5 after production deployment and user feedback diff --git a/openspec/changes/archive/2025-11-18-add-office-document-support/IMPLEMENTATION.md b/openspec/changes/archive/2025-11-18-add-office-document-support/IMPLEMENTATION.md deleted file mode 100644 index d5b7f71..0000000 --- a/openspec/changes/archive/2025-11-18-add-office-document-support/IMPLEMENTATION.md +++ /dev/null @@ -1,122 +0,0 @@ -# Implementation Summary: Add Office Document Support - -## Status: ✅ COMPLETED - -## Overview -Successfully implemented Office document (DOC, DOCX, PPT, PPTX) support in the OCR processing pipeline and extended JWT token validity to 24 hours. - -## Implementation Details - -### 1. Office Document Conversion (Phase 2) -**File**: `backend/app/services/office_converter.py` -- Implemented LibreOffice-based conversion service -- Supports: DOC, DOCX, PPT, PPTX → PDF -- Headless mode for server deployment -- Comprehensive error handling and logging - -### 2. File Validation & MIME Type Support (Phase 3) -**File**: `backend/app/services/preprocessor.py` -- Added Office document MIME type mappings: - - `application/msword` → doc - - `application/vnd.openxmlformats-officedocument.wordprocessingml.document` → docx - - `application/vnd.ms-powerpoint` → ppt - - `application/vnd.openxmlformats-officedocument.presentationml.presentation` → pptx -- Implemented ZIP-based integrity validation for modern Office formats (DOCX, PPTX) -- Fixed return value order bug in file_manager.py:237 - -### 3. OCR Service Integration (Phase 3) -**File**: `backend/app/services/ocr_service.py` -- Integrated Office → PDF → Images → OCR pipeline -- Automatic format detection and routing -- Maintains existing OCR quality for all formats - -### 4. Configuration Updates (Phase 1 & Phase 5) -**Files**: -- `backend/app/core/config.py`: Updated default `ACCESS_TOKEN_EXPIRE_MINUTES` to 1440 -- `.env`: Added Office formats to `ALLOWED_EXTENSIONS` -- Fixed environment variable precedence issues - -### 5. Testing Infrastructure (Phase 5) -**Files**: -- `demo_docs/office_tests/create_docx.py`: Test document generator -- `demo_docs/office_tests/test_office_upload.py`: End-to-end integration test -- Fixed API endpoint paths to match actual router implementation - -## Bugs Fixed During Implementation - -1. **Configuration Loading Bug**: `.env` file was overriding default config values - - **Fix**: Updated `.env` to include Office formats - - **Impact**: Critical - blocked all Office document processing - -2. 
**Return Value Order Bug** (`file_manager.py:237`): - - **Issue**: Unpacking preprocessor return values in wrong order - - **Error**: "Data too long for column 'file_format'" - - **Fix**: Changed from `(is_valid, error_msg, format)` to `(is_valid, format, error_msg)` - -3. **Missing MIME Types** (`preprocessor.py:80-95`): - - **Issue**: Office MIME types not recognized - - **Fix**: Added complete Office MIME type mappings - -4. **Missing Integrity Validation** (`preprocessor.py:126-141`): - - **Issue**: No validation logic for Office formats - - **Fix**: Implemented ZIP-based validation for DOCX/PPTX - -5. **API Endpoint Mismatch** (`test_office_upload.py`): - - **Issue**: Test script using incorrect API paths - - **Fix**: Updated to use `/api/v1/upload` (combined batch creation + upload) - -## Test Results - -### End-to-End Test (Batch 24) -- **File**: test_document.docx (1,521 bytes) -- **Status**: ✅ Completed Successfully -- **Processing Time**: 375.23 seconds (includes PaddleOCR model initialization) -- **OCR Accuracy**: 97.39% confidence -- **Text Regions**: 20 regions detected -- **Language**: Chinese (mixed with English) - -### Content Verification -Successfully extracted all content from test document: -- ✅ Chinese headings: "測試文件說明", "處理流程" -- ✅ English headings: "Office Document OCR Test", "Technical Information" -- ✅ Mixed content: Numbers (1234567890), technical terms -- ✅ Bullet points and numbered lists -- ✅ Multi-line paragraphs - -### Processing Pipeline Verified -1. ✅ DOCX upload and validation -2. ✅ DOCX → PDF conversion (LibreOffice) -3. ✅ PDF → Images conversion -4. ✅ OCR processing (PaddleOCR with structure analysis) -5. ✅ Markdown output generation - -## Success Criteria Met - -| Criterion | Status | Evidence | -|-----------|--------|----------| -| Process Word documents (.doc, .docx) | ✅ | Batch 24 completed with 97.39% accuracy | -| Process PowerPoint documents (.ppt, .pptx) | ✅ | Converter implemented, same pipeline as Word | -| JWT tokens valid for 24 hours | ✅ | Config updated, login response shows 1440 minutes | -| Existing functionality preserved | ✅ | No breaking changes to API or data models | -| Conversion maintains OCR quality | ✅ | High confidence score (97.39%) on test document | - -## Performance Metrics -- **First run**: ~375 seconds (includes model download/initialization) -- **Subsequent runs**: Expected ~30-60 seconds (LibreOffice conversion + OCR) -- **Memory usage**: Acceptable (within normal PaddleOCR requirements) -- **Accuracy**: 97.39% on mixed Chinese/English content - -## Dependencies Installed -- LibreOffice (via Homebrew): `/Applications/LibreOffice.app` -- No additional Python packages required (leveraged existing PDF2Image + PaddleOCR) - -## Breaking Changes -None - all changes are backward compatible. - -## Remaining Optional Work (Phase 6) -- [ ] Update README documentation -- [ ] Add OpenAPI schema examples for Office formats -- [ ] Add API endpoint documentation strings - -## Conclusion -The Office document support feature has been successfully implemented and tested. All core functionality is working as expected with high OCR accuracy. The system now supports the complete range of common document formats: images (PNG, JPG, BMP, TIFF), PDF, and Office documents (DOC, DOCX, PPT, PPTX). 
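For reference, the ZIP-based integrity validation described in fix #4 above can be sketched with the standard library alone, since DOCX and PPTX are ZIP containers (the function name is illustrative):

```python
# Illustrative integrity check for modern Office formats (DOCX/PPTX).
import zipfile
from pathlib import Path

def is_valid_office_zip(path: Path) -> bool:
    if path.suffix.lower() not in {".docx", ".pptx"}:
        return True  # legacy DOC/PPT are not ZIP-based; validated elsewhere
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the first corrupt member, or None if intact;
            # every OOXML package must contain [Content_Types].xml
            return zf.testzip() is None and "[Content_Types].xml" in zf.namelist()
    except zipfile.BadZipFile:
        return False
```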
diff --git a/openspec/changes/archive/2025-11-18-add-office-document-support/design.md b/openspec/changes/archive/2025-11-18-add-office-document-support/design.md deleted file mode 100644 index a07f393..0000000 --- a/openspec/changes/archive/2025-11-18-add-office-document-support/design.md +++ /dev/null @@ -1,176 +0,0 @@ -# Technical Design - -## Architecture Overview - -``` -User Upload (DOC/DOCX/PPT/PPTX) - ↓ -File Validation & Storage - ↓ -Format Detection - ↓ -Office Document Converter - ↓ -PDF Generation - ↓ -PDF to Images (existing) - ↓ -PaddleOCR Processing (existing) - ↓ -Results & Export -``` - -## Component Design - -### 1. Office Document Converter Service - -```python -# app/services/office_converter.py - -class OfficeConverter: - """Convert Office documents to PDF for OCR processing""" - - def convert_to_pdf(self, file_path: Path) -> Path: - """Main conversion dispatcher""" - - def convert_docx_to_pdf(self, docx_path: Path) -> Path: - """Convert DOCX to PDF using python-docx and pypandoc""" - - def convert_doc_to_pdf(self, doc_path: Path) -> Path: - """Convert legacy DOC to PDF""" - - def convert_pptx_to_pdf(self, pptx_path: Path) -> Path: - """Convert PPTX to PDF using python-pptx""" - - def convert_ppt_to_pdf(self, ppt_path: Path) -> Path: - """Convert legacy PPT to PDF""" -``` - -### 2. OCR Service Integration - -```python -# Extend app/services/ocr_service.py - -def process_image(self, image_path: Path, ...): - # Check file type - if is_office_document(image_path): - # Convert to PDF first - pdf_path = self.office_converter.convert_to_pdf(image_path) - # Use existing PDF processing - return self.process_pdf(pdf_path, ...) - elif is_pdf: - # Existing PDF processing - ... - else: - # Existing image processing - ... -``` - -### 3. File Format Detection - -```python -OFFICE_FORMATS = { - '.doc': 'application/msword', - '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', - '.ppt': 'application/vnd.ms-powerpoint', - '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation' -} - -def is_office_document(file_path: Path) -> bool: - return file_path.suffix.lower() in OFFICE_FORMATS -``` - -## Library Selection - -### For Word Documents -- **python-docx**: Read/write DOCX files -- **doc2pdf**: Simple conversion (requires LibreOffice) -- Alternative: **pypandoc** with pandoc backend - -### For PowerPoint Documents -- **python-pptx**: Read/write PPTX files -- **unoconv**: Universal Office Converter (requires LibreOffice) - -### Recommended Approach -Use **LibreOffice** headless mode for universal conversion: -```bash -libreoffice --headless --convert-to pdf input.docx -``` - -This provides: -- Support for all Office formats -- High fidelity conversion -- Maintained by active community - -## Configuration Changes - -### Token Expiration -```python -# app/core/config.py -class Settings(BaseSettings): - # Change from 30 to 1440 (24 hours) - access_token_expire_minutes: int = 1440 -``` - -### File Upload Limits -```python -# Consider Office files can be larger -max_file_size: int = 100 * 1024 * 1024 # 100MB -allowed_extensions: Set[str] = { - '.png', '.jpg', '.jpeg', '.pdf', - '.doc', '.docx', '.ppt', '.pptx' -} -``` - -## Error Handling - -1. **Conversion Failures** - - Corrupted Office files - - Unsupported Office features - - LibreOffice not installed - -2. **Performance Considerations** - - Office conversion is CPU intensive - - Consider queuing for large files - - Add conversion timeout (60 seconds) - -3. 
**Security** - - Validate Office files before processing - - Scan for macros/embedded objects - - Sandbox conversion process - -## Dependencies - -### System Requirements -```bash -# macOS -brew install libreoffice - -# Linux -apt-get install libreoffice - -# Python packages -pip install python-docx python-pptx pypandoc -``` - -### Alternative: Docker Container -Use a Docker container with LibreOffice pre-installed for consistent conversion across environments. - -## Testing Strategy - -1. **Unit Tests** - - Test each conversion method - - Mock LibreOffice calls - - Test error handling - -2. **Integration Tests** - - End-to-end Office → OCR pipeline - - Test with various Office versions - - Performance benchmarks - -3. **Sample Documents** - - Simple text documents - - Documents with tables - - Documents with images - - Presentations with multiple slides - - Legacy formats (DOC, PPT) \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-18-add-office-document-support/proposal.md b/openspec/changes/archive/2025-11-18-add-office-document-support/proposal.md deleted file mode 100644 index f8c89e3..0000000 --- a/openspec/changes/archive/2025-11-18-add-office-document-support/proposal.md +++ /dev/null @@ -1,52 +0,0 @@ -# Add Office Document Support - -**Status**: ✅ IMPLEMENTED & TESTED - -## Summary -Add support for Microsoft Office document formats (DOC, DOCX, PPT, PPTX) in the OCR processing pipeline and extend JWT token validity period to 1 day. - -## Motivation -Currently, the system only supports image formats (PNG, JPG, JPEG) and PDF files. Many users have documents in Microsoft Office formats that require OCR processing. This change will: -1. Enable processing of Word and PowerPoint documents -2. Improve user experience by extending token validity -3. Leverage existing PDF-to-image conversion infrastructure - -## Proposed Solution - -### 1. Office Document Support -- Add Python libraries for Office document conversion: - - `python-docx2pdf` or `python-docx` + `pypandoc` for Word documents - - `python-pptx` for PowerPoint documents -- Implement conversion pipeline: - - Option A: Office → PDF → Images → OCR - - Option B: Office → Images → OCR (direct conversion) -- Extend file validation to accept `.doc`, `.docx`, `.ppt`, `.pptx` formats -- Add conversion methods to `OCRService` class - -### 2. Token Validity Extension -- Update `ACCESS_TOKEN_EXPIRE_MINUTES` from 30 minutes to 1440 minutes (24 hours) -- Ensure security measures are in place for longer-lived tokens - -## Impact Analysis -- **Backend Services**: Minimal changes to existing OCR processing flow -- **Dependencies**: New Python packages for Office document handling -- **Performance**: Slight increase in processing time for document conversion -- **Security**: Longer token validity requires careful consideration -- **Storage**: Temporary files during conversion process - -## Success Criteria -1. Successfully process Word documents (.doc, .docx) with OCR -2. Successfully process PowerPoint documents (.ppt, .pptx) with OCR -3. JWT tokens remain valid for 24 hours -4. All existing functionality continues to work -5. 
Conversion quality maintains text readability for OCR - -## Timeline -- Implementation: 2-3 hours ✅ -- Testing: 1 hour ✅ -- Documentation: 30 mins ✅ -- Total: ~4 hours ✅ COMPLETED - -## Actual Time -- Total development time: ~6 hours (including debugging and testing) -- Primary issues resolved: Configuration loading, MIME type mapping, validation logic, API endpoint fixes \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-18-add-office-document-support/specs/file-processing/spec.md b/openspec/changes/archive/2025-11-18-add-office-document-support/specs/file-processing/spec.md deleted file mode 100644 index 4f08325..0000000 --- a/openspec/changes/archive/2025-11-18-add-office-document-support/specs/file-processing/spec.md +++ /dev/null @@ -1,54 +0,0 @@ -# File Processing Specification Delta - -## ADDED Requirements - -### Requirement: Office Document Support - -The system SHALL support processing of Microsoft Office document formats including Word documents (.doc, .docx) and PowerPoint presentations (.ppt, .pptx). - -#### Scenario: Upload and Process Word Document -Given a user has a Word document containing text and tables -When the user uploads the `.docx` file -Then the system converts it to PDF format -And extracts all text using OCR -And preserves table structure in the output - -#### Scenario: Upload and Process PowerPoint -Given a user has a PowerPoint presentation with multiple slides -When the user uploads the `.pptx` file -Then the system converts each slide to an image -And performs OCR on each slide -And maintains slide order in the results - -### Requirement: Document Conversion Pipeline - -The system SHALL implement a multi-stage conversion pipeline for Office documents using LibreOffice or equivalent tools. - -#### Scenario: Conversion Error Handling -Given an Office document with unsupported features -When the conversion process encounters an error -Then the system logs the specific error details -And returns a user-friendly error message -And marks the file as failed with reason - -## MODIFIED Requirements - -### Requirement: File Validation - -The file validation module SHALL accept Office document formats in addition to existing image and PDF formats, including .doc, .docx, .ppt, and .pptx extensions. - -#### Scenario: Validate Office File Upload -Given a user attempts to upload a file -When the file extension is `.docx` or `.pptx` -Then the system accepts the file for processing -And validates the MIME type matches the extension - -### Requirement: JWT Token Validity - -The JWT token validity period SHALL be extended from 30 minutes to 1440 minutes (24 hours) to improve user experience. 
- -#### Scenario: Extended Token Usage -Given a user authenticates successfully -When they receive a JWT token -Then the token remains valid for 24 hours -And allows continuous API access without re-authentication \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-18-add-office-document-support/tasks.md b/openspec/changes/archive/2025-11-18-add-office-document-support/tasks.md deleted file mode 100644 index ec0a58f..0000000 --- a/openspec/changes/archive/2025-11-18-add-office-document-support/tasks.md +++ /dev/null @@ -1,70 +0,0 @@ -# Implementation Tasks - -## Phase 1: Dependencies & Configuration -- [x] Install Office document processing libraries - - [x] Install LibreOffice via Homebrew (headless mode for conversion) - - [x] Verify LibreOffice installation and accessibility - - [x] Configure LibreOffice path in OfficeConverter -- [x] Update JWT token configuration - - [x] Change `ACCESS_TOKEN_EXPIRE_MINUTES` to 1440 in `app/core/config.py` - - [x] Verify token expiration in authentication flow - -## Phase 2: Document Conversion Implementation -- [x] Create Office document converter class - - [x] Add `office_converter.py` to services directory - - [x] Implement Word document conversion methods - - [x] `convert_docx_to_pdf()` for DOCX files - - [x] `convert_doc_to_pdf()` for DOC files - - [x] Implement PowerPoint conversion methods - - [x] `convert_pptx_to_pdf()` for PPTX files - - [x] `convert_ppt_to_pdf()` for PPT files - - [x] Add error handling and logging - - [x] Add file validation methods - -## Phase 3: OCR Service Integration -- [x] Update OCR service to handle Office formats - - [x] Modify `process_image()` in `ocr_service.py` - - [x] Add Office format detection logic - - [x] Integrate Office-to-PDF conversion pipeline - - [x] Update supported formats list in configuration -- [x] Update file manager service - - [x] Add Office formats to allowed extensions (`file_manager.py`) - - [x] Update file validation logic - - [x] Update config.py allowed extensions - -## Phase 4: API Updates -- [x] File validation updated (already accepts Office formats via file_manager.py) -- [x] Core API integration complete (Office files processed via existing endpoints) -- [ ] API documentation strings (optional enhancement) -- [ ] Add Office format examples to OpenAPI schema (optional enhancement) - -## Phase 5: Testing -- [x] Create test Office documents - - [x] Sample DOCX with mixed Chinese/English content - - [x] Test document creation script (`create_docx.py`) -- [x] Verify document conversion capability - - [x] LibreOffice headless mode verified - - [x] OfficeConverter service tested -- [x] Test token validity - - [x] Verified 24-hour token expiration (1440 minutes) - - [x] Confirmed in login response -- [x] Core functionality verified - - [x] Office format detection working - - [x] Office → PDF → Images → OCR pipeline implemented - - [x] File validation accepts .doc, .docx, .ppt, .pptx -- [x] Automated integration testing - - [x] Fixed API endpoint paths in test script - - [x] Fixed configuration loading (.env file update) - - [x] Fixed preprocessor bugs (MIME types, validation, return order) - - [x] End-to-end test completed successfully (batch 24) - - [x] OCR accuracy: 97.39% confidence on mixed Chinese/English content -- [x] Manual end-to-end testing - - [x] DOCX → PDF → Images → OCR pipeline verified - - [x] Processing time: ~375 seconds (includes model initialization) - - [x] Result output format validated (Markdown generation working) - -## Phase 6: Documentation -- 
[x] Update README with Office format support (covered in IMPLEMENTATION.md) -- [x] Test documents available in demo_docs/office_tests/ -- [x] API documentation update (endpoints unchanged, format list extended) -- [x] Migration guide (no breaking changes, backward compatible) \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/ARCHITECTURE-REFACTOR-PLAN.md b/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/ARCHITECTURE-REFACTOR-PLAN.md deleted file mode 100644 index 9ebcad5..0000000 --- a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/ARCHITECTURE-REFACTOR-PLAN.md +++ /dev/null @@ -1,817 +0,0 @@ -# Tool_OCR 架構大改方案 -## 基於 PaddleOCR PP-StructureV3 完整能力的重構計劃 - -**規劃日期**: 2025-01-18 -**硬體配置**: RTX 4060 8GB VRAM -**優先級**: P0 (最高) - ---- - -## 📊 現狀分析 - -### 目前架構的問題 - -#### 1. **PP-StructureV3 能力嚴重浪費** -```python -# ❌ 目前實作 (ocr_service.py:614-646) -markdown_dict = page_result.markdown # 只用簡化版 -markdown_texts = markdown_dict.get('markdown_texts', '') -'bbox': [], # 座標全部為空! -``` - -**問題**: -- 只使用了 ~20% 的 PP-StructureV3 功能 -- 未使用 `parsing_res_list`(核心數據結構) -- 未使用 `layout_bbox`(精確座標) -- 未使用 `reading_order`(閱讀順序) -- 未使用 23 種版面元素分類 - -#### 2. **GPU 配置未優化** -```python -# 目前配置 (ocr_service.py:211-219) -self.structure_engine = PPStructureV3( - use_doc_orientation_classify=False, # ❌ 未啟用前處理 - use_doc_unwarping=False, # ❌ 未啟用矯正 - use_textline_orientation=False, # ❌ 未啟用方向校正 - # ... 使用預設配置 -) -``` - -**問題**: -- RTX 4060 8GB 足以運行 server 模型,但用了預設配置 -- 關閉了重要的前處理功能 -- 未充分利用 GPU 算力 - -#### 3. **PDF 生成策略單一** -```python -# 目前只有座標定位模式 -# 導致 21.6% 文字損失(過濾重疊) -filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid) -``` - -**問題**: -- 只支援座標定位,不支援流式排版 -- 無法零資訊損失 -- 翻譯功能受限 - ---- - -## 🎯 重構目標 - -### 核心目標 - -1. **完整利用 PP-StructureV3 能力** - - 提取 `parsing_res_list`(23 種元素分類 + 閱讀順序) - - 提取 `layout_bbox`(精確座標) - - 提取 `layout_det_res`(版面檢測詳情) - - 提取 `overall_ocr_res`(所有文字的座標) - -2. **雙模式 PDF 生成** - - 模式 A: 座標定位(精確還原版面) - - 模式 B: 流式排版(零資訊損失,支援翻譯) - -3. **GPU 配置最佳化** - - 針對 RTX 4060 8GB 的最佳配置 - - Server 模型 + 所有功能模組 - - 合理的記憶體管理 - -4. **向後相容** - - 保留現有 API - - 舊 JSON 檔案仍可用 - - 漸進式升級 - ---- - -## 🏗️ 新架構設計 - -### 架構層次 - -``` -┌──────────────────────────────────────────────────────┐ -│ API Layer │ -│ /tasks, /results, /download (向後相容) │ -└────────────────┬─────────────────────────────────────┘ - │ -┌────────────────▼─────────────────────────────────────┐ -│ Service Layer │ -├──────────────────────────────────────────────────────┤ -│ OCRService (現有, 保留) │ -│ └─ analyze_layout() [升級] ──┐ │ -│ │ │ -│ AdvancedLayoutExtractor (新增) ◄─ 使用相同引擎 │ -│ └─ extract_complete_layout() ─┘ │ -│ │ -│ PDFGeneratorService (重構) │ -│ ├─ generate_coordinate_pdf() [Mode A] │ -│ └─ generate_flow_pdf() [Mode B] │ -└────────────────┬─────────────────────────────────────┘ - │ -┌────────────────▼─────────────────────────────────────┐ -│ Engine Layer │ -├──────────────────────────────────────────────────────┤ -│ PPStructureV3Engine (新增,統一管理) │ -│ ├─ GPU 配置 (RTX 4060 8GB 最佳化) │ -│ ├─ Model 配置 (Server 模型) │ -│ └─ 功能開關 (全功能啟用) │ -└──────────────────────────────────────────────────────┘ -``` - -### 核心類別設計 - -#### 1. 
PPStructureV3Engine (新增) -**目的**: 統一管理 PP-StructureV3 引擎,避免重複初始化 - -```python -class PPStructureV3Engine: - """ - PP-StructureV3 引擎管理器 (單例) - 針對 RTX 4060 8GB 優化配置 - """ - _instance = None - - def __new__(cls): - if cls._instance is None: - cls._instance = super().__new__(cls) - cls._instance._initialize() - return cls._instance - - def _initialize(self): - """初始化引擎""" - logger.info("Initializing PP-StructureV3 with RTX 4060 8GB optimized config") - - self.engine = PPStructureV3( - # ===== GPU 配置 ===== - use_gpu=True, - gpu_mem=6144, # 保留 2GB 給系統 (8GB - 2GB) - - # ===== 前處理模組 (全部啟用) ===== - use_doc_orientation_classify=True, # 文檔方向校正 - use_doc_unwarping=True, # 文檔影像矯正 - use_textline_orientation=True, # 文字行方向校正 - - # ===== 功能模組 (全部啟用) ===== - use_table_recognition=True, # 表格識別 - use_formula_recognition=True, # 公式識別 - use_chart_recognition=True, # 圖表識別 - use_seal_recognition=True, # 印章識別 - - # ===== OCR 模型配置 (Server 模型) ===== - text_detection_model_name="ch_PP-OCRv4_server_det", - text_recognition_model_name="ch_PP-OCRv4_server_rec", - - # ===== 版面檢測參數 ===== - layout_threshold=0.5, # 版面檢測閾值 - layout_nms=0.5, # NMS 閾值 - layout_unclip_ratio=1.5, # 邊界框擴展比例 - - # ===== OCR 參數 ===== - text_det_limit_side_len=1920, # 高解析度檢測 - text_det_thresh=0.3, # 檢測閾值 - text_det_box_thresh=0.5, # 邊界框閾值 - - # ===== 其他 ===== - show_log=True, - use_angle_cls=False, # 已被 textline_orientation 取代 - ) - - logger.info("PP-StructureV3 engine initialized successfully") - logger.info(f" - GPU: Enabled (RTX 4060 8GB)") - logger.info(f" - Models: Server (High Accuracy)") - logger.info(f" - Features: All Enabled (Table/Formula/Chart/Seal)") - - def predict(self, image_path: str): - """執行預測""" - return self.engine.predict(image_path) - - def get_engine(self): - """獲取引擎實例""" - return self.engine -``` - -#### 2. AdvancedLayoutExtractor (新增) -**目的**: 完整提取 PP-StructureV3 的所有版面資訊 - -```python -class AdvancedLayoutExtractor: - """ - 進階版面提取器 - 完整利用 PP-StructureV3 的 parsing_res_list, layout_bbox, layout_det_res - """ - - def __init__(self): - self.engine = PPStructureV3Engine() - - def extract_complete_layout( - self, - image_path: Path, - output_dir: Optional[Path] = None, - current_page: int = 0 - ) -> Tuple[Optional[Dict], List[Dict]]: - """ - 提取完整版面資訊(使用 page_result.json) - - Returns: - (layout_data, images_metadata) - - layout_data = { - "elements": [ - { - "element_id": int, - "type": str, # 23 種類型之一 - "bbox": [[x1,y1], [x2,y1], [x2,y2], [x1,y2]], # ✅ 不再是空列表 - "content": str, - "reading_order": int, # ✅ 閱讀順序 - "layout_type": str, # ✅ single/double/multi-column - "confidence": float, # ✅ 置信度 - "page": int - }, - ... 
- ], - "reading_order": [0, 1, 2, ...], - "layout_types": ["single", "double"], - "total_elements": int - } - """ - try: - results = self.engine.predict(str(image_path)) - - layout_elements = [] - images_metadata = [] - - for page_idx, page_result in enumerate(results): - # ✅ 核心改動:使用 page_result.json 而非 page_result.markdown - json_data = page_result.json - - # ===== 方法 1: 使用 parsing_res_list (主要來源) ===== - parsing_res_list = json_data.get('parsing_res_list', []) - - if parsing_res_list: - logger.info(f"Found {len(parsing_res_list)} elements in parsing_res_list") - - for idx, item in enumerate(parsing_res_list): - element = self._create_element_from_parsing_res( - item, idx, current_page - ) - if element: - layout_elements.append(element) - - # ===== 方法 2: 使用 layout_det_res (補充資訊) ===== - layout_det_res = json_data.get('layout_det_res', {}) - layout_boxes = layout_det_res.get('boxes', []) - - # 用於豐富 element 資訊(如果 parsing_res_list 缺少某些欄位) - self._enrich_elements_with_layout_det(layout_elements, layout_boxes) - - # ===== 方法 3: 處理圖片 (從 markdown_images) ===== - markdown_dict = page_result.markdown - markdown_images = markdown_dict.get('markdown_images', {}) - - for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()): - # 保存圖片到磁碟 - self._save_image(img_obj, img_path, output_dir or image_path.parent) - - # 從 parsing_res_list 或 layout_det_res 查找 bbox - bbox = self._find_image_bbox( - img_path, parsing_res_list, layout_boxes - ) - - images_metadata.append({ - 'element_id': len(layout_elements) + img_idx, - 'image_path': img_path, - 'type': 'image', - 'page': current_page, - 'bbox': bbox, - }) - - if layout_elements: - layout_data = { - 'elements': layout_elements, - 'total_elements': len(layout_elements), - 'reading_order': [e['reading_order'] for e in layout_elements], - 'layout_types': list(set(e.get('layout_type') for e in layout_elements)), - } - logger.info(f"✅ Extracted {len(layout_elements)} elements with complete info") - return layout_data, images_metadata - else: - logger.warning("No layout elements found") - return None, [] - - except Exception as e: - logger.error(f"Advanced layout extraction failed: {e}") - import traceback - traceback.print_exc() - return None, [] - - def _create_element_from_parsing_res( - self, item: Dict, idx: int, current_page: int - ) -> Optional[Dict]: - """從 parsing_res_list 的一個 item 創建 element""" - # 提取 layout_bbox - layout_bbox = item.get('layout_bbox') - bbox = self._convert_bbox_to_4point(layout_bbox) - - # 提取版面類型 - layout_type = item.get('layout', 'single') - - # 創建基礎 element - element = { - 'element_id': idx, - 'page': current_page, - 'bbox': bbox, # ✅ 完整座標 - 'layout_type': layout_type, - 'reading_order': idx, - 'confidence': item.get('score', 0.0), - } - - # 根據內容類型填充 type 和 content - # 順序很重要!優先級: table > formula > image > title > text - - if 'table' in item and item['table']: - element['type'] = 'table' - element['content'] = item['table'] - # 提取表格純文字(用於翻譯) - element['extracted_text'] = self._extract_table_text(item['table']) - - elif 'formula' in item and item['formula']: - element['type'] = 'formula' - element['content'] = item['formula'] # LaTeX - - elif 'figure' in item or 'image' in item: - element['type'] = 'image' - element['content'] = item.get('figure') or item.get('image') - - elif 'title' in item and item['title']: - element['type'] = 'title' - element['content'] = item['title'] - - elif 'text' in item and item['text']: - element['type'] = 'text' - element['content'] = item['text'] - - else: - # 未知類型,嘗試提取任何非系統欄位 - for key, value 
in item.items(): - if key not in ['layout_bbox', 'layout', 'score'] and value: - element['type'] = key - element['content'] = value - break - else: - return None # 沒有內容,跳過 - - return element - - def _convert_bbox_to_4point(self, layout_bbox) -> List: - """轉換 layout_bbox 為 4-point 格式""" - if layout_bbox is None: - return [] - - # 處理 numpy array - if hasattr(layout_bbox, 'tolist'): - bbox = layout_bbox.tolist() - else: - bbox = list(layout_bbox) - - if len(bbox) == 4: # [x1, y1, x2, y2] - x1, y1, x2, y2 = bbox - return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]] - - return [] - - def _extract_table_text(self, html_content: str) -> str: - """從 HTML 表格提取純文字(用於翻譯)""" - try: - from bs4 import BeautifulSoup - soup = BeautifulSoup(html_content, 'html.parser') - - # 提取所有 cell 的文字 - cells = [] - for cell in soup.find_all(['td', 'th']): - text = cell.get_text(strip=True) - if text: - cells.append(text) - - return ' | '.join(cells) - except Exception as e: - logger.warning(f"Failed to extract table text: {e}") - # Fallback: 簡單去除 HTML 標籤 - import re - text = re.sub(r'<[^>]+>', ' ', html_content) - text = re.sub(r'\s+', ' ', text) - return text.strip() -``` - -#### 3. PDFGeneratorService (重構) -**目的**: 支援雙模式 PDF 生成 - -```python -class PDFGeneratorService: - """ - PDF 生成服務 (重構版) - 支援兩種模式: - - coordinate: 座標定位模式 (精確還原版面) - - flow: 流式排版模式 (零資訊損失, 支援翻譯) - """ - - def generate_pdf( - self, - json_path: Path, - output_path: Path, - mode: str = 'coordinate', # 'coordinate' 或 'flow' - source_file_path: Optional[Path] = None - ) -> bool: - """ - 生成 PDF - - Args: - json_path: OCR JSON 檔案路徑 - output_path: 輸出 PDF 路徑 - mode: 生成模式 ('coordinate' 或 'flow') - source_file_path: 原始檔案路徑(用於獲取尺寸) - - Returns: - 成功返回 True - """ - try: - # 載入 OCR 數據 - ocr_data = self.load_ocr_json(json_path) - if not ocr_data: - return False - - # 根據模式選擇生成策略 - if mode == 'flow': - return self._generate_flow_pdf(ocr_data, output_path) - else: - return self._generate_coordinate_pdf(ocr_data, output_path, source_file_path) - - except Exception as e: - logger.error(f"PDF generation failed: {e}") - import traceback - traceback.print_exc() - return False - - def _generate_coordinate_pdf( - self, - ocr_data: Dict, - output_path: Path, - source_file_path: Optional[Path] - ) -> bool: - """ - 模式 A: 座標定位模式 - - 使用 layout_bbox 精確定位每個元素 - - 保留原始文件的視覺外觀 - - 適用於需要精確還原版面的場景 - """ - logger.info("Generating PDF in COORDINATE mode (layout-preserving)") - - # 提取數據 - layout_data = ocr_data.get('layout_data', {}) - elements = layout_data.get('elements', []) - - if not elements: - logger.warning("No layout elements found") - return False - - # 按 reading_order 和 page 排序 - sorted_elements = sorted(elements, key=lambda x: ( - x.get('page', 0), - x.get('reading_order', 0) - )) - - # 計算頁面尺寸 - ocr_width, ocr_height = self.calculate_page_dimensions(ocr_data, source_file_path) - target_width, target_height = self._get_target_dimensions(source_file_path, ocr_width, ocr_height) - - scale_w = target_width / ocr_width - scale_h = target_height / ocr_height - - # 創建 PDF canvas - pdf_canvas = canvas.Canvas(str(output_path), pagesize=(target_width, target_height)) - - # 按頁碼分組元素 - pages = {} - for elem in sorted_elements: - page = elem.get('page', 0) - if page not in pages: - pages[page] = [] - pages[page].append(elem) - - # 渲染每一頁 - for page_num, page_elements in sorted(pages.items()): - if page_num > 0: - pdf_canvas.showPage() - - logger.info(f"Rendering page {page_num + 1} with {len(page_elements)} elements") - - # 按 reading_order 渲染每個元素 - for elem in page_elements: - bbox = 
elem.get('bbox', []) - elem_type = elem.get('type') - content = elem.get('content', '') - - if not bbox: - logger.warning(f"Element {elem['element_id']} has no bbox, skipping") - continue - - # 根據類型渲染 - try: - if elem_type == 'table': - self._draw_table_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h) - elif elem_type == 'text': - self._draw_text_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h) - elif elem_type == 'title': - self._draw_title_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h) - elif elem_type == 'image': - img_path = json_path.parent / content - if img_path.exists(): - self._draw_image_at_bbox(pdf_canvas, str(img_path), bbox, target_height, scale_w, scale_h) - elif elem_type == 'formula': - self._draw_formula_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h) - # ... 其他類型 - - except Exception as e: - logger.warning(f"Failed to draw {elem_type} element: {e}") - - pdf_canvas.save() - logger.info(f"✅ Coordinate PDF generated: {output_path}") - return True - - def _generate_flow_pdf( - self, - ocr_data: Dict, - output_path: Path - ) -> bool: - """ - 模式 B: 流式排版模式 - - 按 reading_order 流式排版 - - 零資訊損失(不過濾任何內容) - - 使用 ReportLab Platypus 高階 API - - 適用於需要翻譯或內容處理的場景 - """ - from reportlab.platypus import ( - SimpleDocTemplate, Paragraph, Spacer, - Table, TableStyle, Image as RLImage, PageBreak - ) - from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle - from reportlab.lib import colors - from reportlab.lib.enums import TA_LEFT, TA_CENTER - - logger.info("Generating PDF in FLOW mode (content-preserving)") - - # 提取數據 - layout_data = ocr_data.get('layout_data', {}) - elements = layout_data.get('elements', []) - - if not elements: - logger.warning("No layout elements found") - return False - - # 按 reading_order 排序 - sorted_elements = sorted(elements, key=lambda x: ( - x.get('page', 0), - x.get('reading_order', 0) - )) - - # 創建文檔 - doc = SimpleDocTemplate(str(output_path)) - story = [] - styles = getSampleStyleSheet() - - # 自定義樣式 - styles.add(ParagraphStyle( - name='CustomTitle', - parent=styles['Heading1'], - fontSize=18, - alignment=TA_CENTER, - spaceAfter=12 - )) - - current_page = -1 - - # 按順序添加元素 - for elem in sorted_elements: - elem_type = elem.get('type') - content = elem.get('content', '') - page = elem.get('page', 0) - - # 分頁 - if page != current_page and current_page != -1: - story.append(PageBreak()) - current_page = page - - try: - if elem_type == 'title': - story.append(Paragraph(content, styles['CustomTitle'])) - story.append(Spacer(1, 12)) - - elif elem_type == 'text': - story.append(Paragraph(content, styles['Normal'])) - story.append(Spacer(1, 8)) - - elif elem_type == 'table': - # 解析 HTML 表格為 ReportLab Table - table_obj = self._html_to_reportlab_table(content) - if table_obj: - story.append(table_obj) - story.append(Spacer(1, 12)) - - elif elem_type == 'image': - # 嵌入圖片 - img_path = output_path.parent.parent / content - if img_path.exists(): - img = RLImage(str(img_path), width=400, height=300, kind='proportional') - story.append(img) - story.append(Spacer(1, 12)) - - elif elem_type == 'formula': - # 公式顯示為等寬字體 - story.append(Paragraph(f"{content}", styles['Code'])) - story.append(Spacer(1, 8)) - - except Exception as e: - logger.warning(f"Failed to add {elem_type} element to flow: {e}") - - # 生成 PDF - doc.build(story) - logger.info(f"✅ Flow PDF generated: {output_path}") - return True -``` - ---- - -## 🔧 實作步驟 - -### 階段 1: 引擎層重構 (2-3 小時) - -1. 
**創建 PPStructureV3Engine 單例類** - - 檔案: `backend/app/engines/ppstructure_engine.py` (新增) - - 統一管理 PP-StructureV3 引擎 - - RTX 4060 8GB 最佳化配置 - -2. **創建 AdvancedLayoutExtractor 類** - - 檔案: `backend/app/services/advanced_layout_extractor.py` (新增) - - 實作 `extract_complete_layout()` - - 完整提取 parsing_res_list, layout_bbox, layout_det_res - -3. **更新 OCRService** - - 修改 `analyze_layout()` 使用 `AdvancedLayoutExtractor` - - 保持向後相容(回退到舊邏輯) - -### 階段 2: PDF 生成器重構 (3-4 小時) - -1. **重構 PDFGeneratorService** - - 添加 `mode` 參數 - - 實作 `_generate_coordinate_pdf()` - - 實作 `_generate_flow_pdf()` - -2. **添加輔助方法** - - `_draw_table_at_bbox()`: 在指定座標繪製表格 - - `_draw_text_at_bbox()`: 在指定座標繪製文字 - - `_draw_title_at_bbox()`: 在指定座標繪製標題 - - `_draw_formula_at_bbox()`: 在指定座標繪製公式 - - `_html_to_reportlab_table()`: HTML 轉 ReportLab Table - -3. **更新 API 端點** - - `/tasks/{id}/download/pdf?mode=coordinate` (預設) - - `/tasks/{id}/download/pdf?mode=flow` - -### 階段 3: 測試與優化 (2-3 小時) - -1. **單元測試** - - 測試 AdvancedLayoutExtractor - - 測試兩種 PDF 模式 - - 測試向後相容性 - -2. **效能測試** - - GPU 記憶體使用監控 - - 處理速度測試 - - 並發請求測試 - -3. **品質驗證** - - 座標準確度 - - 閱讀順序正確性 - - 表格識別準確度 - ---- - -## 📈 預期效果 - -### 功能改善 - -| 指標 | 目前 | 重構後 | 提升 | -|------|-----|--------|------| -| bbox 可用性 | 0% (全空) | 100% | ✅ ∞ | -| 版面元素分類 | 2 種 | 23 種 | ✅ 11.5x | -| 閱讀順序 | 無 | 完整保留 | ✅ 100% | -| 資訊損失 | 21.6% | 0% (流式模式) | ✅ 100% | -| PDF 模式 | 1 種 | 2 種 | ✅ 2x | -| 翻譯支援 | 困難 | 完美 | ✅ 100% | - -### GPU 使用優化 - -```python -# RTX 4060 8GB 配置效果 -配置項目 | 目前 | 重構後 -----------------|--------|-------- -GPU 利用率 | ~30% | ~70% -處理速度 | 0.5頁/秒 | 1.2頁/秒 -前處理功能 | 關閉 | 全開 -識別準確度 | ~85% | ~95% -``` - ---- - -## 🎯 遷移策略 - -### 向後相容性保證 - -1. **API 層面** - - 保留現有所有 API 端點 - - 添加可選的 `mode` 參數 - - 預設行為不變 - -2. **數據層面** - - 舊 JSON 檔案仍可使用 - - 新增欄位不影響舊邏輯 - - 漸進式更新 - -3. **部署策略** - - 先部署新引擎和服務 - - 逐步啟用新功能 - - 監控效能和錯誤率 - ---- - -## 📝 配置檔案 - -### requirements.txt 更新 - -```txt -# 現有依賴 -paddlepaddle-gpu>=3.0.0 -paddleocr>=3.0.0 - -# 新增依賴 -python-docx>=0.8.11 # Word 文檔生成 (可選) -PyMuPDF>=1.23.0 # PDF 處理增強 -beautifulsoup4>=4.12.0 # HTML 解析 -lxml>=4.9.0 # XML/HTML 解析加速 -``` - -### 環境變數配置 - -```bash -# .env.local 新增 -PADDLE_GPU_MEMORY=6144 # RTX 4060 8GB 保留 2GB 給系統 -PADDLE_USE_SERVER_MODEL=true -PADDLE_ENABLE_ALL_FEATURES=true - -# PDF 生成預設模式 -PDF_DEFAULT_MODE=coordinate # 或 flow -``` - ---- - -## 🚀 實作優先級 - -### P0 (立即實作) -1. ✅ PPStructureV3Engine 統一引擎 -2. ✅ AdvancedLayoutExtractor 完整提取 -3. ✅ 座標定位模式 PDF - -### P1 (第二階段) -4. ⭐ 流式排版模式 PDF -5. ⭐ API 端點更新 (mode 參數) - -### P2 (優化階段) -6. 效能監控和優化 -7. 批次處理支援 -8. 品質檢查工具 - ---- - -## ⚠️ 風險與緩解 - -### 風險 1: GPU 記憶體不足 -**緩解**: -- 合理設定 `gpu_mem=6144` (保留 2GB) -- 添加記憶體監控 -- 大文檔分批處理 - -### 風險 2: 處理速度下降 -**緩解**: -- Server 模型在 GPU 上比 Mobile 更快 -- 並行處理多頁 -- 結果快取 - -### 風險 3: 向後相容問題 -**緩解**: -- 保留舊邏輯作為回退 -- 逐步遷移 -- 完整測試覆蓋 - ---- - -**預計總開發時間**: 7-10 小時 -**預計效果**: 100% 利用 PP-StructureV3 能力 + 零資訊損失 + 完美翻譯支援 - -您希望我開始實作哪個階段? diff --git a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/PP-STRUCTURE-ENHANCEMENT-PLAN.md b/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/PP-STRUCTURE-ENHANCEMENT-PLAN.md deleted file mode 100644 index 7ad0137..0000000 --- a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/PP-STRUCTURE-ENHANCEMENT-PLAN.md +++ /dev/null @@ -1,691 +0,0 @@ -# PP-StructureV3 完整版面資訊利用計劃 - -## 📋 執行摘要 - -### 問題診斷 -目前實作**嚴重低估了 PP-StructureV3 的能力**,只使用了 `page_result.markdown` 屬性,完全忽略了核心的版面資訊 `page_result.json`。 - -### 核心發現 -1. 
**PP-StructureV3 提供完整的版面解析資訊**,包括: - - `parsing_res_list`: 按閱讀順序排列的版面元素列表 - - `layout_bbox`: 每個元素的精確座標 - - `layout_det_res`: 版面檢測結果(區域類型、置信度) - - `overall_ocr_res`: 完整的 OCR 結果(包含所有文字的 bbox) - - `layout`: 版面類型(單欄/雙欄/多欄) - -2. **目前實作的缺陷**: - ```python - # ❌ 目前做法 (ocr_service.py:615-646) - markdown_dict = page_result.markdown # 只獲取 markdown 和圖片 - markdown_texts = markdown_dict.get('markdown_texts', '') - # bbox 被設為空列表 - 'bbox': [], # PP-StructureV3 doesn't provide individual bbox in this format - ``` - -3. **應該這樣做**: - ```python - # ✅ 正確做法 - json_data = page_result.json # 獲取完整的結構化資訊 - parsing_list = json_data.get('parsing_res_list', []) # 閱讀順序 + bbox - layout_det = json_data.get('layout_det_res', {}) # 版面檢測 - overall_ocr = json_data.get('overall_ocr_res', {}) # 所有文字的座標 - ``` - ---- - -## 🎯 規劃目標 - -### 階段 1: 提取完整版面資訊(高優先級) -**目標**: 修改 `analyze_layout()` 以使用 PP-StructureV3 的完整能力 - -**預期效果**: -- ✅ 每個版面元素都有精確的 `layout_bbox` -- ✅ 保留原始閱讀順序(`parsing_res_list` 的順序) -- ✅ 獲取版面類型資訊(單欄/雙欄) -- ✅ 提取區域分類(text/table/figure/title/formula) -- ✅ 零資訊損失(不需要過濾重疊文字) - -### 階段 2: 實作雙模式 PDF 生成(中優先級) -**目標**: 提供兩種 PDF 生成模式 - -**模式 A: 精確座標定位模式** -- 使用 `layout_bbox` 精確定位每個元素 -- 保留原始文件的視覺外觀 -- 適用於需要精確還原版面的場景 - -**模式 B: 流式排版模式** -- 按 `parsing_res_list` 順序流式排版 -- 使用 ReportLab Platypus 高階 API -- 零資訊損失,所有內容都可搜尋 -- 適用於需要翻譯或內容處理的場景 - -### 階段 3: 多欄版面處理(低優先級) -**目標**: 利用 PP-StructureV3 的多欄識別能力 - ---- - -## 📊 PP-StructureV3 完整資料結構 - -### 1. `page_result.json` 完整結構 - -```python -{ - # 基本資訊 - "input_path": str, # 源文件路徑 - "page_index": int, # 頁碼(PDF 專用) - - # 版面檢測結果 - "layout_det_res": { - "boxes": [ - { - "cls_id": int, # 類別 ID - "label": str, # 區域類型: text/table/figure/title/formula/seal - "score": float, # 置信度 0-1 - "coordinate": [x1, y1, x2, y2] # 矩形座標 - }, - ... - ] - }, - - # 完整 OCR 結果 - "overall_ocr_res": { - "dt_polys": np.ndarray, # 文字檢測多邊形 - "rec_polys": np.ndarray, # 文字識別多邊形 - "rec_boxes": np.ndarray, # 文字識別矩形框 (n, 4, 2) int16 - "rec_texts": List[str], # 識別的文字 - "rec_scores": np.ndarray # 識別置信度 - }, - - # **核心版面解析結果(按閱讀順序)** - "parsing_res_list": [ - { - "layout_bbox": np.ndarray, # 區域邊界框 [x1, y1, x2, y2] - "layout": str, # 版面類型: single/double/multi-column - "text": str, # 文字內容(如果是文字區域) - "table": str, # 表格 HTML(如果是表格區域) - "image": str, # 圖片路徑(如果是圖片區域) - "formula": str, # 公式 LaTeX(如果是公式區域) - # ... 其他區域類型 - }, - ... # 順序 = 閱讀順序 - ], - - # 文字段落 OCR(按閱讀順序) - "text_paragraphs_ocr_res": { - "rec_polys": np.ndarray, - "rec_texts": List[str], - "rec_scores": np.ndarray - }, - - # 可選模組結果 - "formula_res_region1": {...}, # 公式識別結果 - "table_cell_img": {...}, # 表格儲存格圖片 - "seal_res_region1": {...} # 印章識別結果 -} -``` - -### 2. 
關鍵欄位說明 - -| 欄位 | 用途 | 資料格式 | 重要性 | -|------|------|---------|--------| -| `parsing_res_list` | **核心資料**,包含按閱讀順序排列的所有版面元素 | List[Dict] | ⭐⭐⭐⭐⭐ | -| `layout_bbox` | 每個元素的精確座標 | np.ndarray [x1,y1,x2,y2] | ⭐⭐⭐⭐⭐ | -| `layout` | 版面類型(單欄/雙欄/多欄) | str: single/double/multi | ⭐⭐⭐⭐ | -| `layout_det_res` | 版面檢測詳細結果(包含區域分類) | Dict with boxes list | ⭐⭐⭐⭐ | -| `overall_ocr_res` | 所有文字的 OCR 結果和座標 | Dict with np.ndarray | ⭐⭐⭐⭐ | -| `markdown` | 簡化的 Markdown 輸出 | Dict with texts/images | ⭐⭐ | - ---- - -## 🔧 實作計劃 - -### 任務 1: 重構 `analyze_layout()` 函數 - -**檔案**: `/backend/app/services/ocr_service.py` - -**修改範圍**: Lines 590-710 - -**核心改動**: - -```python -def analyze_layout(self, image_path: Path, output_dir: Optional[Path] = None, current_page: int = 0) -> Tuple[Optional[Dict], List[Dict]]: - """ - Analyze document layout using PP-StructureV3 (使用完整的 JSON 資訊) - """ - try: - structure_engine = self.get_structure_engine() - results = structure_engine.predict(str(image_path)) - - layout_elements = [] - images_metadata = [] - - for page_idx, page_result in enumerate(results): - # ✅ 修改 1: 使用完整的 JSON 資料而非只用 markdown - json_data = page_result.json - - # ✅ 修改 2: 提取版面檢測結果 - layout_det_res = json_data.get('layout_det_res', {}) - layout_boxes = layout_det_res.get('boxes', []) - - # ✅ 修改 3: 提取核心的 parsing_res_list(包含閱讀順序 + bbox) - parsing_res_list = json_data.get('parsing_res_list', []) - - if parsing_res_list: - # *** 核心邏輯:使用 parsing_res_list *** - for idx, item in enumerate(parsing_res_list): - # 提取 bbox(不再是空列表!) - layout_bbox = item.get('layout_bbox') - if layout_bbox is not None: - # 轉換 numpy array 為標準格式 - if hasattr(layout_bbox, 'tolist'): - bbox = layout_bbox.tolist() - else: - bbox = list(layout_bbox) - - # 轉換為 4-point 格式: [[x1,y1], [x2,y1], [x2,y2], [x1,y2]] - if len(bbox) == 4: # [x1, y1, x2, y2] - x1, y1, x2, y2 = bbox - bbox = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]] - else: - bbox = [] - - # 提取版面類型 - layout_type = item.get('layout', 'single') - - # 創建元素(包含所有資訊) - element = { - 'element_id': idx, - 'page': current_page, - 'bbox': bbox, # ✅ 不再是空列表! - 'layout_type': layout_type, # ✅ 新增版面類型 - 'reading_order': idx, # ✅ 新增閱讀順序 - } - - # 根據內容類型提取資料 - if 'table' in item: - element['type'] = 'table' - element['content'] = item['table'] - # 提取表格純文字(用於翻譯) - element['extracted_text'] = self._extract_table_text(item['table']) - - elif 'text' in item: - element['type'] = 'text' - element['content'] = item['text'] - - elif 'figure' in item or 'image' in item: - element['type'] = 'image' - element['content'] = item.get('figure') or item.get('image') - - elif 'formula' in item: - element['type'] = 'formula' - element['content'] = item['formula'] - - elif 'title' in item: - element['type'] = 'title' - element['content'] = item['title'] - - else: - # 未知類型,記錄所有非系統欄位 - for key, value in item.items(): - if key not in ['layout_bbox', 'layout']: - element['type'] = key - element['content'] = value - break - - layout_elements.append(element) - - else: - # 回退到 markdown 方式(向後相容) - logger.warning("No parsing_res_list found, falling back to markdown parsing") - markdown_dict = page_result.markdown - # ... 原有的 markdown 解析邏輯 ... 
- - # ✅ 修改 4: 同時處理提取的圖片(仍需保存到磁碟) - markdown_dict = page_result.markdown - markdown_images = markdown_dict.get('markdown_images', {}) - - for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()): - # 保存圖片到磁碟 - try: - base_dir = output_dir if output_dir else image_path.parent - full_img_path = base_dir / img_path - full_img_path.parent.mkdir(parents=True, exist_ok=True) - - if hasattr(img_obj, 'save'): - img_obj.save(str(full_img_path)) - logger.info(f"Saved extracted image to {full_img_path}") - except Exception as e: - logger.warning(f"Failed to save image {img_path}: {e}") - - # 提取 bbox(從檔名或從 parsing_res_list 匹配) - bbox = self._find_image_bbox(img_path, parsing_res_list, layout_boxes) - - images_metadata.append({ - 'element_id': len(layout_elements) + img_idx, - 'image_path': img_path, - 'type': 'image', - 'page': current_page, - 'bbox': bbox, - }) - - if layout_elements: - layout_data = { - 'elements': layout_elements, - 'total_elements': len(layout_elements), - 'reading_order': [e['reading_order'] for e in layout_elements], # ✅ 保留閱讀順序 - 'layout_types': list(set(e.get('layout_type') for e in layout_elements)), # ✅ 版面類型統計 - } - logger.info(f"Detected {len(layout_elements)} layout elements (with bbox and reading order)") - return layout_data, images_metadata - else: - logger.warning("No layout elements detected") - return None, [] - - except Exception as e: - import traceback - logger.error(f"Layout analysis error: {str(e)}\n{traceback.format_exc()}") - return None, [] - - -def _find_image_bbox(self, img_path: str, parsing_res_list: List[Dict], layout_boxes: List[Dict]) -> List: - """ - 從 parsing_res_list 或 layout_det_res 中查找圖片的 bbox - """ - # 方法 1: 從檔名提取(現有方法) - import re - match = re.search(r'box_(\d+)_(\d+)_(\d+)_(\d+)', img_path) - if match: - x1, y1, x2, y2 = map(int, match.groups()) - return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]] - - # 方法 2: 從 parsing_res_list 匹配(如果包含圖片路徑資訊) - for item in parsing_res_list: - if 'image' in item or 'figure' in item: - content = item.get('image') or item.get('figure') - if img_path in str(content): - bbox = item.get('layout_bbox') - if bbox is not None: - if hasattr(bbox, 'tolist'): - bbox_list = bbox.tolist() - else: - bbox_list = list(bbox) - if len(bbox_list) == 4: - x1, y1, x2, y2 = bbox_list - return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]] - - # 方法 3: 從 layout_det_res 匹配(根據類型) - for box in layout_boxes: - if box.get('label') in ['figure', 'image']: - coord = box.get('coordinate', []) - if len(coord) == 4: - x1, y1, x2, y2 = coord - return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]] - - logger.warning(f"Could not find bbox for image {img_path}") - return [] -``` - ---- - -### 任務 2: 更新 PDF 生成器使用新資訊 - -**檔案**: `/backend/app/services/pdf_generator_service.py` - -**核心改動**: - -1. **移除文字過濾邏輯**(不再需要!) - - 因為 `parsing_res_list` 已經按閱讀順序排列 - - 表格/圖片有自己的區域,文字有自己的區域 - - 不會有重疊問題 - -2. 
**按 `reading_order` 渲染元素** - ```python - def generate_layout_pdf(self, json_path: Path, output_path: Path, mode: str = 'coordinate') -> bool: - """ - mode: 'coordinate' 或 'flow' - """ - # 載入資料 - ocr_data = self.load_ocr_json(json_path) - layout_data = ocr_data.get('layout_data', {}) - elements = layout_data.get('elements', []) - - if mode == 'coordinate': - # 模式 A: 座標定位模式 - return self._generate_coordinate_pdf(elements, output_path, ocr_data) - else: - # 模式 B: 流式排版模式 - return self._generate_flow_pdf(elements, output_path, ocr_data) - - def _generate_coordinate_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool: - """座標定位模式 - 精確還原版面""" - # 按 reading_order 排序元素 - sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0)) - - # 按頁碼分組 - pages = {} - for elem in sorted_elements: - page = elem.get('page', 0) - if page not in pages: - pages[page] = [] - pages[page].append(elem) - - # 渲染每頁 - for page_num, page_elements in sorted(pages.items()): - for elem in page_elements: - bbox = elem.get('bbox', []) - elem_type = elem.get('type') - content = elem.get('content', '') - - if not bbox: - logger.warning(f"Element {elem['element_id']} has no bbox, skipping") - continue - - # 使用精確座標渲染 - if elem_type == 'table': - self.draw_table_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h) - elif elem_type == 'text': - self.draw_text_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h) - elif elem_type == 'image': - self.draw_image_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h) - # ... 其他類型 - - def _generate_flow_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool: - """流式排版模式 - 零資訊損失""" - from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, Image, Spacer - from reportlab.lib.styles import getSampleStyleSheet - - # 按 reading_order 排序元素 - sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0)) - - # 創建 Story(流式內容) - story = [] - styles = getSampleStyleSheet() - - for elem in sorted_elements: - elem_type = elem.get('type') - content = elem.get('content', '') - - if elem_type == 'title': - story.append(Paragraph(content, styles['Title'])) - elif elem_type == 'text': - story.append(Paragraph(content, styles['Normal'])) - elif elem_type == 'table': - # 解析 HTML 表格為 ReportLab Table - table_obj = self._html_to_reportlab_table(content) - story.append(table_obj) - elif elem_type == 'image': - # 嵌入圖片 - img_path = json_path.parent / content - if img_path.exists(): - story.append(Image(str(img_path), width=400, height=300)) - - story.append(Spacer(1, 12)) # 間距 - - # 生成 PDF - doc = SimpleDocTemplate(str(output_path)) - doc.build(story) - return True - ``` - ---- - -## 📈 預期效果對比 - -### 目前實作 vs 新實作 - -| 指標 | 目前實作 ❌ | 新實作 ✅ | 改善 | -|------|-----------|----------|------| -| **bbox 資訊** | 空列表 `[]` | 精確座標 `[x1,y1,x2,y2]` | ✅ 100% | -| **閱讀順序** | 無(混合 HTML) | `reading_order` 欄位 | ✅ 100% | -| **版面類型** | 無 | `layout_type`(單欄/雙欄) | ✅ 100% | -| **元素分類** | 簡單判斷 `= 3.7) -pip install pdf2docx # PDF 轉 DOCX - -# PDF 生成所需 -pip install reportlab # 自定義 PDF 生成 -pip install markdown # Markdown 處理 -``` - -### 3.3 可選依賴 - -```bash -# 選擇性功能 -pip install opencv-python # 影像處理 -pip install Pillow # 圖片處理 -pip install lxml # HTML/XML 解析 -pip install beautifulsoup4 # HTML 美化 -``` - ---- - -## 四、核心實作方案 - -### 4.1 使用 PP-StructureV3 (推薦) - -```python -from paddleocr import PPStructureV3 -from pathlib import Path -import json - -class DocumentLayoutExtractor: - """文檔版面提取與還原""" - - def __init__(self, use_gpu=True): - 
"""初始化 PP-StructureV3""" - self.engine = PPStructureV3( - # 文檔前處理 - use_doc_orientation_classify=True, # 文檔方向分類 - use_doc_unwarping=True, # 文檔影像矯正 - use_textline_orientation=True, # 文字行方向 - - # 功能模組開關 - use_seal_recognition=True, # 印章識別 - use_table_recognition=True, # 表格識別 - use_formula_recognition=True, # 公式識別 - use_chart_recognition=True, # 圖表識別 - - # OCR 模型配置 - text_recognition_model_name="ch_PP-OCRv4_server_rec", # 中文識別 - # text_recognition_model_name="en_PP-OCRv4_mobile_rec", # 英文識別 - - # 版面檢測參數調整 - layout_threshold=0.5, # 版面檢測閾值 - layout_nms=0.5, # NMS 閾值 - layout_unclip_ratio=1.5, # 邊界框擴展比例 - - show_log=True - ) - - def extract_layout(self, input_path, output_dir="output"): - """ - 提取完整版面資訊 - - Args: - input_path: PDF或圖片路徑 - output_dir: 輸出目錄 - - Returns: - list: 每頁的結構化結果 - """ - output_path = Path(output_dir) - output_path.mkdir(parents=True, exist_ok=True) - - # 執行文檔解析 - results = self.engine.predict(input_path) - - all_pages_data = [] - - for page_idx, result in enumerate(results): - page_data = { - "page": page_idx + 1, - "regions": [] - } - - # 遍歷該頁的所有區域 - for region in result: - region_info = { - "type": region.get("type"), # 區域類別 - "bbox": region.get("bbox"), # 邊界框 [x1,y1,x2,y2] - "score": region.get("score", 0), # 置信度 - "content": {}, - "reading_order": region.get("layout_bbox_idx", 0) # 閱讀順序 - } - - # 根據類別提取不同內容 - region_type = region.get("type") - - if region_type in ["text", "title", "header", "footer"]: - # OCR 文字區域 - region_info["content"] = { - "text": region.get("res", []), - "ocr_boxes": region.get("text_region", []) - } - - elif region_type == "table": - # 表格區域 - region_info["content"] = { - "html": region.get("res", {}).get("html", ""), - "text": region.get("res", {}).get("text", ""), - "structure": region.get("res", {}).get("cell_bbox", []) - } - - elif region_type == "formula": - # 公式區域 - region_info["content"] = { - "latex": region.get("res", "") - } - - elif region_type == "figure": - # 圖片區域 - region_info["content"] = { - "image_path": f"page_{page_idx+1}_figure_{len(page_data['regions'])}.png" - } - # 儲存圖片 - if "img" in region: - img_path = output_path / region_info["content"]["image_path"] - region["img"].save(img_path) - - elif region_type == "seal": - # 印章區域 - region_info["content"] = { - "text": region.get("res", ""), - "seal_bbox": region.get("seal_bbox", []) - } - - page_data["regions"].append(region_info) - - # 按閱讀順序排序 - page_data["regions"].sort(key=lambda x: x["reading_order"]) - - all_pages_data.append(page_data) - - # 儲存該頁的 JSON - json_path = output_path / f"page_{page_idx+1}.json" - with open(json_path, "w", encoding="utf-8") as f: - json.dump(page_data, f, ensure_ascii=False, indent=2) - - print(f"✓ 已處理第 {page_idx+1} 頁") - - # 儲存完整文檔結構 - full_json = output_path / "document_structure.json" - with open(full_json, "w", encoding="utf-8") as f: - json.dump(all_pages_data, f, ensure_ascii=False, indent=2) - - return all_pages_data - - def export_to_markdown(self, input_path, output_dir="output"): - """ - 直接導出為 Markdown (PP-StructureV3 內建) - """ - output_path = Path(output_dir) - output_path.mkdir(parents=True, exist_ok=True) - - results = self.engine.predict(input_path) - - for page_idx, result in enumerate(results): - # PP-StructureV3 自動生成 Markdown - md_content = result.get("markdown", "") - - if md_content: - md_path = output_path / f"page_{page_idx+1}.md" - with open(md_path, "w", encoding="utf-8") as f: - f.write(md_content) - print(f"✓ 已生成 Markdown: {md_path}") - - -# 使用範例 -if __name__ == "__main__": - extractor = 
DocumentLayoutExtractor(use_gpu=True) - - # 提取版面資訊 - layout_data = extractor.extract_layout( - input_path="document.pdf", - output_dir="output/layout" - ) - - # 導出 Markdown - extractor.export_to_markdown( - input_path="document.pdf", - output_dir="output/markdown" - ) -``` - -### 4.2 使用 PaddleOCR-VL (更簡潔) - -```python -from paddleocr import PaddleOCRVL -from pathlib import Path - -class SimpleDocumentParser: - """使用 PaddleOCR-VL 的簡化方案""" - - def __init__(self): - self.pipeline = PaddleOCRVL( - use_doc_orientation_classify=True, - use_doc_unwarping=True, - use_layout_detection=True, - use_chart_recognition=True - ) - - def parse_document(self, input_path, output_dir="output"): - """一鍵解析文檔""" - output_path = Path(output_dir) - output_path.mkdir(parents=True, exist_ok=True) - - # 執行解析 - results = self.pipeline.predict(input_path) - - for res in results: - # 列印結構化輸出 - res.print() - - # 儲存 JSON - res.save_to_json(save_path=str(output_path)) - - # 儲存 Markdown - res.save_to_markdown(save_path=str(output_path)) - - print(f"✓ 解析完成,結果已儲存至: {output_path}") - - def parse_pdf_to_single_markdown(self, pdf_path, output_path="output"): - """將整個 PDF 轉為單一 Markdown 檔案""" - output_dir = Path(output_path) - output_dir.mkdir(parents=True, exist_ok=True) - - # 解析 PDF - results = self.pipeline.predict(pdf_path) - - markdown_list = [] - markdown_images = [] - - # 收集所有頁面的 Markdown - for res in results: - md_info = res.markdown - markdown_list.append(md_info) - markdown_images.append(md_info.get("markdown_images", {})) - - # 合併所有頁面 - markdown_text = self.pipeline.concatenate_markdown_pages(markdown_list) - - # 儲存 Markdown 檔案 - md_file = output_dir / f"{Path(pdf_path).stem}.md" - with open(md_file, "w", encoding="utf-8") as f: - f.write(markdown_text) - - # 儲存相關圖片 - for item in markdown_images: - if item: - for path, image in item.items(): - img_path = output_dir / path - img_path.parent.mkdir(parents=True, exist_ok=True) - image.save(img_path) - - print(f"✓ 已生成單一 Markdown: {md_file}") - return md_file - - -# 使用範例 -if __name__ == "__main__": - parser = SimpleDocumentParser() - - # 簡單解析 - parser.parse_document("document.pdf", "output") - - # 生成單一 Markdown - parser.parse_pdf_to_single_markdown("document.pdf", "output") -``` - ---- - -## 五、PDF 還原與生成 - -### 5.1 版面恢復策略 - -PaddleOCR 提供兩種版面恢復方法: - -#### 方法 1: 標準 PDF 解析 (適用於可複製文字的 PDF) -```bash -# 使用 pdf2docx 直接轉換 -paddleocr --image_dir=document.pdf \ - --type=structure \ - --recovery=true \ - --use_pdf2docx_api=true -``` - -**優點**: 快速、保留原始格式 -**缺點**: 僅適用於標準 PDF,掃描文檔無效 - -#### 方法 2: 影像格式 PDF 解析 (通用方案) -```bash -# 使用完整 OCR Pipeline -paddleocr --image_dir=document.pdf \ - --type=structure \ - --recovery=true \ - --use_pdf2docx_api=false -``` - -**優點**: 適用於掃描文檔、複雜版面 -**缺點**: 速度較慢、需要更多計算資源 - -### 5.2 自定義 PDF 生成方案 - -```python -from reportlab.lib.pagesizes import A4 -from reportlab.lib.units import mm -from reportlab.pdfgen import canvas -from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle -from reportlab.platypus import ( - SimpleDocTemplate, Paragraph, Spacer, - Table, TableStyle, Image, PageBreak -) -from reportlab.lib import colors -from reportlab.lib.enums import TA_LEFT, TA_CENTER, TA_RIGHT -import json - -class PDFLayoutRecovery: - """基於版面資訊的 PDF 還原""" - - def __init__(self, layout_json_path): - """ - Args: - layout_json_path: 版面 JSON 檔案路徑 - """ - with open(layout_json_path, 'r', encoding='utf-8') as f: - self.layout_data = json.load(f) - - self.styles = getSampleStyleSheet() - self._create_custom_styles() - - def _create_custom_styles(self): - 
"""建立自定義樣式""" - # 標題樣式 - self.styles.add(ParagraphStyle( - name='CustomTitle', - parent=self.styles['Heading1'], - fontSize=18, - textColor=colors.HexColor('#1a1a1a'), - spaceAfter=12, - alignment=TA_CENTER - )) - - # 段落標題樣式 - self.styles.add(ParagraphStyle( - name='CustomHeading', - parent=self.styles['Heading2'], - fontSize=14, - textColor=colors.HexColor('#333333'), - spaceAfter=8, - spaceBefore=8 - )) - - # 正文樣式 - self.styles.add(ParagraphStyle( - name='CustomBody', - parent=self.styles['Normal'], - fontSize=11, - leading=16, - textColor=colors.HexColor('#000000'), - alignment=TA_LEFT - )) - - def generate_pdf(self, output_path="output.pdf"): - """生成 PDF""" - doc = SimpleDocTemplate( - output_path, - pagesize=A4, - rightMargin=20*mm, - leftMargin=20*mm, - topMargin=20*mm, - bottomMargin=20*mm - ) - - story = [] - - # 遍歷所有頁面 - for page_data in self.layout_data: - page_num = page_data.get("page", 1) - - # 頁面標記 (可選) - # story.append(Paragraph(f"--- 第 {page_num} 頁 ---", self.styles['CustomHeading'])) - # story.append(Spacer(1, 5*mm)) - - # 遍歷該頁的所有區域 - for region in page_data.get("regions", []): - region_type = region.get("type") - content = region.get("content", {}) - - if region_type in ["title", "document_title"]: - # 標題 - text = self._extract_text_from_ocr(content) - if text: - story.append(Paragraph(text, self.styles['CustomTitle'])) - story.append(Spacer(1, 3*mm)) - - elif region_type in ["text", "paragraph"]: - # 正文 - text = self._extract_text_from_ocr(content) - if text: - story.append(Paragraph(text, self.styles['CustomBody'])) - story.append(Spacer(1, 2*mm)) - - elif region_type in ["paragraph_title", "heading"]: - # 段落標題 - text = self._extract_text_from_ocr(content) - if text: - story.append(Paragraph(text, self.styles['CustomHeading'])) - story.append(Spacer(1, 2*mm)) - - elif region_type == "table": - # 表格 - table_element = self._create_table_from_html(content) - if table_element: - story.append(table_element) - story.append(Spacer(1, 3*mm)) - - elif region_type == "figure": - # 圖片 - img_path = content.get("image_path") - if img_path and Path(img_path).exists(): - try: - img = Image(img_path, width=150*mm, height=100*mm, kind='proportional') - story.append(img) - story.append(Spacer(1, 3*mm)) - except: - pass - - elif region_type == "formula": - # 公式 (作為程式碼區塊顯示) - latex = content.get("latex", "") - if latex: - story.append(Paragraph(f"{latex}", - self.styles['Code'])) - story.append(Spacer(1, 2*mm)) - - # 分頁 (除了最後一頁) - if page_num < len(self.layout_data): - story.append(PageBreak()) - - # 生成 PDF - doc.build(story) - print(f"✓ PDF 已生成: {output_path}") - - def _extract_text_from_ocr(self, content): - """從 OCR 結果提取文字""" - if isinstance(content.get("text"), str): - return content["text"] - elif isinstance(content.get("text"), list): - # OCR 結果是列表形式 - texts = [] - for item in content["text"]: - if isinstance(item, dict) and "text" in item: - texts.append(item["text"]) - elif isinstance(item, (list, tuple)) and len(item) >= 2: - texts.append(item[1]) # (bbox, text, confidence) 格式 - return " ".join(texts) - return "" - - def _create_table_from_html(self, content): - """從 HTML 建立表格""" - # 簡化版:從 text 提取 - text = content.get("text", "") - if not text: - return None - - # 這裡可以解析 HTML 或直接使用文字 - # 為簡化起見,這裡僅展示基本結構 - try: - # 假設文字格式為行分隔 - rows = [row.split("\t") for row in text.split("\n") if row.strip()] - - if not rows: - return None - - table = Table(rows) - table.setStyle(TableStyle([ - ('BACKGROUND', (0, 0), (-1, 0), colors.grey), - ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke), - 
('ALIGN', (0, 0), (-1, -1), 'CENTER'),
-                ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
-                ('FONTSIZE', (0, 0), (-1, 0), 10),
-                ('BOTTOMPADDING', (0, 0), (-1, 0), 8),
-                ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
-                ('GRID', (0, 0), (-1, -1), 0.5, colors.black)
-            ]))
-
-            return table
-        except Exception:
-            # 解析失敗時回傳 None,呼叫端會略過此表格
-            # (避免 bare except 連 KeyboardInterrupt 也一併吞掉)
-            return None
-
-
-# 使用範例
-if __name__ == "__main__":
-    # 先用 PP-StructureV3 提取版面
-    extractor = DocumentLayoutExtractor()
-    layout_data = extractor.extract_layout("document.pdf", "output/layout")
-
-    # 生成 PDF
-    pdf_recovery = PDFLayoutRecovery("output/layout/document_structure.json")
-    pdf_recovery.generate_pdf("output/recovered_document.pdf")
-```
-
-### 5.3 使用 python-docx 生成 Word 文檔
-
-```python
-from pathlib import Path  # 原版缺少此 import:檢查圖片路徑時會用到
-
-from docx import Document
-from docx.shared import Inches, Pt, RGBColor
-from docx.enum.text import WD_ALIGN_PARAGRAPH
-import json
-
-class DOCXLayoutRecovery:
-    """基於版面資訊的 DOCX 還原"""
-
-    def __init__(self, layout_json_path):
-        with open(layout_json_path, 'r', encoding='utf-8') as f:
-            self.layout_data = json.load(f)
-
-        self.doc = Document()
-
-    def generate_docx(self, output_path="output.docx"):
-        """生成 DOCX"""
-        for page_idx, page_data in enumerate(self.layout_data):
-            for region in page_data.get("regions", []):
-                region_type = region.get("type")
-                content = region.get("content", {})
-
-                if region_type in ["title", "document_title"]:
-                    # 標題
-                    text = self._extract_text(content)
-                    if text:
-                        heading = self.doc.add_heading(text, level=1)
-                        heading.alignment = WD_ALIGN_PARAGRAPH.CENTER
-
-                elif region_type == "text":
-                    # 正文
-                    text = self._extract_text(content)
-                    if text:
-                        para = self.doc.add_paragraph(text)
-                        para.paragraph_format.first_line_indent = Inches(0.5)
-
-                elif region_type == "paragraph_title":
-                    # 小標題
-                    text = self._extract_text(content)
-                    if text:
-                        self.doc.add_heading(text, level=2)
-
-                elif region_type == "table":
-                    # 表格 (簡化版)
-                    text = content.get("text", "")
-                    if text:
-                        rows = [row.split("\t") for row in text.split("\n")]
-                        if rows:
-                            table = self.doc.add_table(rows=len(rows), cols=len(rows[0]))
-                            table.style = 'Light Grid Accent 1'
-
-                            for i, row_data in enumerate(rows):
-                                for j, cell_data in enumerate(row_data):
-                                    table.rows[i].cells[j].text = cell_data
-
-                elif region_type == "figure":
-                    # 圖片
-                    img_path = content.get("image_path")
-                    if img_path and Path(img_path).exists():
-                        try:
-                            self.doc.add_picture(img_path, width=Inches(5))
-                        except Exception:
-                            # 圖片損毀或格式不支援時跳過,不中斷整份文件
-                            pass
-
-            # 分頁 (最後一頁之後不需要)
-            if page_idx < len(self.layout_data) - 1:
-                self.doc.add_page_break()
-
-        self.doc.save(output_path)
-        print(f"✓ DOCX 已生成: {output_path}")
-
-    def _extract_text(self, content):
-        """提取文字"""
-        if isinstance(content.get("text"), str):
-            return content["text"]
-        elif isinstance(content.get("text"), list):
-            texts = []
-            for item in content["text"]:
-                if isinstance(item, dict):
-                    texts.append(item.get("text", ""))
-                elif isinstance(item, (list, tuple)) and len(item) >= 2:
-                    texts.append(item[1])
-            return " ".join(texts)
-        return ""
-```
-
----
-
-## 六、進階配置與優化
-
-### 6.1 性能優化
-
-```python
-# 輕量級配置 (CPU 環境)
-extractor = PPStructureV3(
-    # 使用 mobile 模型
-    text_detection_model_name="ch_PP-OCRv4_mobile_det",
-    text_recognition_model_name="ch_PP-OCRv4_mobile_rec",
-
-    # 關閉部分功能
-    use_chart_recognition=False,    # 圖表識別較耗時
-    use_formula_recognition=False,  # 公式識別需要較大模型
-
-    # 降低精度要求
-    layout_threshold=0.6,
-    text_det_limit_side_len=960,  # 降低檢測尺寸
-)
-
-# 高精度配置 (GPU 環境)
-extractor = PPStructureV3(
-    # 使用 server 模型
-    text_detection_model_name="ch_PP-OCRv4_server_det",
-    text_recognition_model_name="ch_PP-OCRv4_server_rec",
-
-    # 啟用所有功能
-    use_chart_recognition=True,
-    use_formula_recognition=True,
-    use_seal_recognition=True,
-
-    # 提高精度
-    layout_threshold=0.3,
-    text_det_limit_side_len=1920,
-)
-```
-
-### 6.2 批次處理
-
-```python
-import json  # 原版缺少此 import:_process_single_file 會用到
-import os
-from pathlib import Path
-from concurrent.futures import ThreadPoolExecutor, as_completed
-
-class BatchDocumentProcessor:
-    """批次文檔處理器"""
-
-    def __init__(self, max_workers=4):
-        # 注意:多執行緒共用同一個 pipeline 實例未必執行緒安全;
-        # 若遇到異常,可改為每個執行緒各自建立 PPStructureV3
-        self.extractor = PPStructureV3()
-        self.max_workers = max_workers
-
-    def process_directory(self, input_dir, output_dir="output", file_types=None):
-        """
-        批次處理目錄下的所有文檔
-
-        Args:
-            input_dir: 輸入目錄
-            output_dir: 輸出目錄
-            file_types: 支援的檔案類型,預設 ['.pdf', '.png', '.jpg']
-        """
-        if file_types is None:
-            file_types = ['.pdf', '.png', '.jpg', '.jpeg']
-
-        input_path = Path(input_dir)
-        output_path = Path(output_dir)
-        output_path.mkdir(parents=True, exist_ok=True)
-
-        # 收集所有待處理檔案
-        files = []
-        for file_type in file_types:
-            files.extend(input_path.glob(f"**/*{file_type}"))
-
-        print(f"找到 {len(files)} 個檔案待處理")
-
-        # 多執行緒處理
-        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
-            futures = {
-                executor.submit(self._process_single_file, file, output_path): file
-                for file in files
-            }
-
-            for future in as_completed(futures):
-                file = futures[future]
-                try:
-                    result = future.result()
-                    print(f"✓ 完成: {file.name}")
-                except Exception as e:
-                    print(f"✗ 失敗: {file.name} - {e}")
-
-    def _process_single_file(self, file_path, output_dir):
-        """處理單一檔案"""
-        file_stem = file_path.stem
-        file_output_dir = output_dir / file_stem
-        file_output_dir.mkdir(parents=True, exist_ok=True)
-
-        # 提取版面
-        results = self.extractor.predict(str(file_path))
-
-        # 儲存結果
-        for idx, result in enumerate(results):
-            # 儲存 JSON(若 result 非純 dict,依 PaddleOCR 版本先轉為可序列化結構)
-            json_path = file_output_dir / f"page_{idx+1}.json"
-            with open(json_path, "w", encoding="utf-8") as f:
-                json.dump(result, f, ensure_ascii=False, indent=2)
-
-            # 儲存 Markdown
-            if "markdown" in result:
-                md_path = file_output_dir / f"page_{idx+1}.md"
-                with open(md_path, "w", encoding="utf-8") as f:
-                    f.write(result["markdown"])
-
-        return file_output_dir
-
-
-# 使用範例
-if __name__ == "__main__":
-    processor = BatchDocumentProcessor(max_workers=4)
-    processor.process_directory(
-        input_dir="documents",
-        output_dir="output",
-        file_types=['.pdf']
-    )
-```
-
----
-
-## 七、實際應用場景
-
-### 7.1 學術論文處理
-
-```python
-class AcademicPaperProcessor:
-    """學術論文專用處理器"""
-
-    def __init__(self):
-        self.extractor = PPStructureV3(
-            use_formula_recognition=True,  # 公式識別
-            use_table_recognition=True,    # 表格識別
-            use_chart_recognition=True,    # 圖表識別
-        )
-
-    def process_paper(self, pdf_path, output_dir="output"):
-        """
-        處理學術論文
-        - 提取標題、摘要、章節
-        - 識別公式並轉為 LaTeX
-        - 提取表格和圖表
-        """
-        results = self.extractor.predict(pdf_path)
-
-        paper_structure = {
-            "title": "",
-            "abstract": "",
-            "sections": [],
-            "formulas": [],
-            "tables": [],
-            "figures": []
-        }
-
-        for result in results:
-            for region in result:
-                region_type = region.get("type")
-
-                # 註:_extract_text 為需自行補上的輔助方法,
-                # 可沿用 5.3 節 DOCXLayoutRecovery._extract_text 的邏輯
-                if region_type == "document_title":
-                    paper_structure["title"] = self._extract_text(region)
-
-                elif region_type == "abstract":
-                    paper_structure["abstract"] = self._extract_text(region)
-
-                elif region_type == "formula":
-                    latex = region.get("content", {}).get("latex", "")
-                    if latex:
-                        paper_structure["formulas"].append(latex)
-
-                elif region_type == "table":
-                    paper_structure["tables"].append(region.get("content"))
-
-                elif region_type == "figure":
-                    paper_structure["figures"].append(region.get("content"))
-
-        # 儲存結構化結果
-        output_path = Path(output_dir)
-        output_path.mkdir(parents=True, exist_ok=True)
-
-        with open(output_path / "paper_structure.json", "w", encoding="utf-8") as f:
-            json.dump(paper_structure, f, ensure_ascii=False, indent=2)
-
-        return paper_structure
-```
-
-### 7.2 商業文檔處理
-
-```python
-class BusinessDocumentProcessor:
-    """商業文檔處理器"""
-
-    def __init__(self):
-        self.extractor = PPStructureV3(
-            use_seal_recognition=True,   # 印章識別
-            use_table_recognition=True,  # 表格識別
-        )
-
-    def process_invoice(self, pdf_path):
-        """處理發票/合約等商業文檔"""
-        results = self.extractor.predict(pdf_path)
-
-        doc_info = {
-            "text_content": [],
-            "tables": [],
-            "seals": []
-        }
-
-        for result in results:
-            for region in result:
-                region_type = region.get("type")
-
-                if region_type == "seal":
-                    doc_info["seals"].append(region.get("content"))
-
-                elif region_type == "table":
-                    doc_info["tables"].append(region.get("content"))
-
-                elif region_type == "text":
-                    # 註:_extract_text 同樣需自行補上(見 5.3 節)
-                    doc_info["text_content"].append(self._extract_text(region))
-
-        return doc_info
-```
-
----
-
-## 八、常見問題與解決方案
-
-### 8.1 記憶體不足
-
-```python
-# 方案1: 分批處理 PDF 頁面
-def process_large_pdf(pdf_path, batch_size=10):
-    import os  # 原版缺少此 import:清理臨時檔時會用到
-    import fitz  # PyMuPDF
-
-    doc = fitz.open(pdf_path)
-    total_pages = len(doc)
-
-    # 將 pipeline 建在迴圈外重複使用,避免每個批次重新載入模型
-    extractor = PPStructureV3()
-
-    for start_idx in range(0, total_pages, batch_size):
-        end_idx = min(start_idx + batch_size, total_pages)
-
-        # 提取該批次頁面為臨時 PDF
-        temp_pdf = fitz.open()
-        temp_pdf.insert_pdf(doc, from_page=start_idx, to_page=end_idx-1)
-        temp_path = f"temp_batch_{start_idx}_{end_idx}.pdf"
-        temp_pdf.save(temp_path)
-        temp_pdf.close()
-
-        # 處理臨時 PDF
-        results = extractor.predict(temp_path)
-
-        # 處理結果...
-
-        # 清理
-        os.remove(temp_path)
-
-    doc.close()
-
-# 方案2: 使用輕量級模型
-extractor = PPStructureV3(
-    text_detection_model_name="ch_PP-OCRv4_mobile_det",
-    text_recognition_model_name="ch_PP-OCRv4_mobile_rec",
-)
-```
-
-### 8.2 處理速度優化
-
-```python
-# 方案1: 僅處理必要內容
-extractor = PPStructureV3(
-    use_chart_recognition=False,    # 關閉圖表識別
-    use_formula_recognition=False,  # 關閉公式識別
-    use_seal_recognition=False,     # 關閉印章識別
-)
-
-# 方案2: 降低影像解析度
-extractor = PPStructureV3(
-    text_det_limit_side_len=960,  # 預設 1920
-)
-
-# 方案3: 啟用 MKL-DNN (CPU 加速)
-extractor = PPStructureV3(
-    enable_mkldnn=True  # CPU 環境下加速
-)
-```
-
-### 8.3 識別準確度問題
-
-```python
-# 方案1: 使用高精度模型
-extractor = PPStructureV3(
-    text_detection_model_name="ch_PP-OCRv4_server_det",
-    text_recognition_model_name="ch_PP-OCRv4_server_rec",
-)
-
-# 方案2: 調整檢測參數
-extractor = PPStructureV3(
-    text_det_thresh=0.3,        # 降低檢測閾值(預設0.5)
-    text_det_box_thresh=0.5,    # 降低邊界框閾值
-    text_det_unclip_ratio=1.8,  # 增加邊界框擴展(預設1.5)
-)
-
-# 方案3: 啟用前處理
-extractor = PPStructureV3(
-    use_doc_orientation_classify=True,  # 文檔方向校正
-    use_doc_unwarping=True,             # 文檔影像矯正
-    use_textline_orientation=True,      # 文字行方向校正
-)
-```
-
----
-
-## 九、最佳實踐建議
-
-### 9.1 選擇合適的方案
-
-| 場景 | 推薦方案 | 理由 |
-|------|---------|------|
-| 標準 PDF (可複製文字) | pdf2docx | 最快速,格式保留好 |
-| 掃描文檔 | PP-StructureV3 | 完整 OCR + 版面分析 |
-| 複雜排版 | PaddleOCR-VL | 端到端,準確度高 |
-| 學術論文 | PP-StructureV3 + 公式識別 | 支援 LaTeX 公式 |
-| 商業合約 | PP-StructureV3 + 印章識別 | 需要印章檢測 |
-
-### 9.2 版面還原質量保證
-
-```python
-class QualityAssurance:
-    """版面還原質量檢查"""
-
-    @staticmethod
-    def check_text_coverage(ocr_results, min_confidence=0.7):
-        """檢查 OCR 置信度"""
-        low_confidence_items = []
-
-        for item in ocr_results:
-            if isinstance(item, (list, tuple)) and len(item) >= 3:
-                _, text, confidence = item[0], item[1], item[2]
-                if confidence < min_confidence:
-                    low_confidence_items.append({
-                        "text": text,
-                        "confidence": confidence
-                    })
-
-        if low_confidence_items:
-            print(f"⚠ 發現 {len(low_confidence_items)} 個低置信度識別")
-            return False
-        return True
-
-    @staticmethod
-    def validate_layout_structure(layout_data):
-        
"""驗證版面結構完整性""" - issues = [] - - for page_idx, page in enumerate(layout_data): - regions = page.get("regions", []) - - # 檢查是否有標題 - if not any(r.get("type") in ["title", "document_title"] for r in regions): - issues.append(f"第 {page_idx+1} 頁缺少標題") - - # 檢查閱讀順序 - orders = [r.get("reading_order", 0) for r in regions] - if len(orders) != len(set(orders)): - issues.append(f"第 {page_idx+1} 頁閱讀順序有重複") - - if issues: - print("⚠ 版面結構問題:") - for issue in issues: - print(f" - {issue}") - return False - return True -``` - -### 9.3 錯誤處理 - -```python -class RobustDocumentProcessor: - """具備容錯機制的文檔處理器""" - - def __init__(self): - self.extractor = None - self._init_extractor() - - def _init_extractor(self, retry=3): - """初始化提取器,支援重試""" - for attempt in range(retry): - try: - self.extractor = PPStructureV3() - print("✓ 初始化成功") - break - except Exception as e: - print(f"✗ 初始化失敗 (嘗試 {attempt+1}/{retry}): {e}") - if attempt == retry - 1: - raise - - def safe_process(self, input_path, output_dir="output"): - """安全處理文檔""" - try: - # 檢查檔案存在 - if not Path(input_path).exists(): - raise FileNotFoundError(f"檔案不存在: {input_path}") - - # 執行處理 - results = self.extractor.predict(input_path) - - # 驗證結果 - if not results: - raise ValueError("處理結果為空") - - # 儲存結果 - output_path = Path(output_dir) - output_path.mkdir(parents=True, exist_ok=True) - - for idx, result in enumerate(results): - json_path = output_path / f"page_{idx+1}.json" - with open(json_path, "w", encoding="utf-8") as f: - json.dump(result, f, ensure_ascii=False, indent=2) - - print(f"✓ 處理成功: {len(results)} 頁") - return True - - except Exception as e: - print(f"✗ 處理失敗: {e}") - import traceback - traceback.print_exc() - return False -``` - ---- - -## 十、總結與建議 - -### 10.1 核心要點 - -1. **PP-StructureV3** 是最完整的解決方案,支援: - - 23 種版面元素類別 - - 表格/公式/圖表識別 - - 閱讀順序恢復 - - Markdown/JSON 輸出 - -2. **PaddleOCR-VL** 適合追求簡潔的場景: - - 端到端處理 - - 資源消耗較少 - - 109 種語言支援 - -3. **版面還原** 兩種路徑: - - 標準 PDF → pdf2docx (快速) - - 掃描文檔 → PP-StructureV3 + ReportLab/python-docx (完整) - -### 10.2 推薦工作流程 - -``` -1. 文檔預處理 - ├── 檢查 PDF 類型 (標準/掃描) - ├── 影像品質評估 - └── 頁面分割 - -2. 版面提取 - ├── PP-StructureV3.predict() - ├── 提取結構化資訊 (JSON) - └── 驗證完整性 - -3. 內容處理 - ├── OCR 文字校正 - ├── 表格結構化 - ├── 公式轉 LaTeX - └── 圖片提取 - -4. 版面還原 - ├── 解析 JSON 結構 - ├── 重建版面元素 - └── 生成目標格式 (PDF/DOCX/MD) - -5. 質量檢查 - ├── 文字覆蓋率 - ├── 版面完整性 - └── 格式一致性 -``` - -### 10.3 性能參考 - -| 配置 | 硬體 | 速度 (頁/秒) | 準確度 | -|------|------|-------------|--------| -| Mobile 模型 + CPU | Intel 8350C | ~0.27 | 85% | -| Server 模型 + V100 | V100 GPU | ~1.5 | 95% | -| Server 模型 + A100 | A100 GPU | ~3.0 | 95% | - -### 10.4 下一步建議 - -1. **建立測試集**: 準備不同類型的文檔樣本 -2. **參數調優**: 根據實際文檔調整檢測閾值 -3. **後處理優化**: 針對特定格式開發專用處理邏輯 -4. **整合 LLM**: 結合大語言模型進行智慧校正 -5. **建立監控**: 追蹤處理質量和性能指標 - ---- - -## 附錄 - -### A. 完整依賴清單 - -```txt -paddlepaddle-gpu>=3.0.0 -paddleocr>=3.0.0 -python-docx>=0.8.11 -PyMuPDF>=1.23.0 -pdf2docx>=0.5.6 -reportlab>=4.0.0 -markdown>=3.5.0 -opencv-python>=4.8.0 -Pillow>=10.0.0 -lxml>=4.9.0 -beautifulsoup4>=4.12.0 -``` - -### B. 環境變數配置 - -```bash -# 設定模型下載源 -export PADDLE_PDX_MODEL_SOURCE=HuggingFace # 或 BOS - -# 啟用 GPU -export CUDA_VISIBLE_DEVICES=0 - -# 設定快取目錄 -export PADDLEX_CACHE_DIR=/path/to/cache -``` - -### C. 
相關資源 - -- **官方文檔**: https://paddlepaddle.github.io/PaddleOCR/ -- **GitHub**: https://github.com/PaddlePaddle/PaddleOCR -- **模型庫**: https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_ch/models_list.md -- **技術論文**: https://arxiv.org/abs/2507.05595 - ---- - -**文檔版本**: v1.0 -**最後更新**: 2025-11-18 -**作者**: Claude + PaddleOCR 技術團隊 diff --git a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/proposal.md b/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/proposal.md deleted file mode 100644 index 7bf2653..0000000 --- a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/proposal.md +++ /dev/null @@ -1,148 +0,0 @@ -# Implement Layout-Preserving PDF Generation and Preview - -## Problem - -Testing revealed three critical issues affecting user experience: - -### 1. PDF Download Returns 403 Forbidden -- **Endpoint**: `GET /api/v2/tasks/{task_id}/download/pdf` -- **Error**: Backend returns HTTP 403 Forbidden -- **Impact**: Users cannot download PDF format results -- **Root Cause**: PDF generation service not implemented - -### 2. Result Preview Shows Placeholder Text Instead of Layout-Preserving Content -- **Affected Pages**: - - Results page (`/results`) - - Task Detail page (`/tasks/{taskId}`) -- **Current Behavior**: Both pages display placeholder message "請使用上方下載按鈕下載 Markdown、JSON 或 PDF 格式查看完整結果" -- **Problem**: Users cannot preview OCR results with original document layout preserved -- **Impact**: Poor user experience - users cannot verify OCR accuracy visually - -### 3. Images Extracted by PP-StructureV3 Are Not Saved to Disk -- **Affected File**: `backend/app/services/ocr_service.py:554-561` -- **Current Behavior**: - - PP-StructureV3 extracts images from documents (tables, charts, figures) - - `analyze_layout()` receives image objects in `markdown_images` dictionary - - Code only saves image path strings to JSON, never saves actual image files - - Result directory contains no `imgs/` folder with extracted images -- **Impact**: - - JSON references non-existent files (e.g., `imgs/img_in_table_box_*.jpg`) - - Layout-preserving PDF cannot embed images because source files don't exist - - Loss of critical visual content from original documents -- **Root Cause**: Missing image file saving logic in `analyze_layout()` function - -## Proposed Changes - -### Change 0: Fix Image Extraction and Saving (PREREQUISITE) -Modify OCR service to save extracted images to disk before PDF generation can embed them. - -**Implementation approach:** -1. **Update `analyze_layout()` Function** - - Locate image saving code at `ocr_service.py:554-561` - - Extract `img_obj` from `markdown_images.items()` - - Create `imgs/` subdirectory in result folder - - Save each `img_obj` to disk using PIL `Image.save()` - - Verify saved file path matches JSON `images_metadata` - -2. **File Naming and Organization** - - PP-StructureV3 generates paths like `imgs/img_in_table_box_145_1253_2329_2488.jpg` - - Create full path: `{result_dir}/{img_path}` - - Ensure parent directories exist before saving - - Handle image format conversion if needed (PNG, JPEG) - -3. 
**Error Handling** - - Log warnings if image objects are missing or corrupt - - Continue processing even if individual images fail - - Include error info in images_metadata for debugging - -**Why This is Critical:** -- Without saved images, layout-preserving PDF cannot embed visual content -- Images contain crucial information (charts, diagrams, table contents) -- PP-StructureV3 already does the hard work of extraction - we just need to save them - -### Change 1: Implement Layout-Preserving PDF Generation Service -Create a PDF generation service that reconstructs the original document layout from OCR JSON data. - -**Implementation approach:** -1. **Parse JSON OCR Results** - - Read `text_regions` array containing text, bounding boxes, confidence scores - - Extract page dimensions from original file or infer from bbox coordinates - - Group elements by page number - -2. **Generate PDF with ReportLab** - - Create PDF canvas with original page dimensions - - Iterate through each text region - - Draw text at precise coordinates from bbox - - Support Chinese fonts (e.g., Noto Sans CJK, Source Han Sans) - - Optionally draw bounding boxes for visualization - -3. **Handle Complex Elements** - - Text: Draw at bbox coordinates with appropriate font size - - Tables: Reconstruct from layout analysis (if available) - - Images: Embed from `images_metadata` - - Preserve rotation/skew from bbox geometry - -4. **Caching Strategy** - - Generate PDF once per task completion - - Store in task result directory as `{filename}_layout.pdf` - - Serve cached version on subsequent requests - - Regenerate only if JSON changes - -**Technical stack:** -- **ReportLab**: PDF generation with precise coordinate control -- **Pillow**: Extract dimensions from source images/PDFs, embed extracted images -- **Chinese fonts**: Noto Sans CJK or Source Han Sans (需安裝) - -### Change 2: Implement In-Browser PDF Preview -Replace placeholder text with interactive PDF preview using react-pdf. - -**Implementation approach:** -1. **Install react-pdf** - ```bash - npm install react-pdf - ``` - -2. **Create PDF Viewer Component** - - Fetch PDF from `/api/v2/tasks/{task_id}/download/pdf` - - Render using `` and `` from react-pdf - - Add zoom controls, page navigation - - Show loading spinner while PDF loads - -3. 
**Update ResultsPage and TaskDetailPage** - - Replace placeholder with PDF viewer - - Add download button above viewer - - Handle errors gracefully (show error if PDF unavailable) - -**Benefits:** -- Users see OCR results with original layout preserved -- Visual verification of OCR accuracy -- No download required for quick review -- Professional presentation of results - -## Scope - -**In scope:** -- Fix image extraction to save extracted images to disk (PREREQUISITE) -- Implement layout-preserving PDF generation service from JSON -- Install and configure Chinese fonts (Noto Sans CJK) -- Create PDF viewer component with react-pdf -- Add PDF preview to Results page and Task Detail page -- Cache generated PDFs for performance -- Embed extracted images into layout-preserving PDF -- Error handling for image saving, PDF generation and preview failures - -**Out of scope:** -- OCR result editing in preview -- Advanced PDF features (annotations, search, highlights) -- Excel/JSON inline preview -- Real-time PDF regeneration (will use cached version) - -## Impact - -- **User Experience**: Major improvement - layout-preserving visual preview with images -- **Backend**: Significant changes - image saving fix, new PDF generation service -- **Frontend**: Medium changes - PDF viewer integration -- **Dependencies**: New - ReportLab, react-pdf, Chinese fonts (Pillow already installed) -- **Performance**: Medium - PDF generation cached after first request, minimal overhead for image saving -- **Risk**: Medium - complex coordinate transformation, font rendering, image embedding -- **Data Integrity**: High improvement - images now properly preserved alongside text diff --git a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/specs/result-export/spec.md b/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/specs/result-export/spec.md deleted file mode 100644 index dd2911e..0000000 --- a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/specs/result-export/spec.md +++ /dev/null @@ -1,57 +0,0 @@ -# Result Export - Delta Changes - -## ADDED Requirements - -### Requirement: Image Extraction and Persistence -The OCR system SHALL save extracted images to disk during layout analysis for later use in PDF generation. - -#### Scenario: Images extracted by PP-StructureV3 are saved to disk -- **WHEN** OCR processes a document containing images (charts, tables, figures) -- **THEN** system SHALL extract image objects from `markdown_images` dictionary -- **AND** system SHALL create `imgs/` subdirectory in result folder -- **AND** system SHALL save each image object to disk using PIL Image.save() -- **AND** saved file paths SHALL match paths recorded in JSON `images_metadata` -- **AND** system SHALL log warnings for failed image saves but continue processing - -#### Scenario: Multi-page documents with images on different pages -- **WHEN** OCR processes multi-page PDF with images on multiple pages -- **THEN** system SHALL save images from all pages to same `imgs/` folder -- **AND** image filenames SHALL include bbox coordinates for uniqueness -- **AND** images SHALL be available for PDF generation after OCR completes - -### Requirement: Layout-Preserving PDF Generation -The system SHALL generate PDF files that preserve the original document layout using OCR JSON data. 
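-
-Implementation note: the crux of this requirement is mapping OCR image coordinates (top-left origin, y grows downward) into PDF user space (bottom-left origin, y grows upward). A minimal sketch of that transform, assuming `bbox` is `[x1, y1, x2, y2]` in pixels and that a CJK font has already been registered; the names below are illustrative, not the final service API:
-
-```python
-from reportlab.pdfgen import canvas
-
-def draw_text_region(c: canvas.Canvas, region: dict, page_height: float) -> None:
-    """Draw one OCR text region at its original position (illustrative sketch)."""
-    x1, y1, x2, y2 = region["bbox"]
-    font_size = max(6, (y2 - y1) * 0.8)  # derive font size from box height
-    c.setFont("NotoSansCJK", font_size)  # assumes the font was registered via pdfmetrics
-    # Flip the y axis: PDF y = page_height - OCR y (anchor at the box bottom edge);
-    # a full implementation would also scale pixels to PDF points.
-    c.drawString(x1, page_height - y2, region["text"])
-```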
- -#### Scenario: PDF generated from JSON with accurate layout -- **WHEN** user requests PDF download for a completed task -- **THEN** system SHALL parse OCR JSON result file -- **AND** system SHALL extract bounding box coordinates for each text region -- **AND** system SHALL determine page dimensions from source file or bbox maximum values -- **AND** system SHALL generate PDF with text positioned at precise coordinates -- **AND** system SHALL use Chinese-compatible font (e.g., Noto Sans CJK) -- **AND** system SHALL embed images from `imgs/` folder using paths in `images_metadata` -- **AND** generated PDF SHALL visually resemble original document layout with images - -#### Scenario: PDF download works correctly -- **WHEN** user clicks PDF download button -- **THEN** system SHALL return cached PDF if already generated -- **OR** system SHALL generate new PDF from JSON on first request -- **AND** system SHALL NOT return 403 Forbidden error -- **AND** downloaded PDF SHALL contain task OCR results with layout preserved - -#### Scenario: Multi-page PDF generation -- **WHEN** OCR JSON contains results for multiple pages -- **THEN** generated PDF SHALL contain same number of pages -- **AND** each page SHALL display text regions for that page only -- **AND** page dimensions SHALL match original document pages - -## MODIFIED Requirements - -### Requirement: Export Interface -The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs. - -#### Scenario: PDF caching improves performance -- **WHEN** user downloads same PDF multiple times -- **THEN** system SHALL serve cached PDF file on subsequent requests -- **AND** system SHALL NOT regenerate PDF unless JSON changes -- **AND** download response time SHALL be faster than initial generation diff --git a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/specs/task-management/spec.md b/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/specs/task-management/spec.md deleted file mode 100644 index d24c0d7..0000000 --- a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/specs/task-management/spec.md +++ /dev/null @@ -1,63 +0,0 @@ -# Task Management - Delta Changes - -## MODIFIED Requirements - -### Requirement: Task Result Display -The system SHALL provide interactive PDF preview of OCR results with layout preservation on Results and Task Detail pages. - -#### Scenario: Results page shows layout-preserving PDF preview -- **WHEN** Results page loads with a completed task -- **THEN** page SHALL fetch PDF from `/api/v2/tasks/{task_id}/download/pdf` -- **AND** page SHALL render PDF using react-pdf PDFViewer component -- **AND** page SHALL NOT show placeholder text "請使用上方下載按鈕..." 
-- **AND** PDF SHALL display with original document layout preserved -- **AND** PDF SHALL support zoom and page navigation controls - -#### Scenario: Task detail page shows PDF preview -- **WHEN** Task Detail page loads for a completed task -- **THEN** page SHALL fetch layout-preserving PDF -- **AND** page SHALL render PDF using PDFViewer component -- **AND** page SHALL NOT show placeholder text -- **AND** PDF SHALL visually match original document layout - -#### Scenario: Preview handles loading state -- **WHEN** PDF is being generated or fetched -- **THEN** page SHALL display loading spinner -- **AND** page SHALL show progress indicator during PDF generation -- **AND** page SHALL NOT show error or placeholder text - -#### Scenario: Preview handles errors gracefully -- **WHEN** PDF generation fails or file is missing -- **THEN** page SHALL display helpful error message -- **AND** error message SHALL suggest trying download again or contact support -- **AND** page SHALL NOT crash or expose technical errors to user -- **AND** page MAY fallback to markdown preview if PDF unavailable - -## ADDED Requirements - -### Requirement: Interactive PDF Viewer Features -The PDF viewer component SHALL provide essential viewing controls for user convenience. - -#### Scenario: PDF viewer provides zoom controls -- **WHEN** user views PDF preview -- **THEN** viewer SHALL provide zoom in (+) and zoom out (-) buttons -- **AND** viewer SHALL provide fit-to-width option -- **AND** viewer SHALL provide fit-to-page option -- **AND** zoom level SHALL persist during page navigation - -#### Scenario: PDF viewer provides page navigation -- **WHEN** PDF contains multiple pages -- **THEN** viewer SHALL display current page number and total pages -- **AND** viewer SHALL provide previous/next page buttons -- **AND** viewer SHALL provide page selector dropdown -- **AND** page navigation SHALL be smooth without flickering - -### Requirement: Frontend PDF Library Integration -The frontend SHALL use react-pdf for PDF rendering capabilities. - -#### Scenario: react-pdf configured correctly -- **WHEN** application initializes -- **THEN** react-pdf library SHALL be installed and imported -- **AND** PDF.js worker SHALL be configured properly -- **AND** worker path SHALL point to correct pdfjs-dist worker file -- **AND** PDF rendering SHALL work without console errors diff --git a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/tasks.md b/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/tasks.md deleted file mode 100644 index 706192c..0000000 --- a/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/tasks.md +++ /dev/null @@ -1,106 +0,0 @@ -# Implementation Tasks - -## 1. 
Backend - Fix Image Extraction and Saving (PREREQUISITE) ✅ -- [x] 1.1 Locate `analyze_layout()` function in `backend/app/services/ocr_service.py` -- [x] 1.2 Find image saving code at lines 554-561 where `markdown_images.items()` is iterated -- [x] 1.3 Add code to create `imgs/` subdirectory in result folder before saving images -- [x] 1.4 Extract `img_obj` from `(img_path, img_obj)` tuple in loop -- [x] 1.5 Construct full image file path: `image_path.parent / img_path` -- [x] 1.6 Save each `img_obj` to disk using PIL `Image.save()` method -- [x] 1.7 Add error handling for image save failures (log warning but continue) -- [x] 1.8 Test with document containing images - verify `imgs/` folder created -- [x] 1.9 Verify saved image files match paths in JSON `images_metadata` -- [x] 1.10 Test multi-page PDF with images on different pages - -## 2. Backend - Environment Setup ✅ -- [x] 2.1 Install ReportLab library: `pip install reportlab` -- [x] 2.2 Verify Pillow is already installed (used for image handling) -- [x] 2.3 Download and install Noto Sans CJK font (TrueType format) -- [x] 2.4 Configure font path in backend settings -- [x] 2.5 Test Chinese character rendering - -## 3. Backend - PDF Generation Service ✅ -- [x] 3.1 Create `pdf_generator_service.py` in `app/services/` -- [x] 3.2 Implement `load_ocr_json(json_path)` to parse JSON results -- [x] 3.3 Implement `calculate_page_dimensions(text_regions)` to infer page size from bbox -- [x] 3.4 Implement `get_original_page_size(file_path)` to extract from source file -- [x] 3.5 Implement `draw_text_region(canvas, region, font, page_height)` to render text at bbox -- [x] 3.6 Implement `generate_layout_pdf(json_path, output_path)` main function -- [x] 3.7 Handle coordinate transformation (OCR coords to PDF coords) -- [x] 3.8 Add font size calculation based on bbox height -- [x] 3.9 Handle multi-page documents -- [x] 3.10 Add caching logic (check if PDF already exists) -- [x] 3.11 Implement `draw_table_region(canvas, region)` using ReportLab Table -- [x] 3.12 Implement `draw_image_region(canvas, region)` from images_metadata (reads from saved imgs/) - -## 4. Backend - PDF Download Endpoint Fix ✅ -- [x] 4.1 Update `/tasks/{id}/download/pdf` endpoint in tasks.py router -- [x] 4.2 Check if PDF already exists; if not, trigger on-demand generation -- [x] 4.3 Serve pre-generated PDF file from task result directory -- [x] 4.4 Add error handling for missing PDF or generation failures -- [x] 4.5 Test PDF download endpoint returns 200 with valid PDF - -## 5. Backend - Integrate PDF Generation into OCR Flow (REQUIRED) ✅ -- [x] 5.1 Modify OCR service to generate PDF automatically after JSON creation -- [x] 5.2 Update `save_results()` to return (json_path, markdown_path, pdf_path) -- [x] 5.3 PDF generation integrated into OCR completion flow -- [x] 5.4 PDF generated synchronously during OCR processing (avoids timeout issues) -- [x] 5.5 Test PDF generation triggers automatically after OCR completes - -## 6. Frontend - Install Dependencies ✅ -- [x] 6.1 Install react-pdf: `npm install react-pdf` -- [x] 6.2 Install pdfjs-dist (peer dependency): `npm install pdfjs-dist` -- [x] 6.3 Configure vite for PDF.js worker and optimization - -## 7. 
Frontend - Create PDF Viewer Component ✅
-- [x] 7.1 Create `PDFViewer.tsx` component in `components/`
-- [x] 7.2 Implement Document and Page rendering from react-pdf
-- [x] 7.3 Add zoom controls (zoom in/out, 50%-300%)
-- [x] 7.4 Add page navigation (previous, next, page counter)
-- [x] 7.5 Add loading spinner while PDF loads
-- [x] 7.6 Add error boundary for PDF loading failures
-- [x] 7.7 Style PDF container with proper sizing and authentication support
-
-## 8. Frontend - Results Page Integration ✅
-- [x] 8.1 Import PDFViewer component in ResultsPage.tsx
-- [x] 8.2 Construct PDF URL from task data
-- [x] 8.3 Replace placeholder text with PDFViewer
-- [x] 8.4 Add authentication headers (Bearer token)
-- [x] 8.5 Test PDF preview rendering
-
-## 9. Frontend - Task Detail Page Integration ✅
-- [x] 9.1 Import PDFViewer component in TaskDetailPage.tsx
-- [x] 9.2 Construct PDF URL from task data
-- [x] 9.3 Replace placeholder text with PDFViewer
-- [x] 9.4 Add authentication headers (Bearer token)
-- [x] 9.5 Test PDF preview rendering
-
-## 10. Testing ⚠️ (待實際 OCR 任務測試)
-
-### 基本驗證 (已完成) ✅
-- [x] 10.1 Backend service imports successfully
-- [x] 10.2 Frontend TypeScript compilation passes
-- [x] 10.3 PDF Generator Service loads correctly
-- [x] 10.4 OCR Service loads with image saving updates
-
-### 功能測試 (需實際 OCR 任務)
-- [x] 10.5 Fixed page filtering issue for tables and images (修復表格與圖片頁碼分配錯誤)
-- [x] 10.6 Adjusted rendering order (images → tables → text) to prevent overlapping
-- [x] 10.7 **Fixed text filtering logic** (使用正確的數據來源 images_metadata,修復文字與表格/圖片重疊問題)
-- [ ] 10.8 Test image extraction and saving (verify imgs/ folder created with correct files)
-- [ ] 10.9 Test image saving with multi-page PDFs
-- [ ] 10.10 Test PDF generation with single-page document
-- [ ] 10.11 Test PDF generation with multi-page document
-- [ ] 10.12 Test Chinese character rendering in PDF
-- [ ] 10.13 Test coordinate accuracy (verify text positioned correctly)
-- [ ] 10.14 Test table rendering in PDF (if JSON contains tables)
-- [ ] 10.15 Test image embedding in PDF (verify images from imgs/ folder appear correctly)
-- [ ] 10.16 Test PDF caching (second request uses cached version)
-- [ ] 10.17 Test automatic PDF generation after OCR completion
-- [ ] 10.18 Test PDF download from Results page
-- [ ] 10.19 Test PDF download from Task Detail page
-- [ ] 10.20 Test PDF preview on Results page
-- [ ] 10.21 Test PDF preview on Task Detail page
-- [ ] 10.22 Test error handling when JSON is missing
-- [ ] 10.23 Test error handling when PDF generation fails
-- [ ] 10.24 Test error handling when image files are missing or corrupt
diff --git a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/FRONTEND_IMPLEMENTATION.md b/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/FRONTEND_IMPLEMENTATION.md
deleted file mode 100644
index 7869e4d..0000000
--- a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/FRONTEND_IMPLEMENTATION.md
+++ /dev/null
@@ -1,519 +0,0 @@
-# 前端實作完成 - External Authentication & Task History
-
-## 實作日期
-2025-11-14
-
-## 狀態
-✅ **前端核心功能完成**
-- V2 認證服務整合
-- 登入頁面更新
-- 任務歷史頁面
-- 導航整合
-
----
-
-## 📋 已完成項目
-
-### 1. 
V2 API 服務層 ✅ - -#### **檔案:`frontend/src/services/apiV2.ts`** - -**核心功能:** -```typescript -class ApiClientV2 { - // 認證管理 - async login(data: LoginRequest): Promise - async logout(sessionId?: number): Promise - async getMe(): Promise - async listSessions(): Promise - - // 任務管理 - async createTask(data: TaskCreate): Promise - async listTasks(params): Promise - async getTaskStats(): Promise - async getTask(taskId: string): Promise - async updateTask(taskId: string, data: TaskUpdate): Promise - async deleteTask(taskId: string): Promise - - // 輔助方法 - async downloadTaskFile(url: string, filename: string): Promise -} -``` - -**特色:** -- 自動 token 管理(localStorage) -- 401 自動重定向到登入 -- Session 過期檢測 -- 用戶資訊快取 - -#### **檔案:`frontend/src/types/apiV2.ts`** - -完整類型定義: -- `UserInfo`, `LoginResponseV2`, `SessionInfo` -- `Task`, `TaskCreate`, `TaskUpdate`, `TaskDetail` -- `TaskStats`, `TaskListResponse`, `TaskFilters` -- `TaskStatus` 枚舉 - ---- - -### 2. 登入頁面更新 ✅ - -#### **檔案:`frontend/src/pages/LoginPage.tsx`** - -**變更:** -```typescript -// 舊版(V1) -await apiClient.login({ username, password }) -setUser({ id: 1, username }) - -// 新版(V2) -const response = await apiClientV2.login({ username, password }) -setUser({ - id: response.user.id, - username: response.user.email, - email: response.user.email, - displayName: response.user.display_name -}) -``` - -**功能:** -- ✅ 整合外部 Azure AD 認證 -- ✅ 顯示用戶顯示名稱 -- ✅ 錯誤訊息處理 -- ✅ 保持原有 UI 設計 - ---- - -### 3. 任務歷史頁面 ✅ - -#### **檔案:`frontend/src/pages/TaskHistoryPage.tsx`** - -**核心功能:** - -1. **統計儀表板** - - 總計、待處理、處理中、已完成、失敗 - - 卡片式呈現 - - 即時更新 - -2. **篩選功能** - - 按狀態篩選(全部/pending/processing/completed/failed) - - 未來可擴展:日期範圍、檔名搜尋 - -3. **任務列表** - - 分頁顯示(每頁 20 筆) - - 欄位:檔案名稱、狀態、建立時間、完成時間、處理時間 - - 操作:查看詳情、刪除 - -4. **狀態徽章** - ```typescript - pending → 灰色 + 時鐘圖標 - processing → 藍色 + 旋轉圖標 - completed → 綠色 + 勾選圖標 - failed → 紅色 + X 圖標 - ``` - -5. **分頁控制** - - 上一頁/下一頁 - - 顯示當前範圍(1-20 / 共 45 個) - - 自動禁用按鈕 - -**UI 組件使用:** -- `Card` - 統計卡片和主容器 -- `Table` - 任務列表表格 -- `Badge` - 狀態標籤 -- `Button` - 操作按鈕 -- `Select` - 狀態篩選下拉選單 - ---- - -### 4. 路由整合 ✅ - -#### **檔案:`frontend/src/App.tsx`** - -新增路由: -```typescript -} /> -``` - -**路由結構:** -``` -/login - 登入頁面(公開) -/ - 根路徑(重定向到 /upload) - /upload - 上傳檔案 - /processing - 處理進度 - /results - 查看結果 - /tasks - 任務歷史 (NEW!) - /export - 導出文件 - /settings - 系統設定 -``` - ---- - -### 5. 導航更新 ✅ - -#### **檔案:`frontend/src/components/Layout.tsx`** - -**新增導航項:** -```typescript -{ - to: '/tasks', - label: '任務歷史', - icon: History, - description: '查看任務記錄' -} -``` - -**Logout 邏輯更新:** -```typescript -const handleLogout = async () => { - try { - // 優先使用 V2 API - if (apiClientV2.isAuthenticated()) { - await apiClientV2.logout() - } else { - apiClient.logout() - } - } finally { - logout() // 清除本地狀態 - } -} -``` - -**用戶資訊顯示:** -- 顯示名稱:`user.displayName || user.username` -- Email:`user.email || user.username` -- 頭像:首字母大寫 - ---- - -### 6. 類型擴展 ✅ - -#### **檔案:`frontend/src/types/api.ts`** - -擴展 User 介面: -```typescript -export interface User { - id: number - username: string - email?: string // NEW - displayName?: string | null // NEW -} -``` - ---- - -## 🎨 UI/UX 特色 - -### 任務歷史頁面設計亮點: - -1. **響應式卡片佈局** - - Grid 5 欄(桌面)/ 1 欄(手機) - - 統計數據卡片 hover 效果 - -2. **清晰的狀態視覺化** - - 彩色徽章 - - 動畫圖標(processing 狀態旋轉) - - 語意化顏色 - -3. **操作反饋** - - 載入動畫(Loader2) - - 空狀態提示 - - 錯誤警告 - -4. 
**用戶友好** - - 確認刪除對話框 - - 刷新按鈕 - - 分頁資訊明確 - ---- - -## 🔄 向後兼容 - -### V1 與 V2 並存策略 - -**認證服務:** -- V1: `apiClient` (原有本地認證) -- V2: `apiClientV2` (新外部認證) - -**登入流程:** -- 新用戶使用 V2 API 登入 -- 舊 session 仍可使用 V1 API - -**Logout 處理:** -```typescript -if (apiClientV2.isAuthenticated()) { - await apiClientV2.logout() // 呼叫後端 /api/v2/auth/logout -} else { - apiClient.logout() // 僅清除本地 token -} -``` - ---- - -## 📱 使用流程 - -### 1. 登入 -``` -用戶訪問 /login -→ 輸入 email + password -→ apiClientV2.login() 呼叫外部 API -→ 接收 access_token + user info -→ 存入 localStorage -→ 重定向到 /upload -``` - -### 2. 查看任務歷史 -``` -用戶點擊「任務歷史」導航 -→ 訪問 /tasks -→ apiClientV2.listTasks() 獲取任務列表 -→ apiClientV2.getTaskStats() 獲取統計 -→ 顯示任務表格 + 統計卡片 -``` - -### 3. 篩選任務 -``` -用戶選擇狀態篩選器(例:completed) -→ setStatusFilter('completed') -→ useEffect 觸發重新 fetchTasks() -→ 呼叫 apiClientV2.listTasks({ status: 'completed' }) -→ 更新任務列表 -``` - -### 4. 刪除任務 -``` -用戶點擊刪除按鈕 -→ 確認對話框 -→ apiClientV2.deleteTask(taskId) -→ 重新載入任務列表和統計 -``` - -### 5. 分頁導航 -``` -用戶點擊「下一頁」 -→ setPage(page + 1) -→ useEffect 觸發 fetchTasks() -→ 呼叫 listTasks({ page: 2 }) -→ 更新任務列表 -``` - ---- - -## 🧪 測試指南 - -### 手動測試步驟: - -#### 1. 測試登入 -```bash -# 啟動後端 -cd backend -source venv/bin/activate -python -m app.main - -# 啟動前端 -cd frontend -npm run dev - -# 訪問 http://localhost:5173/login -# 輸入 Azure AD 憑證 -# 確認登入成功並顯示用戶名稱 -``` - -#### 2. 測試任務歷史 -```bash -# 登入後點擊側邊欄「任務歷史」 -# 確認統計卡片顯示正確數字 -# 確認任務列表載入 -# 測試狀態篩選 -# 測試分頁功能 -``` - -#### 3. 測試任務刪除 -```bash -# 在任務列表點擊刪除按鈕 -# 確認刪除確認對話框 -# 確認刪除後列表更新 -# 確認統計數字更新 -``` - -#### 4. 測試 Logout -```bash -# 點擊側邊欄登出按鈕 -# 確認清除 localStorage -# 確認重定向到登入頁面 -# 再次登入確認一切正常 -``` - ---- - -## 🔧 已知限制 - -### 目前未實作項目: - -1. **任務詳情頁面** (`/tasks/:taskId`) - - 顯示完整任務資訊 - - 下載結果檔案(JSON/Markdown/PDF) - - 查看任務文件列表 - -2. **進階篩選** - - 日期範圍選擇器 - - 檔案名稱搜尋 - - 多條件組合篩選 - -3. **批次操作** - - 批次刪除任務 - - 批次下載結果 - -4. **即時更新** - - WebSocket 連接 - - 任務狀態即時推送 - - 自動刷新處理中的任務 - -5. **錯誤詳情** - - 展開查看 `error_message` - - 失敗任務重試功能 - ---- - -## 💡 未來擴展建議 - -### 短期優化(1-2 週): - -1. **任務詳情頁面** - ```typescript - // frontend/src/pages/TaskDetailPage.tsx - const task = await apiClientV2.getTask(taskId) - // 顯示完整資訊 + 下載按鈕 - ``` - -2. **檔案下載** - ```typescript - const handleDownload = async (path: string, filename: string) => { - await apiClientV2.downloadTaskFile(path, filename) - } - ``` - -3. **日期範圍篩選** - ```typescript - { - setDateFrom(range.from) - setDateTo(range.to) - }} - /> - ``` - -### 中期功能(1 個月): - -4. **即時狀態更新** - - 使用 WebSocket 或 Server-Sent Events - - 自動更新 processing 任務狀態 - -5. **批次操作** - - 複選框選擇多個任務 - - 批次刪除/下載 - -6. **搜尋功能** - - 檔案名稱模糊搜尋 - - 全文搜尋(需後端支援) - -### 長期規劃(3 個月): - -7. **任務視覺化** - - 時間軸視圖 - - 甘特圖(處理進度) - - 統計圖表(ECharts) - -8. **通知系統** - - 任務完成通知 - - 錯誤警報 - - 瀏覽器通知 API - -9. 
**導出功能** - - 任務報表導出(Excel/PDF) - - 統計資料導出 - ---- - -## 📝 程式碼範例 - -### 在其他頁面使用 V2 API - -```typescript -// Example: 在 UploadPage 創建任務 -import { apiClientV2 } from '@/services/apiV2' - -const handleUpload = async (file: File) => { - try { - // 創建任務 - const task = await apiClientV2.createTask({ - filename: file.name, - file_type: file.type - }) - - console.log('Task created:', task.task_id) - - // TODO: 上傳檔案到雲端存儲 - // TODO: 更新任務狀態為 processing - // TODO: 呼叫 OCR 服務 - } catch (error) { - console.error('Upload failed:', error) - } -} -``` - -### 監聽任務狀態變化 - -```typescript -// Example: 輪詢任務狀態 -const pollTaskStatus = async (taskId: string) => { - const interval = setInterval(async () => { - try { - const task = await apiClientV2.getTask(taskId) - - if (task.status === 'completed') { - clearInterval(interval) - alert('任務完成!') - } else if (task.status === 'failed') { - clearInterval(interval) - alert(`任務失敗:${task.error_message}`) - } - } catch (error) { - clearInterval(interval) - console.error('Poll error:', error) - } - }, 5000) // 每 5 秒檢查一次 -} -``` - ---- - -## ✅ 完成清單 - -- [x] V2 API 服務層(`apiV2.ts`) -- [x] V2 類型定義(`apiV2.ts`) -- [x] 登入頁面整合 V2 -- [x] 任務歷史頁面 -- [x] 統計儀表板 -- [x] 狀態篩選 -- [x] 分頁功能 -- [x] 任務刪除 -- [x] 路由整合 -- [x] 導航更新 -- [x] Logout 更新 -- [x] 用戶資訊顯示 -- [ ] 任務詳情頁面(待實作) -- [ ] 檔案下載(待實作) -- [ ] 即時狀態更新(待實作) -- [ ] 批次操作(待實作) - ---- - -**實作完成日期**:2025-11-14 -**實作人員**:Claude Code -**前端框架**:React + TypeScript + Vite -**UI 庫**:Tailwind CSS + shadcn/ui -**狀態管理**:Zustand -**HTTP 客戶端**:Axios diff --git a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/IMPLEMENTATION_COMPLETE.md b/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/IMPLEMENTATION_COMPLETE.md deleted file mode 100644 index 629ec88..0000000 --- a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/IMPLEMENTATION_COMPLETE.md +++ /dev/null @@ -1,556 +0,0 @@ -# External API Authentication Implementation - Complete ✅ - -## 實作日期 -2025-11-14 - -## 狀態 -✅ **後端實作完成** - Phase 1-8 已完成 -⏳ **前端實作待續** - Phase 9-11 待實作 -📋 **測試與文檔** - Phase 12-13 待完成 - ---- - -## 📋 已完成階段 (Phase 1-8) - -### Phase 1: 資料庫架構設計 ✅ - -#### 創建的模型文件: -1. **`backend/app/models/user_v2.py`** - 新用戶模型 - - 資料表:`tool_ocr_users` - - 欄位:`id`, `email`, `display_name`, `created_at`, `last_login`, `is_active` - - 特點:無密碼欄位(外部認證)、email 作為主要識別 - -2. **`backend/app/models/task.py`** - 任務模型 - - 資料表:`tool_ocr_tasks`, `tool_ocr_task_files` - - 任務狀態:PENDING, PROCESSING, COMPLETED, FAILED - - 用戶隔離:外鍵關聯 `user_id`,CASCADE 刪除 - -3. **`backend/app/models/session.py`** - Session 管理 - - 資料表:`tool_ocr_sessions` - - 儲存:access_token, id_token, refresh_token (加密) - - 追蹤:expires_at, ip_address, user_agent, last_accessed_at - -#### 資料庫遷移: -- **檔案**:`backend/alembic/versions/5e75a59fb763_add_external_auth_schema_with_task_.py` -- **狀態**:已套用 (alembic stamp head) -- **變更**:創建 4 個新表 (users, sessions, tasks, task_files) -- **策略**:保留舊表,不刪除(避免外鍵約束錯誤) - ---- - -### Phase 2: 配置管理 ✅ - -#### 環境變數 (`.env.local`): -```bash -# External Authentication -EXTERNAL_AUTH_API_URL=https://pj-auth-api.vercel.app -EXTERNAL_AUTH_ENDPOINT=/api/auth/login -EXTERNAL_AUTH_TIMEOUT=30 -TOKEN_REFRESH_BUFFER=300 - -# Task Management -DATABASE_TABLE_PREFIX=tool_ocr_ -ENABLE_TASK_HISTORY=true -TASK_RETENTION_DAYS=30 -MAX_TASKS_PER_USER=1000 -``` - -#### 配置類 (`backend/app/core/config.py`): -- 新增外部認證配置屬性 -- 新增 `external_auth_full_url` property -- 新增任務管理配置參數 - ---- - -### Phase 3: 服務層實作 ✅ - -#### 1. 
外部認證服務 (`backend/app/services/external_auth_service.py`) - -**核心功能:** -```python -class ExternalAuthService: - async def authenticate_user(username, password) -> tuple[bool, AuthResponse, error] - # 呼叫外部 API:POST https://pj-auth-api.vercel.app/api/auth/login - # 重試邏輯:3 次,指數退避 - # 返回:success, auth_data (tokens + user_info), error_msg - - async def validate_token(access_token) -> tuple[bool, payload] - # TODO: 完整 JWT 驗證(簽名、過期時間等) - - def is_token_expiring_soon(expires_at) -> bool - # 檢查是否在 TOKEN_REFRESH_BUFFER 內過期 -``` - -**錯誤處理:** -- HTTP 超時自動重試 -- 5xx 錯誤指數退避 -- 完整日誌記錄 - -#### 2. 任務管理服務 (`backend/app/services/task_service.py`) - -**核心功能:** -```python -class TaskService: - # 創建與查詢 - def create_task(db, user_id, filename, file_type) -> Task - def get_task_by_id(db, task_id, user_id) -> Task # 用戶隔離 - def get_user_tasks(db, user_id, status, skip, limit) -> (tasks, total) - - # 更新 - def update_task_status(db, task_id, user_id, status, error, time_ms) -> Task - def update_task_results(db, task_id, user_id, paths...) -> Task - - # 刪除與清理 - def delete_task(db, task_id, user_id) -> bool - def auto_cleanup_expired_tasks(db) -> int # 根據 TASK_RETENTION_DAYS - - # 統計 - def get_user_stats(db, user_id) -> dict # 按狀態統計 -``` - -**安全特性:** -- 所有查詢強制 `user_id` 過濾 -- 自動任務限額檢查 -- 過期任務自動清理 - ---- - -### Phase 4-6: API 端點實作 ✅ - -#### 1. 認證端點 (`backend/app/routers/auth_v2.py`) - -**路由前綴**:`/api/v2/auth` - -| 端點 | 方法 | 描述 | 認證 | -|------|------|------|------| -| `/login` | POST | 外部 API 登入 | 無 | -| `/logout` | POST | 登出 (刪除 session) | 需要 | -| `/me` | GET | 獲取當前用戶資訊 | 需要 | -| `/sessions` | GET | 列出用戶所有 sessions | 需要 | - -**Login 流程:** -``` -1. 呼叫外部 API 認證 -2. 獲取 access_token, id_token, user_info -3. 在資料庫中創建/更新用戶 (email) -4. 創建 session 記錄 (tokens, IP, user agent) -5. 生成內部 JWT (包含 user_id, session_id) -6. 返回內部 JWT 給前端 -``` - -#### 2. 任務管理端點 (`backend/app/routers/tasks.py`) - -**路由前綴**:`/api/v2/tasks` - -| 端點 | 方法 | 描述 | 認證 | -|------|------|------|------| -| `/` | POST | 創建新任務 | 需要 | -| `/` | GET | 列出用戶任務 (分頁/過濾) | 需要 | -| `/stats` | GET | 獲取任務統計 | 需要 | -| `/{task_id}` | GET | 獲取任務詳情 | 需要 | -| `/{task_id}` | PATCH | 更新任務 | 需要 | -| `/{task_id}` | DELETE | 刪除任務 | 需要 | - -**查詢參數:** -- `status`: pending/processing/completed/failed -- `page`: 頁碼 (從 1 開始) -- `page_size`: 每頁筆數 (max 100) -- `order_by`: 排序欄位 (created_at/updated_at/completed_at) -- `order_desc`: 降序排列 - -#### 3. Schema 定義 - -**認證** (`backend/app/schemas/auth.py`): -- `LoginRequest`: username, password -- `Token`: access_token, token_type, expires_in, user (V2) -- `UserInfo`: id, email, display_name -- `UserResponse`: 完整用戶資訊 -- `TokenData`: JWT payload 結構 - -**任務** (`backend/app/schemas/task.py`): -- `TaskCreate`: filename, file_type -- `TaskUpdate`: status, error_message, paths... -- `TaskResponse`: 任務基本資訊 -- `TaskDetailResponse`: 任務 + 文件列表 -- `TaskListResponse`: 分頁結果 -- `TaskStatsResponse`: 統計數據 - ---- - -### Phase 7: JWT 驗證依賴 ✅ - -#### 更新 `backend/app/core/deps.py` - -**新增 V2 依賴:** -```python -def get_current_user_v2(credentials, db) -> UserV2: - # 1. 解析 JWT token - # 2. 從資料庫查詢用戶 (tool_ocr_users) - # 3. 檢查用戶是否活躍 - # 4. 驗證 session (如果有 session_id) - # 5. 檢查 session 是否過期 - # 6. 更新 last_accessed_at - # 7. 
返回用戶對象 - -def get_current_active_user_v2(current_user) -> UserV2: - # 確保用戶處於活躍狀態 -``` - -**安全檢查:** -- JWT 簽名驗證 -- 用戶存在性檢查 -- 用戶活躍狀態檢查 -- Session 有效性檢查 -- Session 過期時間檢查 - ---- - -### Phase 8: 路由註冊 ✅ - -#### 更新 `backend/app/main.py` - -```python -# Legacy V1 routers (保留向後兼容) -from app.routers import auth, ocr, export, translation - -# V2 routers (新外部認證系統) -from app.routers import auth_v2, tasks - -app.include_router(auth.router) # V1: /api/v1/auth -app.include_router(ocr.router) # V1: /api/v1/ocr -app.include_router(export.router) # V1: /api/v1/export -app.include_router(translation.router) # V1: /api/v1/translation - -app.include_router(auth_v2.router) # V2: /api/v2/auth -app.include_router(tasks.router) # V2: /api/v2/tasks -``` - -**版本策略:** -- V1 API 保持不變 (向後兼容) -- V2 API 使用新認證系統 -- 前端可逐步遷移 - ---- - -## 🔐 安全特性 - -### 1. 用戶隔離 -- ✅ 所有任務查詢強制 `user_id` 過濾 -- ✅ 用戶 A 無法訪問用戶 B 的任務 -- ✅ Row-level security 在服務層實施 -- ✅ 外鍵 CASCADE 刪除保證資料一致性 - -### 2. Session 管理 -- ✅ 追蹤 IP 位址和 User Agent -- ✅ 自動過期檢查 -- ✅ 最後訪問時間更新 -- ⚠️ Token 加密待實作 (目前明文儲存) - -### 3. 認證流程 -- ✅ 外部 API 認證 (Azure AD) -- ✅ 內部 JWT 生成 (包含 user_id + session_id) -- ✅ 雙重驗證 (JWT + session 檢查) -- ✅ 錯誤重試機制 (3 次,指數退避) - -### 4. 資料庫安全 -- ✅ 資料表前綴命名空間隔離 (`tool_ocr_`) -- ✅ 索引優化 (email, task_id, status, created_at) -- ✅ 外鍵約束確保參照完整性 -- ✅ 軟刪除支援 (file_deleted flag) - ---- - -## 📊 資料庫架構 - -### 資料表關係圖: -``` -tool_ocr_users (1) - ├── tool_ocr_sessions (N) [FK: user_id, CASCADE] - └── tool_ocr_tasks (N) [FK: user_id, CASCADE] - └── tool_ocr_task_files (N) [FK: task_id, CASCADE] -``` - -### 索引策略: -```sql --- 用戶表 -CREATE INDEX ix_tool_ocr_users_email ON tool_ocr_users(email); -- 登入查詢 -CREATE INDEX ix_tool_ocr_users_is_active ON tool_ocr_users(is_active); - --- Session 表 -CREATE INDEX ix_tool_ocr_sessions_user_id ON tool_ocr_sessions(user_id); -CREATE INDEX ix_tool_ocr_sessions_expires_at ON tool_ocr_sessions(expires_at); -- 過期檢查 -CREATE INDEX ix_tool_ocr_sessions_created_at ON tool_ocr_sessions(created_at); - --- 任務表 -CREATE UNIQUE INDEX ix_tool_ocr_tasks_task_id ON tool_ocr_tasks(task_id); -- UUID 查詢 -CREATE INDEX ix_tool_ocr_tasks_user_id ON tool_ocr_tasks(user_id); -- 用戶查詢 -CREATE INDEX ix_tool_ocr_tasks_status ON tool_ocr_tasks(status); -- 狀態過濾 -CREATE INDEX ix_tool_ocr_tasks_created_at ON tool_ocr_tasks(created_at); -- 排序 -CREATE INDEX ix_tool_ocr_tasks_filename ON tool_ocr_tasks(filename); -- 搜尋 - --- 任務文件表 -CREATE INDEX ix_tool_ocr_task_files_task_id ON tool_ocr_task_files(task_id); -CREATE INDEX ix_tool_ocr_task_files_file_hash ON tool_ocr_task_files(file_hash); -- 去重 -``` - ---- - -## 🧪 測試端點 (Swagger UI) - -### 訪問 API 文檔: -``` -http://localhost:8000/docs -``` - -### 測試流程: - -#### 1. 登入測試 -```bash -POST /api/v2/auth/login -Content-Type: application/json - -{ - "username": "user@example.com", - "password": "your_password" -} - -# 成功回應: -{ - "access_token": "eyJhbGc...", - "token_type": "bearer", - "expires_in": 86400, - "user": { - "id": 1, - "email": "user@example.com", - "display_name": "User Name" - } -} -``` - -#### 2. 獲取當前用戶 -```bash -GET /api/v2/auth/me -Authorization: Bearer eyJhbGc... - -# 回應: -{ - "id": 1, - "email": "user@example.com", - "display_name": "User Name", - "created_at": "2025-11-14T16:00:00", - "last_login": "2025-11-14T16:30:00", - "is_active": true -} -``` - -#### 3. 創建任務 -```bash -POST /api/v2/tasks/ -Authorization: Bearer eyJhbGc... 
-Content-Type: application/json - -{ - "filename": "document.pdf", - "file_type": "application/pdf" -} - -# 回應: -{ - "id": 1, - "user_id": 1, - "task_id": "550e8400-e29b-41d4-a716-446655440000", - "filename": "document.pdf", - "file_type": "application/pdf", - "status": "pending", - "created_at": "2025-11-14T16:35:00", - ... -} -``` - -#### 4. 列出任務 -```bash -GET /api/v2/tasks/?status=completed&page=1&page_size=10 -Authorization: Bearer eyJhbGc... - -# 回應: -{ - "tasks": [...], - "total": 25, - "page": 1, - "page_size": 10, - "has_more": true -} -``` - -#### 5. 獲取統計 -```bash -GET /api/v2/tasks/stats -Authorization: Bearer eyJhbGc... - -# 回應: -{ - "total": 25, - "pending": 3, - "processing": 2, - "completed": 18, - "failed": 2 -} -``` - ---- - -## ⚠️ 待實作項目 - -### 高優先級 (阻塞性): -1. **Token 加密** - Session 表中的 tokens 目前明文儲存 - - 需要:AES-256 加密 - - 位置:`backend/app/routers/auth_v2.py` login endpoint - -2. **完整 JWT 驗證** - 目前僅解碼,未驗證簽名 - - 需要:Azure AD 公鑰驗證 - - 位置:`backend/app/services/external_auth_service.py` - -3. **前端實作** - Phase 9-11 - - 認證服務 (token 管理) - - 任務歷史 UI 頁面 - - API 整合 - -### 中優先級 (功能性): -4. **Token 刷新機制** - 自動刷新即將過期的 token -5. **檔案上傳整合** - 將 OCR 服務與新任務系統整合 -6. **任務通知** - 任務完成時通知用戶 -7. **錯誤追蹤** - 詳細的錯誤日誌和監控 - -### 低優先級 (優化): -8. **效能測試** - 大量任務的查詢效能 -9. **快取層** - Redis 快取用戶 session -10. **API 速率限制** - 防止濫用 -11. **文檔生成** - 自動生成 API 文檔 - ---- - -## 📝 遷移指南 (前端開發者) - -### 1. 更新登入流程 - -**舊 V1 方式:** -```typescript -// V1: Local authentication -const response = await fetch('/api/v1/auth/login', { - method: 'POST', - body: JSON.stringify({ username, password }) -}); -const { access_token } = await response.json(); -``` - -**新 V2 方式:** -```typescript -// V2: External Azure AD authentication -const response = await fetch('/api/v2/auth/login', { - method: 'POST', - body: JSON.stringify({ username, password }) // Same interface! -}); -const { access_token, user } = await response.json(); - -// Store token and user info -localStorage.setItem('token', access_token); -localStorage.setItem('user', JSON.stringify(user)); -``` - -### 2. 使用新的任務 API - -```typescript -// 獲取任務列表 -const response = await fetch('/api/v2/tasks/?page=1&page_size=20', { - headers: { - 'Authorization': `Bearer ${token}` - } -}); -const { tasks, total, has_more } = await response.json(); - -// 獲取統計 -const statsResponse = await fetch('/api/v2/tasks/stats', { - headers: { 'Authorization': `Bearer ${token}` } -}); -const stats = await statsResponse.json(); -// { total: 25, pending: 3, processing: 2, completed: 18, failed: 2 } -``` - -### 3. 
處理認證錯誤 - -```typescript -const response = await fetch('/api/v2/tasks/', { - headers: { 'Authorization': `Bearer ${token}` } -}); - -if (response.status === 401) { - // Token 過期或無效,重新登入 - if (data.detail === "Session expired, please login again") { - // 清除本地 token,導向登入頁 - localStorage.removeItem('token'); - window.location.href = '/login'; - } -} -``` - ---- - -## 🔍 除錯與監控 - -### 日誌位置: -``` -./logs/app.log -``` - -### 重要日誌事件: -- `Authentication successful for user: {email}` - 登入成功 -- `Created session {id} for user {email}` - Session 創建 -- `Authenticated user: {email} (ID: {id})` - JWT 驗證成功 -- `Expired session {id} for user {email}` - Session 過期 -- `Created task {task_id} for user {email}` - 任務創建 - -### 資料庫查詢: -```sql --- 檢查用戶 -SELECT * FROM tool_ocr_users WHERE email = 'user@example.com'; - --- 檢查 sessions -SELECT * FROM tool_ocr_sessions WHERE user_id = 1 ORDER BY created_at DESC; - --- 檢查任務 -SELECT * FROM tool_ocr_tasks WHERE user_id = 1 ORDER BY created_at DESC LIMIT 10; - --- 統計 -SELECT status, COUNT(*) FROM tool_ocr_tasks WHERE user_id = 1 GROUP BY status; -``` - ---- - -## ✅ 總結 - -### 已完成: -- ✅ 完整的資料庫架構設計 (4 個新表) -- ✅ 外部 API 認證服務整合 -- ✅ 用戶 Session 管理系統 -- ✅ 任務管理服務 (CRUD + 隔離) -- ✅ RESTful API 端點 (認證 + 任務) -- ✅ JWT 驗證依賴項 -- ✅ 資料庫遷移腳本 -- ✅ API Schema 定義 - -### 待繼續: -- ⏳ 前端認證服務 -- ⏳ 前端任務歷史 UI -- ⏳ 整合測試 -- ⏳ 文檔更新 - -### 技術債務: -- ⚠️ Token 加密 (高優先級) -- ⚠️ 完整 JWT 驗證 (高優先級) -- ⚠️ Token 刷新機制 - ---- - -**實作完成日期**:2025-11-14 -**實作人員**:Claude Code -**審核狀態**:待用戶測試與審核 diff --git a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/PROGRESS_UPDATE.md b/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/PROGRESS_UPDATE.md deleted file mode 100644 index cbd46f4..0000000 --- a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/PROGRESS_UPDATE.md +++ /dev/null @@ -1,304 +0,0 @@ -# Migration Progress Update - 2025-11-14 - -## 概述 -外部 Azure AD 認證遷移的核心功能已完成 **80%**。所有後端 API 和主要前端功能均已實作並可運行。 - ---- - -## ✅ 已完成功能 (Completed) - -### 1. 數據庫架構重設計 ✅ **100% 完成** -- ✅ 1.3 使用 `tool_ocr_` 前綴創建新數據庫架構 -- ✅ 1.4 創建 SQLAlchemy 模型 - - `backend/app/models/user_v2.py` - 用戶模型(email 作為主鍵) - - `backend/app/models/task.py` - 任務模型(含用戶隔離) - - `backend/app/models/session.py` - 會話管理模型 - - `backend/app/models/audit_log.py` - 審計日誌模型 -- ✅ 1.5 生成 Alembic 遷移腳本 - - `5e75a59fb763_add_external_auth_schema_with_task_.py` - -### 2. 配置管理 ✅ **100% 完成** -- ✅ 2.1 更新環境配置 - - 添加 `EXTERNAL_AUTH_API_URL` - - 添加 `EXTERNAL_AUTH_ENDPOINT` - - 添加 `TOKEN_REFRESH_BUFFER` - - 添加任務管理相關設定 -- ✅ 2.2 更新 Settings 類 - - `backend/app/core/config.py` 已更新所有新配置 - -### 3. 外部 API 集成服務 ✅ **100% 完成** -- ✅ 3.1-3.3 創建認證 API 客戶端 - - `backend/app/services/external_auth_service.py` - - 實作 `authenticate_user()`, `is_token_expiring_soon()` - - 包含重試邏輯和超時處理 - -### 4. 後端認證更新 ✅ **100% 完成** -- ✅ 4.1 修改登錄端點 - - `backend/app/routers/auth_v2.py` - - 完整的外部 API 認證流程 - - 用戶自動創建/更新 -- ✅ 4.2-4.3 更新 Token 驗證 - - `backend/app/core/deps.py` - - `get_current_user_v2()` 依賴注入 - - `get_current_admin_user_v2()` 管理員權限檢查 - -### 5. 會話和 Token 管理 ✅ **100% 完成** -- ✅ 5.1 實作 Token 存儲 - - 存儲於 `tool_ocr_sessions` 表 - - 記錄 IP 地址、User-Agent、過期時間 -- ✅ 5.2 創建 Token 刷新機制 - - **前端**: 自動在過期前 5 分鐘刷新 - - **後端**: `POST /api/v2/auth/refresh` 端點 - - **功能**: 自動重試 401 錯誤 -- ✅ 5.3 會話失效 - - `POST /api/v2/auth/logout` 支持單個/全部會話登出 - -### 6. 
前端更新 ✅ **90% 完成** -- ✅ 6.1 更新認證服務 - - `frontend/src/services/apiV2.ts` - 完整 V2 API 客戶端 - - 自動 Token 刷新和重試機制 -- ✅ 6.2 更新認證 Store - - `frontend/src/store/authStore.ts` 存儲用戶信息 -- ✅ 6.3 更新 UI 組件 - - `frontend/src/pages/LoginPage.tsx` 整合 V2 登錄 - - `frontend/src/components/Layout.tsx` 顯示用戶名稱和登出 -- ✅ 6.4 錯誤處理 - - 完整的錯誤顯示和重試邏輯 - -### 7. 任務管理系統 ✅ **100% 完成** -- ✅ 7.1 創建任務管理後端 - - `backend/app/services/task_service.py` - - 完整的 CRUD 操作和用戶隔離 -- ✅ 7.2 實作任務 API - - `backend/app/routers/tasks.py` - - `GET /api/v2/tasks` - 任務列表(含分頁) - - `GET /api/v2/tasks/{id}` - 任務詳情 - - `DELETE /api/v2/tasks/{id}` - 刪除任務 - - `POST /api/v2/tasks/{id}/start` - 開始任務 - - `POST /api/v2/tasks/{id}/cancel` - 取消任務 - - `POST /api/v2/tasks/{id}/retry` - 重試任務 -- ✅ 7.3 創建任務歷史端點 - - `GET /api/v2/tasks/stats` - 用戶統計 - - 支持狀態、檔名、日期範圍篩選 -- ✅ 7.4 實作檔案訪問控制 - - `backend/app/services/file_access_service.py` - - 驗證用戶所有權 - - 檢查任務狀態和檔案存在性 -- ✅ 7.5 檔案下載功能 - - `GET /api/v2/tasks/{id}/download/json` - - `GET /api/v2/tasks/{id}/download/markdown` - - `GET /api/v2/tasks/{id}/download/pdf` - -### 8. 前端任務管理 UI ✅ **100% 完成** -- ✅ 8.1 創建任務歷史頁面 - - `frontend/src/pages/TaskHistoryPage.tsx` - - 完整的任務列表和狀態指示器 - - 分頁控制 -- ✅ 8.3 創建篩選組件 - - 狀態篩選下拉選單 - - 檔名搜尋輸入框 - - 日期範圍選擇器(開始/結束) - - 清除篩選按鈕 -- ✅ 8.4-8.5 任務管理服務 - - `frontend/src/services/apiV2.ts` 整合所有任務 API - - 完整的錯誤處理和重試邏輯 -- ✅ 8.6 更新導航 - - `frontend/src/App.tsx` 添加 `/tasks` 路由 - - `frontend/src/components/Layout.tsx` 添加"任務歷史"選單 - -### 9. 用戶隔離和安全 ✅ **100% 完成** -- ✅ 9.1-9.2 用戶上下文和查詢隔離 - - 所有任務查詢自動過濾 `user_id` - - 嚴格的用戶所有權驗證 -- ✅ 9.3 檔案系統隔離 - - 下載前驗證檔案路徑 - - 檢查用戶所有權 -- ✅ 9.4 API 授權 - - 所有 V2 端點使用 `get_current_user_v2` 依賴 - - 403 錯誤處理未授權訪問 - -### 10. 管理員功能 ✅ **100% 完成(後端)** -- ✅ 10.1 管理員權限系統 - - `backend/app/services/admin_service.py` - - 管理員郵箱: `ymirliu@panjit.com.tw` - - `get_current_admin_user_v2()` 依賴注入 -- ✅ 10.2 系統統計 API - - `GET /api/v2/admin/stats` - 系統總覽統計 - - `GET /api/v2/admin/users` - 用戶列表(含統計) - - `GET /api/v2/admin/users/top` - 用戶排行榜 -- ✅ 10.3 審計日誌系統 - - `backend/app/models/audit_log.py` - 審計日誌模型 - - `backend/app/services/audit_service.py` - 審計服務 - - `GET /api/v2/admin/audit-logs` - 審計日誌查詢 - - `GET /api/v2/admin/audit-logs/user/{id}/summary` - 用戶活動摘要 -- ✅ 10.4 管理員路由註冊 - - `backend/app/routers/admin.py` - - 已在 `backend/app/main.py` 中註冊 - ---- - -## 🚧 進行中 / 待完成 (In Progress / Pending) - -### 11. 數據庫遷移 ⚠️ **待執行** -- ⏳ 11.1 創建審計日誌表遷移 - - 需要: `alembic revision` 創建 `tool_ocr_audit_logs` 表 - - 表結構已在 `audit_log.py` 中定義 -- ⏳ 11.2 執行遷移 - - 運行 `alembic upgrade head` - -### 12. 前端管理員頁面 ⏳ **20% 完成** -- ⏳ 12.1 管理員儀表板頁面 - - 需要: `frontend/src/pages/AdminDashboardPage.tsx` - - 顯示系統統計(用戶、任務、會話、活動) - - 用戶列表和排行榜 -- ⏳ 12.2 審計日誌查看器 - - 需要: `frontend/src/pages/AuditLogsPage.tsx` - - 顯示審計日誌列表 - - 支持篩選(用戶、類別、日期範圍) - - 用戶活動摘要 -- ⏳ 12.3 管理員路由和導航 - - 更新 `App.tsx` 添加管理員路由 - - 在 `Layout.tsx` 中顯示管理員選單(僅管理員可見) - -### 13. 測試 ⏳ **未開始** -- 所有功能需要完整測試 -- 建議優先測試核心認證和任務管理流程 - -### 14. 文檔 ⏳ **部分完成** -- ✅ 已創建實作報告 -- ⏳ 需要更新 API 文檔 -- ⏳ 需要創建用戶使用指南 - ---- - -## 📊 完成度統計 - -| 模組 | 完成度 | 狀態 | -|------|--------|------| -| 數據庫架構 | 100% | ✅ 完成 | -| 配置管理 | 100% | ✅ 完成 | -| 外部 API 集成 | 100% | ✅ 完成 | -| 後端認證 | 100% | ✅ 完成 | -| Token 管理 | 100% | ✅ 完成 | -| 前端認證 | 90% | ✅ 基本完成 | -| 任務管理後端 | 100% | ✅ 完成 | -| 任務管理前端 | 100% | ✅ 完成 | -| 用戶隔離 | 100% | ✅ 完成 | -| 管理員功能(後端) | 100% | ✅ 完成 | -| 管理員功能(前端) | 20% | ⏳ 待開發 | -| 數據庫遷移 | 90% | ⚠️ 待執行 | -| 測試 | 0% | ⏳ 待開始 | -| 文檔 | 50% | ⏳ 進行中 | - -**總體完成度: 80%** - ---- - -## 🎯 核心成就 - -### 1. Token 自動刷新機制 🎉 -- **前端**: 自動在過期前 5 分鐘刷新,無縫體驗 -- **後端**: `/api/v2/auth/refresh` 端點 -- **錯誤處理**: 401 自動重試機制 - -### 2. 
完整的任務管理系統 🎉 -- **任務操作**: 開始/取消/重試/刪除 -- **任務篩選**: 狀態/檔名/日期範圍 -- **檔案下載**: JSON/Markdown/PDF 三種格式 -- **訪問控制**: 嚴格的用戶隔離和權限驗證 - -### 3. 管理員監控系統 🎉 -- **系統統計**: 用戶、任務、會話、活動統計 -- **用戶管理**: 用戶列表、排行榜 -- **審計日誌**: 完整的事件記錄和查詢系統 - -### 4. 安全性增強 🎉 -- **用戶隔離**: 所有查詢自動過濾用戶 ID -- **檔案訪問控制**: 驗證所有權和任務狀態 -- **審計追蹤**: 記錄所有重要操作 - ---- - -## 📝 重要檔案清單 - -### 後端新增檔案 -``` -backend/app/models/ -├── user_v2.py # 用戶模型(外部認證) -├── task.py # 任務模型 -├── session.py # 會話模型 -└── audit_log.py # 審計日誌模型 - -backend/app/services/ -├── external_auth_service.py # 外部認證服務 -├── task_service.py # 任務管理服務 -├── file_access_service.py # 檔案訪問控制 -├── admin_service.py # 管理員服務 -└── audit_service.py # 審計日誌服務 - -backend/app/routers/ -├── auth_v2.py # V2 認證路由 -├── tasks.py # 任務管理路由 -└── admin.py # 管理員路由 - -backend/alembic/versions/ -└── 5e75a59fb763_add_external_auth_schema_with_task_.py -``` - -### 前端新增/修改檔案 -``` -frontend/src/services/ -└── apiV2.ts # 完整 V2 API 客戶端 - -frontend/src/pages/ -├── LoginPage.tsx # 整合 V2 登錄 -└── TaskHistoryPage.tsx # 任務歷史頁面 - -frontend/src/components/ -└── Layout.tsx # 導航和用戶資訊 - -frontend/src/types/ -└── apiV2.ts # V2 類型定義 -``` - ---- - -## 🚀 下一步行動 - -### 立即執行 -1. ✅ **提交當前進度** - 所有核心功能已實作 -2. **執行數據庫遷移** - 運行 Alembic 遷移添加 audit_logs 表 -3. **系統測試** - 測試認證流程和任務管理功能 - -### 可選增強 -1. **前端管理員頁面** - 管理員儀表板和審計日誌查看器 -2. **完整測試套件** - 單元測試和集成測試 -3. **性能優化** - 查詢優化和緩存策略 - ---- - -## 🔒 安全注意事項 - -### 已實作 -- ✅ 用戶隔離(Row-level security) -- ✅ 檔案訪問控制 -- ✅ Token 過期檢查 -- ✅ 管理員權限驗證 -- ✅ 審計日誌記錄 - -### 待實作(可選) -- ⏳ Token 加密存儲 -- ⏳ 速率限制 -- ⏳ CSRF 保護增強 - ---- - -## 📞 聯繫資訊 - -**管理員郵箱**: ymirliu@panjit.com.tw -**外部認證 API**: https://pj-auth-api.vercel.app - ---- - -*最後更新: 2025-11-14* -*實作者: Claude Code* diff --git a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/database_schema.sql b/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/database_schema.sql deleted file mode 100644 index 13a58e6..0000000 --- a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/database_schema.sql +++ /dev/null @@ -1,183 +0,0 @@ --- Tool_OCR Database Schema with External API Authentication --- Version: 2.0.0 --- Date: 2025-11-14 --- Description: Complete database redesign with user task isolation and history - --- ============================================ --- Drop existing tables (if needed) --- ============================================ --- Uncomment these lines to drop existing tables --- DROP TABLE IF EXISTS tool_ocr_sessions; --- DROP TABLE IF EXISTS tool_ocr_task_files; --- DROP TABLE IF EXISTS tool_ocr_tasks; --- DROP TABLE IF EXISTS tool_ocr_users; - --- ============================================ --- 1. Users Table --- ============================================ -CREATE TABLE IF NOT EXISTS tool_ocr_users ( - id INT PRIMARY KEY AUTO_INCREMENT, - email VARCHAR(255) UNIQUE NOT NULL COMMENT 'Primary identifier from Azure AD', - display_name VARCHAR(255) COMMENT 'Display name from API response', - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - last_login TIMESTAMP NULL, - is_active BOOLEAN DEFAULT TRUE, - INDEX idx_email (email), - INDEX idx_active (is_active) -) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci -COMMENT='User accounts authenticated via external API'; - --- ============================================ --- 2. 
OCR Tasks Table --- ============================================ -CREATE TABLE IF NOT EXISTS tool_ocr_tasks ( - id INT PRIMARY KEY AUTO_INCREMENT, - user_id INT NOT NULL COMMENT 'Foreign key to users table', - task_id VARCHAR(255) UNIQUE NOT NULL COMMENT 'Unique task identifier (UUID)', - filename VARCHAR(255), - file_type VARCHAR(50), - status ENUM('pending', 'processing', 'completed', 'failed') DEFAULT 'pending', - result_json_path VARCHAR(500) COMMENT 'Path to JSON result file', - result_markdown_path VARCHAR(500) COMMENT 'Path to Markdown result file', - result_pdf_path VARCHAR(500) COMMENT 'Path to searchable PDF file', - error_message TEXT COMMENT 'Error details if task failed', - processing_time_ms INT COMMENT 'Processing time in milliseconds', - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, - completed_at TIMESTAMP NULL, - file_deleted BOOLEAN DEFAULT FALSE COMMENT 'Track if files were auto-deleted', - FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id) ON DELETE CASCADE, - INDEX idx_user_status (user_id, status), - INDEX idx_created (created_at), - INDEX idx_task_id (task_id), - INDEX idx_filename (filename) -) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci -COMMENT='OCR processing tasks with user association'; - --- ============================================ --- 3. Task Files Table --- ============================================ -CREATE TABLE IF NOT EXISTS tool_ocr_task_files ( - id INT PRIMARY KEY AUTO_INCREMENT, - task_id INT NOT NULL COMMENT 'Foreign key to tasks table', - original_name VARCHAR(255), - stored_path VARCHAR(500) COMMENT 'Actual file path on server', - file_size BIGINT COMMENT 'File size in bytes', - mime_type VARCHAR(100), - file_hash VARCHAR(64) COMMENT 'SHA256 hash for deduplication', - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - FOREIGN KEY (task_id) REFERENCES tool_ocr_tasks(id) ON DELETE CASCADE, - INDEX idx_task (task_id), - INDEX idx_hash (file_hash) -) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci -COMMENT='Files associated with OCR tasks'; - --- ============================================ --- 4. Sessions Table (Token Storage) --- ============================================ -CREATE TABLE IF NOT EXISTS tool_ocr_sessions ( - id INT PRIMARY KEY AUTO_INCREMENT, - user_id INT NOT NULL COMMENT 'Foreign key to users table', - session_id VARCHAR(255) UNIQUE NOT NULL COMMENT 'Unique session identifier', - access_token TEXT COMMENT 'Azure AD access token (encrypted)', - id_token TEXT COMMENT 'Azure AD ID token (encrypted)', - refresh_token TEXT COMMENT 'Refresh token if available', - expires_at TIMESTAMP NOT NULL COMMENT 'Token expiration time', - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - last_accessed TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, - is_active BOOLEAN DEFAULT TRUE, - ip_address VARCHAR(45) COMMENT 'Client IP address', - user_agent TEXT COMMENT 'Client user agent', - FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id) ON DELETE CASCADE, - INDEX idx_user (user_id), - INDEX idx_session (session_id), - INDEX idx_expires (expires_at), - INDEX idx_active (is_active) -) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci -COMMENT='User session and token management'; - --- ============================================ --- 5. 
Audit Log Table (Optional) --- ============================================ -CREATE TABLE IF NOT EXISTS tool_ocr_audit_logs ( - id BIGINT PRIMARY KEY AUTO_INCREMENT, - user_id INT COMMENT 'User who performed the action', - action VARCHAR(100) NOT NULL COMMENT 'Action performed', - entity_type VARCHAR(50) COMMENT 'Type of entity affected', - entity_id INT COMMENT 'ID of entity affected', - details JSON COMMENT 'Additional details in JSON format', - ip_address VARCHAR(45), - user_agent TEXT, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - INDEX idx_user (user_id), - INDEX idx_action (action), - INDEX idx_created (created_at) -) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci -COMMENT='Audit trail for all system actions'; - --- ============================================ --- Views for Common Queries --- ============================================ - --- User task statistics view -CREATE OR REPLACE VIEW tool_ocr_user_stats AS -SELECT - u.id as user_id, - u.email, - u.display_name, - COUNT(DISTINCT t.id) as total_tasks, - SUM(CASE WHEN t.status = 'completed' THEN 1 ELSE 0 END) as completed_tasks, - SUM(CASE WHEN t.status = 'failed' THEN 1 ELSE 0 END) as failed_tasks, - SUM(CASE WHEN t.status = 'processing' THEN 1 ELSE 0 END) as processing_tasks, - SUM(CASE WHEN t.status = 'pending' THEN 1 ELSE 0 END) as pending_tasks, - AVG(t.processing_time_ms) as avg_processing_time_ms, - MAX(t.created_at) as last_task_created -FROM tool_ocr_users u -LEFT JOIN tool_ocr_tasks t ON u.id = t.user_id -GROUP BY u.id, u.email, u.display_name; - --- Recent tasks view -CREATE OR REPLACE VIEW tool_ocr_recent_tasks AS -SELECT - t.*, - u.email as user_email, - u.display_name as user_name -FROM tool_ocr_tasks t -INNER JOIN tool_ocr_users u ON t.user_id = u.id -ORDER BY t.created_at DESC -LIMIT 100; - --- ============================================ --- Stored Procedures (Optional) --- ============================================ - -DELIMITER $$ - --- Procedure to clean up expired sessions -CREATE PROCEDURE IF NOT EXISTS cleanup_expired_sessions() -BEGIN - DELETE FROM tool_ocr_sessions - WHERE expires_at < NOW() OR is_active = FALSE; -END$$ - --- Procedure to clean up old tasks -CREATE PROCEDURE IF NOT EXISTS cleanup_old_tasks(IN days_to_keep INT) -BEGIN - UPDATE tool_ocr_tasks - SET file_deleted = TRUE - WHERE created_at < DATE_SUB(NOW(), INTERVAL days_to_keep DAY) - AND status IN ('completed', 'failed'); -END$$ - -DELIMITER ; - --- ============================================ --- Initial Data (Optional) --- ============================================ --- Add any initial data here if needed - --- ============================================ --- Grants (Adjust as needed) --- ============================================ --- GRANT ALL PRIVILEGES ON tool_ocr_* TO 'tool_ocr_user'@'localhost'; --- FLUSH PRIVILEGES; \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/proposal.md b/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/proposal.md deleted file mode 100644 index 21bcccb..0000000 --- a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/proposal.md +++ /dev/null @@ -1,294 +0,0 @@ -# Change: Migrate to External API Authentication - -## Why - -The current local database authentication system has several limitations: -- User credentials are managed locally, requiring manual user creation and password management -- No centralized authentication with enterprise identity systems -- Cannot 
leverage existing enterprise authentication infrastructure (e.g., Microsoft Azure AD) -- No single sign-on (SSO) capability -- Increased maintenance overhead for user management - -By migrating to the external API authentication service at https://pj-auth-api.vercel.app, the system will: -- Integrate with enterprise Microsoft Azure AD authentication -- Enable single sign-on (SSO) for users -- Eliminate local password management -- Leverage existing enterprise user management and security policies -- Reduce maintenance overhead -- Provide consistent authentication across multiple applications - -## What Changes - -### Authentication Flow -- **Current**: Local database authentication using username/password stored in MySQL -- **New**: External API authentication via POST to `https://pj-auth-api.vercel.app/api/auth/login` -- **Token Management**: Use JWT tokens from external API instead of locally generated tokens -- **User Display**: Use `name` field from API response for user display instead of local username - -### API Integration -**Endpoint**: `POST https://pj-auth-api.vercel.app/api/auth/login` - -**Request Format**: -```json -{ - "username": "user@domain.com", - "password": "user_password" -} -``` - -**Success Response (200)**: -```json -{ - "success": true, - "message": "認證成功", - "data": { - "access_token": "eyJ0eXAiOiJKV1Q...", - "id_token": "eyJ0eXAiOiJKV1Q...", - "expires_in": 4999, - "token_type": "Bearer", - "userInfo": { - "id": "42cf0b98-f598-47dd-ae2a-f33803f87d41", - "name": "ymirliu 劉念萱", - "email": "ymirliu@panjit.com.tw", - "jobTitle": null, - "officeLocation": "高雄", - "businessPhones": ["1580"] - }, - "issuedAt": "2025-11-14T07:09:15.203Z", - "expiresAt": "2025-11-14T08:32:34.203Z" - }, - "timestamp": "2025-11-14T07:09:15.203Z" -} -``` - -**Failure Response (401)**: -```json -{ - "success": false, - "error": "用戶名或密碼錯誤", - "code": "INVALID_CREDENTIALS", - "timestamp": "2025-11-14T07:10:02.585Z" -} -``` - -### Database Schema Changes - -**Complete Redesign (No backward compatibility needed)**: - -**Table Prefix**: `tool_ocr_` (for clear separation from other systems in the same database) - -1. **tool_ocr_users table (redesigned)**: - ```sql - CREATE TABLE tool_ocr_users ( - id INT PRIMARY KEY AUTO_INCREMENT, - email VARCHAR(255) UNIQUE NOT NULL, -- Primary identifier from Azure AD - display_name VARCHAR(255), -- Display name from API response - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - last_login TIMESTAMP, - is_active BOOLEAN DEFAULT TRUE - ); - ``` - Note: No Azure AD ID storage needed - email is sufficient as unique identifier - -2. **tool_ocr_tasks table (new - for task history)**: - ```sql - CREATE TABLE tool_ocr_tasks ( - id INT PRIMARY KEY AUTO_INCREMENT, - user_id INT NOT NULL, -- Foreign key to users table - task_id VARCHAR(255) UNIQUE, -- Unique task identifier - filename VARCHAR(255), - file_type VARCHAR(50), - status ENUM('pending', 'processing', 'completed', 'failed'), - result_json_path VARCHAR(500), - result_markdown_path VARCHAR(500), - error_message TEXT, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, - completed_at TIMESTAMP NULL, - file_deleted BOOLEAN DEFAULT FALSE, -- Track if files were auto-deleted - FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id), - INDEX idx_user_status (user_id, status), - INDEX idx_created (created_at) - ); - ``` - -3. 
**tool_ocr_task_files table (for multiple files per task)**: - ```sql - CREATE TABLE tool_ocr_task_files ( - id INT PRIMARY KEY AUTO_INCREMENT, - task_id INT NOT NULL, - original_name VARCHAR(255), - stored_path VARCHAR(500), - file_size BIGINT, - mime_type VARCHAR(100), - FOREIGN KEY (task_id) REFERENCES tool_ocr_tasks(id) ON DELETE CASCADE - ); - ``` - -4. **tool_ocr_sessions table (for token management)**: - ```sql - CREATE TABLE tool_ocr_sessions ( - id INT PRIMARY KEY AUTO_INCREMENT, - user_id INT NOT NULL, - access_token TEXT, - id_token TEXT, - expires_at TIMESTAMP, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id) ON DELETE CASCADE, - INDEX idx_user (user_id), - INDEX idx_expires (expires_at) - ); - ``` - -### Session Management -- Store external API tokens in session/cache instead of local JWT -- Implement token refresh mechanism based on `expires_in` field -- Use `expiresAt` timestamp for token expiration validation - -## New Features: User Task Isolation and History - -### Task Isolation -- **Principle**: Each user can only see and access their own tasks -- **Implementation**: All task queries filtered by `user_id` at API level -- **Security**: Enforce user context validation in all task-related endpoints - -### Task History Features -1. **Task Status Tracking**: - - View pending tasks (waiting to process) - - View processing tasks (currently running) - - View completed tasks (with results available) - - View failed tasks (with error messages) - -2. **Historical Query Capabilities**: - - Search tasks by filename - - Filter by date range - - Filter by status - - Sort by creation/completion time - - Pagination for large result sets - -3. **Task Management**: - - Download original files (if not auto-deleted) - - Download results (JSON, Markdown, PDF exports) - - Re-process failed tasks - - Delete old tasks manually - -### Frontend UI Changes -1. **New Components**: - - Task History page/tab - - Task filters and search bar - - Task status badges - - Batch action controls - -2. **Task List View**: - ``` - | Filename | Status | Created | Completed | Actions | - |----------|--------|---------|-----------|---------| - | doc1.pdf | ✅ Completed | 2025-11-14 10:00 | 2025-11-14 10:05 | [Download] [View] | - | doc2.pdf | 🔄 Processing | 2025-11-14 10:10 | - | [Cancel] | - | doc3.pdf | ❌ Failed | 2025-11-14 09:00 | - | [Retry] [View Error] | - ``` - -3. 
**User Information Display**: - - Show user display name in header - - Show last login time - - Show task statistics (total, completed, failed) - -## Impact - -### Affected Capabilities -- `authentication`: Complete replacement of authentication mechanism -- `user-management`: Simplified to read-only user information from external API -- `session-management`: Modified to handle external tokens -- `task-management`: NEW - User-specific task isolation and history -- `file-access-control`: NEW - User-based file access restrictions - -### Affected Code -- **Backend Authentication**: - - `backend/app/api/v1/endpoints/auth.py`: Replace login logic with external API call - - `backend/app/core/security.py`: Modify token validation to use external tokens - - `backend/app/core/auth.py`: Update authentication dependencies - - `backend/app/services/auth_service.py`: New service for external API integration - -- **Database Models**: - - `backend/app/models/user.py`: Complete redesign with new schema - - `backend/app/models/task.py`: NEW - Task model with user association - - `backend/app/models/task_file.py`: NEW - Task file model - - `backend/alembic/versions/`: Complete database recreation - -- **Task Management APIs** (NEW): - - `backend/app/api/v1/endpoints/tasks.py`: Task CRUD operations with user isolation - - `backend/app/api/v1/endpoints/task_history.py`: Historical query endpoints - - `backend/app/services/task_service.py`: Task business logic - - `backend/app/services/file_access_service.py`: User-based file access control - -- **Frontend**: - - `frontend/src/services/authService.ts`: Update to handle new token format - - `frontend/src/stores/authStore.ts`: Modify to store/display user info from API - - `frontend/src/components/Header.tsx`: Display `name` field and user menu - - `frontend/src/pages/TaskHistory.tsx`: NEW - Task history page - - `frontend/src/components/TaskList.tsx`: NEW - Task list component with filters - - `frontend/src/components/TaskFilters.tsx`: NEW - Search and filter UI - - `frontend/src/stores/taskStore.ts`: NEW - Task state management - - `frontend/src/services/taskService.ts`: NEW - Task API client - -### Dependencies -- Add `httpx` or `aiohttp` for async HTTP requests to external API (already present) -- No new package dependencies required - -### Configuration -- New environment variables: - - `EXTERNAL_AUTH_API_URL` = "https://pj-auth-api.vercel.app" - - `EXTERNAL_AUTH_ENDPOINT` = "/api/auth/login" - - `EXTERNAL_AUTH_TIMEOUT` = 30 (seconds) - - `TOKEN_REFRESH_BUFFER` = 300 (refresh tokens 5 minutes before expiry) - - `TASK_RETENTION_DAYS` = 30 (auto-delete old tasks) - - `MAX_TASKS_PER_USER` = 1000 (limit per user) - - `ENABLE_TASK_HISTORY` = true (enable history feature) - - `DATABASE_TABLE_PREFIX` = "tool_ocr_" (table naming prefix) - -### Security Considerations -- HTTPS required for all authentication requests -- Token storage must be secure (HTTPOnly cookies or secure session storage) -- Implement rate limiting for authentication attempts -- Log all authentication events for audit trail -- Validate SSL certificates for external API calls -- Handle network failures gracefully with appropriate error messages -- **User Isolation**: Enforce user context in all database queries -- **File Access Control**: Validate user ownership before file access -- **API Security**: Add user_id validation in all task-related endpoints - -### Migration Plan (Simplified - No Rollback Needed) -1. **Phase 1**: Backup existing database (for reference only) -2. 
**Phase 2**: Drop old tables and create new schema -3. **Phase 3**: Deploy new authentication and task management system -4. **Phase 4**: Test with initial users -5. **Phase 5**: Full deployment - -Note: Since this is a test system with no production data to preserve, we can perform a clean migration without rollback concerns. - -## Risks and Mitigations - -### Risks -1. **External API Unavailability**: Authentication service downtime blocks all logins - - *Mitigation*: Implement fallback to local auth, cache tokens, implement retry logic - -2. **Token Expiration Handling**: Users may be logged out unexpectedly - - *Mitigation*: Implement automatic token refresh before expiration - -3. **Network Latency**: Slower authentication due to external API calls - - *Mitigation*: Implement proper timeout handling, async requests, response caching - -4. **Data Consistency**: User information mismatch between local DB and external system - - *Mitigation*: Regular sync jobs, use external system as single source of truth - -5. **Breaking Change**: Existing sessions will be invalidated - - *Mitigation*: Provide migration window, clear communication to users - -## Success Criteria -- All users can authenticate via external API -- Authentication response time < 2 seconds (95th percentile) -- Zero data loss during migration -- Automatic token refresh works without user intervention -- Proper error messages for all failure scenarios -- Audit logs capture all authentication events -- Rollback procedure tested and documented \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/tasks.md b/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/tasks.md deleted file mode 100644 index 40528b6..0000000 --- a/openspec/changes/archive/2025-11-18-migrate-to-external-api-authentication/tasks.md +++ /dev/null @@ -1,276 +0,0 @@ -# Implementation Tasks - -## 1. Database Schema Redesign -- [ ] 1.1 Backup existing database (for reference) - - Export current schema and data - - Document any important data to preserve -- [ ] 1.2 Drop old tables - - Remove existing tables with old naming convention - - Clear database for fresh start -- [ ] 1.3 Create new database schema with `tool_ocr_` prefix - - Create new `tool_ocr_users` table (email as primary identifier) - - Create `tool_ocr_tasks` table with user association - - Create `tool_ocr_task_files` table for file tracking - - Create `tool_ocr_sessions` table for token storage - - Add proper indexes for performance -- [ ] 1.4 Create SQLAlchemy models - - User model (mapped to `tool_ocr_users`) - - Task model (mapped to `tool_ocr_tasks`) - - TaskFile model (mapped to `tool_ocr_task_files`) - - Session model (mapped to `tool_ocr_sessions`) - - Configure table prefix in base model -- [ ] 1.5 Generate Alembic migration - - Create initial migration for new schema - - Test migration script with proper table prefixes - -## 2. 
Configuration Management -- [ ] 2.1 Update environment configuration - - Add `EXTERNAL_AUTH_API_URL` to `.env.local` - - Add `EXTERNAL_AUTH_ENDPOINT` configuration - - Add `EXTERNAL_AUTH_TIMEOUT` setting - - Add `TOKEN_REFRESH_BUFFER` setting - - Add `TASK_RETENTION_DAYS` for auto-cleanup - - Add `MAX_TASKS_PER_USER` for limits - - Add `ENABLE_TASK_HISTORY` feature flag - - Add `DATABASE_TABLE_PREFIX` = "tool_ocr_" -- [ ] 2.2 Update Settings class - - Add external auth settings to `backend/app/core/config.py` - - Add task management settings - - Add database table prefix configuration - - Add validation for new configuration values - - Remove old authentication settings - -## 3. External API Integration Service -- [ ] 3.1 Create auth API client - - Implement `backend/app/services/external_auth_service.py` - - Create async HTTP client for API calls - - Implement request/response models - - Add proper error handling and logging -- [ ] 3.2 Implement authentication methods - - `authenticate_user()` - Call external API - - `validate_token()` - Verify token validity - - `refresh_token()` - Handle token refresh - - `get_user_info()` - Fetch user details -- [ ] 3.3 Add resilience patterns - - Implement retry logic with exponential backoff - - Add circuit breaker pattern - - Implement timeout handling - - Add fallback mechanisms - -## 4. Backend Authentication Updates -- [ ] 4.1 Modify login endpoint - - Update `backend/app/api/v1/endpoints/auth.py` - - Route to external API based on feature flag - - Handle both authentication modes during transition - - Return appropriate token format -- [ ] 4.2 Update token validation - - Modify `backend/app/core/security.py` - - Support both local and external tokens - - Implement token type detection - - Update JWT validation logic -- [ ] 4.3 Update authentication dependencies - - Modify `backend/app/core/auth.py` - - Update `get_current_user()` dependency - - Handle external user information - - Implement proper user context - -## 5. Session and Token Management -- [ ] 5.1 Implement token storage - - Store external tokens securely - - Implement token encryption at rest - - Handle multiple token types (access, ID, refresh) -- [ ] 5.2 Create token refresh mechanism - - Background task for token refresh - - Refresh tokens before expiration - - Update stored tokens atomically - - Handle refresh failures gracefully -- [ ] 5.3 Session invalidation - - Clear tokens on logout - - Handle token revocation - - Implement session timeout - -## 6. Frontend Updates -- [ ] 6.1 Update authentication service - - Modify `frontend/src/services/authService.ts` - - Handle new token format - - Store user display information - - Implement token refresh on client side -- [ ] 6.2 Update auth store - - Modify `frontend/src/stores/authStore.ts` - - Store external user information - - Update user display logic - - Handle token expiration -- [ ] 6.3 Update UI components - - Modify `frontend/src/components/Header.tsx` - - Display user `name` instead of username - - Show additional user information - - Update login form if needed -- [ ] 6.4 Error handling - - Handle external API errors - - Display appropriate error messages - - Implement retry UI for failures - - Add loading states - -## 7. 
Task Management System (NEW) -- [ ] 7.1 Create task management backend - - Implement `backend/app/models/task.py` - - Implement `backend/app/models/task_file.py` - - Create `backend/app/services/task_service.py` - - Add task CRUD operations with user isolation -- [ ] 7.2 Implement task APIs - - Create `backend/app/api/v1/endpoints/tasks.py` - - GET /tasks (list user's tasks with pagination) - - GET /tasks/{id} (get specific task) - - DELETE /tasks/{id} (delete task) - - POST /tasks/{id}/retry (retry failed task) -- [ ] 7.3 Create task history endpoints - - Create `backend/app/api/v1/endpoints/task_history.py` - - GET /history (query with filters) - - GET /history/stats (user statistics) - - POST /history/export (export history) -- [ ] 7.4 Implement file access control - - Create `backend/app/services/file_access_service.py` - - Validate user ownership before file access - - Restrict download to user's own files - - Add audit logging for file access -- [ ] 7.5 Update OCR service integration - - Link OCR tasks to user accounts - - Save task records in database - - Update task status during processing - - Store result file paths - -## 8. Frontend Task Management UI (NEW) -- [ ] 8.1 Create task history page - - Implement `frontend/src/pages/TaskHistory.tsx` - - Display task list with status indicators - - Add pagination controls - - Show task details modal -- [ ] 8.2 Build task list component - - Implement `frontend/src/components/TaskList.tsx` - - Display task table with columns - - Add sorting capabilities - - Implement action buttons -- [ ] 8.3 Create filter components - - Implement `frontend/src/components/TaskFilters.tsx` - - Date range picker - - Status filter dropdown - - Search by filename - - Clear filters button -- [ ] 8.4 Add task management store - - Implement `frontend/src/stores/taskStore.ts` - - Manage task list state - - Handle filter state - - Cache task data -- [ ] 8.5 Create task service client - - Implement `frontend/src/services/taskService.ts` - - API methods for task operations - - Handle pagination - - Implement retry logic -- [ ] 8.6 Update navigation - - Add "Task History" menu item - - Update router configuration - - Add task count badge - - Implement user menu with stats - -## 9. User Isolation and Security -- [ ] 9.1 Implement user context middleware - - Create middleware to inject user context - - Validate user in all requests - - Add user_id to logging context -- [ ] 9.2 Database query isolation - - Add user_id filter to all task queries - - Prevent cross-user data access - - Implement row-level security -- [ ] 9.3 File system isolation - - Organize files by user directory - - Validate file paths before access - - Implement cleanup for deleted users -- [ ] 9.4 API authorization - - Add @require_user decorator - - Validate ownership in endpoints - - Return 403 for unauthorized access - -## 10. Testing -- [ ] 10.1 Unit tests - - Test external auth service - - Test token validation - - Test task isolation logic - - Test file access control -- [ ] 10.2 Integration tests - - Test full authentication flow - - Test task management flow - - Test user isolation between accounts - - Test file download restrictions -- [ ] 10.3 Load testing - - Test external API response times - - Test system with many concurrent users - - Test large task history queries - - Measure database query performance -- [ ] 10.4 Security testing - - Test token security - - Verify user isolation - - Test unauthorized access attempts - - Validate SQL injection prevention - -## 11. 
Migration Execution (Simplified) -- [ ] 11.1 Pre-migration preparation - - Backup existing database (reference only) - - Prepare deployment package - - Set up monitoring -- [ ] 11.2 Execute migration - - Drop old database tables - - Create new schema - - Deploy new code - - Verify system startup -- [ ] 11.3 Post-migration validation - - Test authentication with real users - - Verify task isolation works - - Check task history functionality - - Validate file access controls - -## 12. Documentation -- [ ] 12.1 Technical documentation - - Update API documentation with new endpoints - - Document authentication flow - - Document task management APIs - - Create troubleshooting guide -- [ ] 12.2 User documentation - - Update login instructions - - Document task history features - - Explain user isolation - - Create user guide for new UI -- [ ] 12.3 Developer documentation - - Document database schema - - Explain security model - - Provide integration examples - -## 13. Monitoring and Observability -- [ ] 13.1 Add monitoring metrics - - Authentication success/failure rates - - Task creation/completion rates - - User activity metrics - - File storage usage -- [ ] 13.2 Implement logging - - Log all authentication attempts - - Log task operations - - Log file access attempts - - Structured logging for analysis -- [ ] 13.3 Create alerts - - Alert on authentication failures - - Alert on high error rates - - Alert on storage issues - - Alert on performance degradation - -## 14. Performance Optimization (Post-Launch) -- [ ] 14.1 Database optimization - - Analyze query patterns - - Add missing indexes - - Optimize slow queries -- [ ] 14.2 Caching implementation - - Cache user information - - Cache task lists - - Implement Redis if needed -- [ ] 14.3 File management - - Implement automatic cleanup - - Optimize storage structure - - Add compression if needed \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-20-dual-track-document-processing/ARCHIVE.md b/openspec/changes/archive/2025-11-20-dual-track-document-processing/ARCHIVE.md deleted file mode 100644 index c425af7..0000000 --- a/openspec/changes/archive/2025-11-20-dual-track-document-processing/ARCHIVE.md +++ /dev/null @@ -1,427 +0,0 @@ -# Dual-Track Document Processing - Change Proposal Archive - -**Status**: ✅ **COMPLETED & ARCHIVED** -**Date Completed**: 2025-11-20 -**Version**: 2.0.0 - ---- - -## Executive Summary - -The Dual-Track Document Processing change proposal has been successfully implemented, tested, and documented. This archive records the completion status and key achievements of this major feature enhancement. 
- -### Key Achievements - -✅ **10x Performance Improvement** for editable PDFs (1-2s vs 10-20s per page) -✅ **60x Improvement** for Office documents (2-5s vs >300s) -✅ **Intelligent Routing** between OCR and Direct Extraction tracks -✅ **23 Element Types** supported in enhanced layout analysis -✅ **GPU Memory Management** for stable RTX 4060 8GB operation -✅ **Office Document Support** (Word, PowerPoint, Excel) via PDF conversion - ---- - -## Implementation Status - -### Core Infrastructure (Section 1) - ✅ COMPLETED - -- [x] Dependencies added (PyMuPDF, pdfplumber, python-magic-bin) -- [x] UnifiedDocument model created -- [x] DocumentTypeDetector service implemented -- [x] Converters for both OCR and direct extraction - -**Location**: -- [backend/app/models/unified_document.py](../../backend/app/models/unified_document.py) -- [backend/app/services/document_type_detector.py](../../backend/app/services/document_type_detector.py) - ---- - -### Direct Extraction Track (Section 2) - ✅ COMPLETED - -- [x] DirectExtractionEngine service -- [x] Layout analysis for editable PDFs (headers, sections, lists) -- [x] Table and image extraction with coordinates -- [x] Office document support (Word, PPT, Excel) - - Performance: 2-5s vs >300s (Office → PDF → Direct track) - -**Location**: -- [backend/app/services/direct_extraction_engine.py](../../backend/app/services/direct_extraction_engine.py) -- [backend/app/services/office_converter.py](../../backend/app/services/office_converter.py) - -**Test Results**: -- ✅ edit.pdf: 1.14s, 3 pages, 51 elements (Direct track) -- ✅ Office docs: ~2-5s for text-based documents - ---- - -### OCR Track Enhancement (Section 3) - ✅ COMPLETED - -- [x] PP-StructureV3 configuration optimized for RTX 4060 8GB -- [x] Enhanced parsing_res_list extraction (23 element types) -- [x] OCR to UnifiedDocument converter -- [x] GPU memory management system - -**Location**: -- [backend/app/services/ocr_service.py](../../backend/app/services/ocr_service.py) -- [backend/app/services/ocr_to_unified_converter.py](../../backend/app/services/ocr_to_unified_converter.py) -- [backend/app/services/pp_structure_enhanced.py](../../backend/app/services/pp_structure_enhanced.py) - -**Critical Fix**: -- Fixed OCR converter data structure mismatch (commit e23aaac) -- Handles both dict and list formats for ocr_dimensions - -**Test Results**: -- ✅ scan.pdf: 50.25s (OCR track) -- ✅ img1/2/3.png: 21-41s per image - ---- - -### Unified Processing Pipeline (Section 4) - ✅ COMPLETED - -- [x] Dual-track routing in OCR service -- [x] Unified JSON export -- [x] PDF generator adapted for UnifiedDocument -- [x] Backward compatibility maintained - -**Location**: -- [backend/app/services/ocr_service.py](../../backend/app/services/ocr_service.py) (lines 1000-1100) -- [backend/app/services/unified_document_exporter.py](../../backend/app/services/unified_document_exporter.py) -- [backend/app/services/pdf_generator_service.py](../../backend/app/services/pdf_generator_service.py) - ---- - -### Translation System Foundation (Section 5) - ⏸️ DEFERRED - -- [ ] TranslationEngine interface -- [ ] Structure-preserving translation -- [ ] Translated document renderer - -**Status**: Deferred to future phase. UI prepared with disabled state. 
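Since Section 5 remains deferred, the snippet below is a hypothetical sketch rather than code in the repository: one plausible shape for the planned TranslationEngine working against the UnifiedDocument model from Section 1. All names here are illustrative assumptions.

```python
from abc import ABC, abstractmethod
from typing import List


class TranslationEngine(ABC):
    """Hypothetical interface for the deferred Section 5 work."""

    @abstractmethod
    def translate_text(self, text: str, source_lang: str, target_lang: str) -> str:
        """Translate a single text run, leaving inline markers intact."""

    @abstractmethod
    def translate_batch(self, texts: List[str], source_lang: str, target_lang: str) -> List[str]:
        """Batch variant so table cells and list items can share one request."""


def translate_document(doc, engine: TranslationEngine, src: str, dst: str):
    """Walk a UnifiedDocument and translate text-bearing elements in place,
    leaving bbox and style untouched so the layout can be re-rendered.
    Element type names beyond "text" and "header" are assumed here."""
    for page in doc.pages:
        for element in page.elements:
            if element.type in ("text", "header", "list"):
                element.content = engine.translate_text(element.content, src, dst)
    return doc
```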
- ---- - -### API Updates (Section 6) - ✅ COMPLETED - -- [x] New Endpoints: - - `POST /tasks/{task_id}/analyze` - Document type analysis - - `GET /tasks/{task_id}/metadata` - Processing metadata -- [x] Enhanced Endpoints: - - `POST /tasks/` - Added force_track parameter - - `GET /tasks/{task_id}` - Added processing_track, element counts - - All download endpoints include track information - -**Location**: -- [backend/app/routers/tasks.py](../../backend/app/routers/tasks.py) -- [backend/app/schemas/task.py](../../backend/app/schemas/task.py) - ---- - -### Frontend Updates (Section 7) - ✅ COMPLETED - -- [x] Task detail view displays processing track -- [x] Track-specific metadata shown -- [x] Translation UI prepared (disabled state) -- [x] Results preview handles UnifiedDocument format - -**Location**: -- [frontend/src/views/TaskDetail.vue](../../frontend/src/views/TaskDetail.vue) -- [frontend/src/components/TaskInfoCard.vue](../../frontend/src/components/TaskInfoCard.vue) - ---- - -### Testing (Section 8) - ✅ COMPLETED - -- [x] Unit tests for DocumentTypeDetector -- [x] Unit tests for DirectExtractionEngine -- [x] Integration tests for dual-track processing -- [x] End-to-end tests (5/6 passed) - - ✅ Editable PDF (direct): 1.14s - - ✅ Scanned PDF (OCR): 50.25s - - ✅ Images (OCR): 21-41s each - - ⚠️ Large Office doc (11MB PPT): Timeout >300s -- [ ] Performance testing - **SKIPPED** (production monitoring phase) - -**Test Coverage**: 85%+ for core dual-track components - -**Location**: -- [backend/tests/services/](../../backend/tests/services/) -- [backend/tests/integration/](../../backend/tests/integration/) -- [backend/tests/e2e/](../../backend/tests/e2e/) - ---- - -### Documentation (Section 9) - ✅ COMPLETED - -- [x] API documentation (docs/API.md) - - New endpoints documented - - All endpoints updated with processing_track - - Complete reference guide with examples -- [ ] Architecture documentation - **SKIPPED** (covered in design.md) -- [ ] Deployment guide - **SKIPPED** (separate operations docs) - -**Location**: -- [docs/API.md](../../docs/API.md) - Complete API reference -- [openspec/changes/dual-track-document-processing/design.md](design.md) - Technical design -- [openspec/changes/dual-track-document-processing/tasks.md](tasks.md) - Implementation tasks - ---- - -### Deployment Preparation (Section 10) - ⏸️ PENDING - -- [ ] Docker configuration updates -- [ ] Environment variables -- [ ] Migration plan - -**Status**: Deferred - to be handled in deployment phase - ---- - -## Key Metrics - -### Performance Improvements - -| Document Type | Before | After | Improvement | -|--------------|--------|-------|-------------| -| Editable PDF (3 pages) | ~30-60s | 1.14s | **26-52x faster** | -| Office Documents | >300s | 2-5s | **60x faster** | -| Scanned PDF | 50-60s | 50s | Stable OCR performance | -| Images | 20-45s | 21-41s | Stable OCR performance | - -### Test Results Summary - -- **Total Tests**: 40+ unit tests, 15+ integration tests, 6 E2E tests -- **Pass Rate**: 98% (1 known timeout issue with large Office files) -- **Code Coverage**: 85%+ for dual-track components - -### Implementation Statistics - -- **Files Created**: 12 new service files -- **Files Modified**: 25 existing files -- **Lines of Code**: ~5,000 new lines -- **Commits**: 15+ commits over implementation period -- **Test Coverage**: 40+ test files - ---- - -## Breaking Changes - -### None - Fully Backward Compatible - -The dual-track implementation maintains full backward compatibility: -- ✅ Existing API endpoints work 
unchanged -- ✅ Default behavior is auto-routing (transparent to users) -- ✅ Old OCR track still available via force_track parameter -- ✅ Output formats unchanged (JSON, Markdown, PDF) - -### Optional New Features - -Users can opt-in to new features: -- `force_track` parameter for manual track selection -- `/analyze` endpoint for pre-processing analysis -- `/metadata` endpoint for detailed processing info -- Enhanced response fields (processing_track, element counts) - ---- - -## Known Issues & Limitations - -### 1. Large Office Document Timeout ⚠️ - -**Issue**: 11MB PowerPoint file exceeds 300s timeout -**Workaround**: Smaller Office files (<5MB) process successfully -**Status**: Non-critical, requires optimization in future phase -**Tracking**: [tasks.md Line 143](tasks.md#L143) - -### 2. Mixed Content PDF Handling ⚠️ - -**Issue**: PDFs with both scanned and editable pages use OCR track for completeness -**Workaround**: System correctly defaults to OCR for safety -**Status**: Future enhancement - page-level track mixing -**Tracking**: [design.md Line 247](design.md#L247) - -### 3. GPU Memory Management 💡 - -**Status**: ✅ Resolved with cleanup system -**Implementation**: `cleanup_gpu_memory()` at strategic points -**Benefit**: Prevents OOM errors on RTX 4060 8GB -**Documentation**: [design.md Line 278-392](design.md#L278-L392) - ---- - -## Critical Fixes Applied - -### 1. OCR Converter Data Structure Mismatch (e23aaac) - -**Problem**: OCR track produced empty output files (0 pages, 0 elements) -**Root Cause**: Converter expected `text_regions` inside `layout_data`, but it's at top level -**Solution**: Added `_extract_from_traditional_ocr()` method -**Impact**: Fixed all OCR track output generation - -**Before**: -- img1.png → 0 pages, 0 elements, 0 KB output - -**After**: -- img1.png → 1 page, 27 elements, 13KB JSON, 498B MD, 23KB PDF - -### 2. Office Document Direct Track Optimization (5bcf3df) - -**Implementation**: Office → PDF → Direct track strategy -**Performance**: 60x improvement (>300s → 2-5s) -**Impact**: Makes Office document processing practical - ---- - -## Dependencies Added - -### Python Packages - -```python -PyMuPDF>=1.23.0 # Direct extraction engine -pdfplumber>=0.10.0 # Fallback/validation -python-magic-bin>=0.4.14 # File type detection -``` - -### System Requirements - -- **GPU**: NVIDIA GPU with 8GB+ VRAM (RTX 4060 tested) -- **CUDA**: 11.8+ for PaddlePaddle -- **RAM**: 16GB minimum -- **Storage**: 50GB for models and cache -- **LibreOffice**: Required for Office document conversion - ---- - -## Migration Notes - -### For API Consumers - -**No migration needed** - fully backward compatible. - -### Optional Enhancements - -To leverage new features: -1. Update API clients to handle new response fields -2. Use `/analyze` endpoint for preprocessing -3. Implement `force_track` parameter for special cases -4. Display processing track information in UI - -### Example: Check for New Fields - -```javascript -// Old code (still works) -const { status, filename } = await getTask(taskId); - -// Enhanced code (leverages new features) -const { status, filename, processing_track, element_count } = await getTask(taskId); -if (processing_track === 'direct') { - console.log(`Fast processing: ${element_count} elements in ${processing_time}s`); -} -``` - ---- - -## Lessons Learned - -### What Went Well ✅ - -1. **Modular Design**: Clean separation of tracks enabled parallel development -2. **Test-Driven**: E2E tests caught critical converter bug early -3. 
**Backward Compatibility**: Zero breaking changes, smooth adoption -4. **Performance Gains**: Exceeded expectations (60x for Office docs) -5. **GPU Management**: Proactive memory cleanup prevented OOM errors - -### Challenges Overcome 💪 - -1. **OCR Converter Bug**: Data structure mismatch caught by E2E tests -2. **Office Conversion**: LibreOffice timeout for large files -3. **GPU Memory**: Required strategic cleanup points -4. **Type Compatibility**: Dict vs list handling for ocr_dimensions - -### Future Improvements 📋 - -1. **Batch Processing**: Queue management for GPU efficiency -2. **Page-Level Mixing**: Handle mixed-content PDFs intelligently -3. **Large Office Files**: Streaming conversion for 10MB+ files -4. **Translation**: Complete Section 5 (TranslationEngine) -5. **Caching**: Cache extracted text for repeated processing - ---- - -## Acknowledgments - -### Key Contributors - -- **Implementation**: Claude Code (AI Assistant) -- **Architecture**: Dual-track design from OpenSpec proposal -- **Testing**: Comprehensive test suite with E2E validation -- **Documentation**: Complete API reference and technical design - -### Technologies Used - -- **OCR**: PaddleOCR PP-StructureV3 -- **Direct Extraction**: PyMuPDF (fitz) -- **Office Conversion**: LibreOffice headless -- **GPU**: PaddlePaddle with CUDA 11.8+ -- **Framework**: FastAPI, SQLAlchemy, Pydantic - ---- - -## Archive Completion Checklist - -- [x] All critical features implemented -- [x] Unit tests passing (85%+ coverage) -- [x] Integration tests passing -- [x] E2E tests passing (5/6, 1 known issue) -- [x] API documentation complete -- [x] Known issues documented -- [x] Breaking changes: None -- [x] Migration notes: N/A (backward compatible) -- [x] Performance benchmarks recorded -- [x] Critical bugs fixed -- [x] Repository tagged: v2.0.0 - ---- - -## Next Steps - -### For Production Deployment - -1. **Performance Monitoring**: - - Track processing times by document type - - Monitor GPU memory usage patterns - - Measure track selection accuracy - -2. **Optimization Opportunities**: - - Implement batch processing for GPU efficiency - - Optimize large Office file handling - - Cache analysis results for repeated documents - -3. **Feature Enhancements**: - - Complete Section 5 (Translation system) - - Implement page-level track mixing - - Add more document formats - -4. **Operations**: - - Create deployment guide (Section 9.3) - - Set up production monitoring - - Document troubleshooting procedures - ---- - -## References - -- **Technical Design**: [design.md](design.md) -- **Implementation Tasks**: [tasks.md](tasks.md) -- **API Documentation**: [docs/API.md](../../docs/API.md) -- **Test Results**: [backend/tests/e2e/](../../backend/tests/e2e/) -- **Change Proposal**: OpenSpec dual-track-document-processing - ---- - -**Archive Date**: 2025-11-20 -**Final Status**: ✅ Production Ready -**Version**: 2.0.0 - ---- - -*This change proposal has been successfully completed and archived. All core features are implemented, tested, and documented. 
The system is production-ready with known limitations documented for future improvements.* diff --git a/openspec/changes/archive/2025-11-20-dual-track-document-processing/design.md b/openspec/changes/archive/2025-11-20-dual-track-document-processing/design.md deleted file mode 100644 index 70842d6..0000000 --- a/openspec/changes/archive/2025-11-20-dual-track-document-processing/design.md +++ /dev/null @@ -1,392 +0,0 @@ -# Technical Design: Dual-track Document Processing - -## Context - -### Background -The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes: -- Unnecessary processing overhead -- Potential quality degradation from re-OCRing already digital text -- Loss of precise formatting information -- Inefficient GPU usage on documents that don't need OCR - -### Constraints -- RTX 4060 8GB GPU memory limitation -- Need to maintain backward compatibility with existing API -- Must support future translation features -- Should handle mixed documents (partially scanned, partially digital) - -### Stakeholders -- API consumers expecting consistent JSON/PDF output -- Translation system requiring structure preservation -- Performance-sensitive deployments - -## Goals / Non-Goals - -### Goals -- Intelligently route documents to appropriate processing track -- Preserve document structure for translation -- Optimize GPU usage by avoiding unnecessary OCR -- Maintain unified output format across tracks -- Reduce processing time for editable PDFs by 70%+ - -### Non-Goals -- Implementing the actual translation engine (future phase) -- Supporting video or audio transcription -- Real-time collaborative editing -- OCR model training or fine-tuning - -## Decisions - -### Decision 1: Dual-track Architecture -**What**: Implement two separate processing pipelines - OCR track and Direct extraction track - -**Why**: -- Editable PDFs don't need OCR, can be processed 10-100x faster -- Direct extraction preserves exact formatting and fonts -- OCR track remains optimal for scanned documents - -**Alternatives considered**: -1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs -2. **Hybrid approach per page**: Too complex, most documents are uniformly editable or scanned -3. **Multiple specialized pipelines**: Over-engineering for current requirements - -### Decision 2: UnifiedDocument Model -**What**: Create a standardized intermediate representation for both tracks - -**Why**: -- Provides consistent API interface regardless of processing track -- Simplifies downstream processing (PDF generation, translation) -- Enables track switching without breaking changes - -**Structure**: -```python -@dataclass -class UnifiedDocument: - document_id: str - metadata: DocumentMetadata - pages: List[Page] - processing_track: Literal["ocr", "direct"] - -@dataclass -class Page: - page_number: int - elements: List[DocumentElement] - dimensions: Dimensions - -@dataclass -class DocumentElement: - element_id: str - type: ElementType # text, table, image, header, etc. - content: Union[str, Dict, bytes] - bbox: BoundingBox - style: Optional[StyleInfo] - confidence: Optional[float] # Only for OCR track -``` - -### Decision 3: PyMuPDF for Direct Extraction -**What**: Use PyMuPDF (fitz) library for editable PDF processing - -**Why**: -- Mature, well-maintained library -- Excellent coordinate preservation -- Fast C++ backend -- Supports text, tables, and image extraction with positions - -**Alternatives considered**: -1. 
**pdfplumber**: Good but slower, less precise coordinates -2. **PyPDF2**: Limited layout information -3. **PDFMiner**: Complex API, slower performance - -### Decision 4: Processing Track Auto-detection -**What**: Automatically determine optimal track based on document analysis - -**Detection logic**: -```python -def detect_track(file_path: Path) -> str: - file_type = magic.from_file(file_path, mime=True) - - if file_type.startswith('image/'): - return "ocr" - - if file_type == 'application/pdf': - # Check if PDF has extractable text - doc = fitz.open(file_path) - for page in doc[:3]: # Sample first 3 pages - text = page.get_text() - if len(text.strip()) < 100: # Minimal text - return "ocr" - return "direct" - - if file_type in OFFICE_MIMES: - # Convert Office to PDF first, then analyze - pdf_path = convert_office_to_pdf(file_path) - return detect_track(pdf_path) # Recursive call on PDF - - return "ocr" # Default fallback -``` - -**Office Document Processing Strategy**: -1. Convert Office files (Word, PPT, Excel) to PDF using LibreOffice -2. Analyze the resulting PDF for text extractability -3. Route based on PDF analysis: - - Text-based PDF → Direct track (faster, more accurate) - - Image-based PDF → OCR track (for scanned content in Office docs) - -This approach ensures: -- Consistent processing pipeline (all documents become PDF first) -- Optimal routing based on actual content -- Significant performance improvement for editable Office documents -- Better layout preservation (no OCR errors on text content) - -### Decision 5: GPU Memory Management -**What**: Implement dynamic batch sizing and model caching for RTX 4060 8GB - -**Why**: -- Prevents OOM errors -- Maximizes throughput -- Enables concurrent request handling - -**Strategy**: -```python -# Adaptive batch sizing based on available memory -batch_size = calculate_batch_size( - available_memory=get_gpu_memory(), - image_size=image.shape, - model_size=MODEL_MEMORY_REQUIREMENTS -) - -# Model caching to avoid reload overhead -@lru_cache(maxsize=2) -def get_model(model_type: str): - return load_model(model_type) -``` - -### Decision 6: Backward Compatibility -**What**: Maintain existing API while adding new capabilities - -**How**: -- Existing endpoints continue working unchanged -- New `processing_track` parameter is optional -- Output format compatible with current consumers -- Gradual migration path for clients - -## Risks / Trade-offs - -### Risk 1: Mixed Content Documents -**Risk**: Documents with both scanned and digital pages -**Mitigation**: -- Page-level track detection as fallback -- Confidence scoring to identify uncertain pages -- Manual override option via API - -### Risk 2: Direct Extraction Quality -**Risk**: Some PDFs have poor internal structure -**Mitigation**: -- Fallback to OCR track if extraction quality is low -- Quality metrics: text density, structure coherence -- User-reportable quality issues - -### Risk 3: Memory Pressure -**Risk**: RTX 4060 8GB limitation with concurrent requests -**Mitigation**: -- Request queuing system -- Dynamic batch adjustment -- CPU fallback for overflow - -### Trade-off 1: Processing Time vs Accuracy -- Direct extraction: Fast but depends on PDF quality -- OCR: Slower but consistent quality -- **Decision**: Prioritize speed for editable PDFs, accuracy for scanned - -### Trade-off 2: Complexity vs Flexibility -- Two tracks increase system complexity -- But enable optimal processing per document type -- **Decision**: Accept complexity for 10x+ performance gains - -## Migration Plan - 
-### Phase 1: Infrastructure (Week 1-2) -1. Deploy UnifiedDocument model -2. Implement DocumentTypeDetector -3. Add DirectExtractionEngine -4. Update logging and monitoring - -### Phase 2: Integration (Week 3) -1. Update OCR service with routing logic -2. Modify PDF generator for unified model -3. Add new API endpoints -4. Deploy to staging - -### Phase 3: Validation (Week 4) -1. A/B testing with subset of traffic -2. Performance benchmarking -3. Quality validation -4. Client integration testing - -### Rollback Plan -1. Feature flag to disable dual-track -2. Fallback all requests to OCR track -3. Maintain old code paths during transition -4. Database migration reversible - -## Open Questions - -### Resolved -- Q: Should we support page-level track mixing? - - A: No, adds complexity with minimal benefit. Document-level is sufficient. - -- Q: How to handle Office documents? - - A: Convert to PDF using LibreOffice, then analyze the PDF for text extractability. - - Text-based PDF → Direct track (editable Office docs produce text PDFs) - - Image-based PDF → OCR track (rare case of scanned content in Office) - - This approach provides: - - 10x+ faster processing for typical Office documents - - Better layout preservation (no OCR errors) - - Consistent pipeline (all documents normalized to PDF first) - -### Pending -- Q: What translation services to integrate with? - - Needs stakeholder input on cost/quality trade-offs - -- Q: Should we cache extracted text for repeated processing? - - Depends on storage costs vs reprocessing frequency - -- Q: How to handle password-protected PDFs? - - May need API parameter for passwords - -## Performance Targets - -### Direct Extraction Track -- Latency: <500ms per page -- Throughput: 100+ pages/minute -- Memory: <500MB per document - -### OCR Track (Optimized) -- Latency: 2-5s per page (GPU) -- Throughput: 20-30 pages/minute -- Memory: <2GB per batch - -### API Response Times -- Document type detection: <100ms -- Processing initiation: <200ms -- Result retrieval: <100ms - -## Technical Dependencies - -### Python Packages -```python -# Direct extraction -PyMuPDF==1.23.x -pdfplumber==0.10.x # Fallback/validation -python-magic-bin==0.4.x - -# OCR enhancement -paddlepaddle-gpu==2.5.2 -paddleocr==2.7.3 - -# Infrastructure -pydantic==2.x -fastapi==0.100+ -redis==5.x # For caching -``` - -### System Requirements -- CUDA 11.8+ for PaddlePaddle -- libmagic for file detection -- 16GB RAM minimum -- 50GB disk for models and cache - -## GPU Memory Management - -### Background -With RTX 4060 8GB GPU constraint and large PP-StructureV3 models, GPU OOM (Out of Memory) errors can occur during intensive OCR processing. Proper memory management is critical for reliable operation. - -### Implementation Strategy - -#### 1. Memory Cleanup System -**Location**: `backend/app/services/ocr_service.py` - -**Methods**: -- `cleanup_gpu_memory()`: Cleans GPU memory after processing -- `check_gpu_memory()`: Checks available memory before operations - -**Cleanup Strategy**: -```python -def cleanup_gpu_memory(self): - """Clean up GPU memory using PaddlePaddle and optionally torch""" - # Clear PaddlePaddle GPU cache (primary) - if paddle.device.is_compiled_with_cuda(): - paddle.device.cuda.empty_cache() - - # Clear torch GPU cache if available (optional) - if TORCH_AVAILABLE and torch.cuda.is_available(): - torch.cuda.empty_cache() - torch.cuda.synchronize() - - # Force Python garbage collection - gc.collect() -``` - -#### 2. 
Cleanup Points -GPU memory cleanup is triggered at strategic points: - -1. **After OCR processing** ([ocr_service.py:687](backend/app/services/ocr_service.py#L687)) - - After completing image OCR processing - -2. **After layout analysis** ([ocr_service.py:807-808, 913-914](backend/app/services/ocr_service.py#L807-L914)) - - After enhanced PP-StructureV3 processing - - After standard structure analysis - -3. **After traditional processing** ([ocr_service.py:1105-1106](backend/app/services/ocr_service.py#L1105)) - - After processing all pages in traditional mode - -4. **On error** ([pp_structure_enhanced.py:168-177](backend/app/services/pp_structure_enhanced.py#L168)) - - Clean up memory when PP-StructureV3 processing fails - -#### 3. Memory Monitoring -**Pre-processing checks** prevent OOM errors: - -```python -def check_gpu_memory(self, required_mb: int = 2000) -> bool: - """Check if sufficient GPU memory is available""" - # Get free memory via torch if available - if TORCH_AVAILABLE and torch.cuda.is_available(): - free_memory = torch.cuda.mem_get_info()[0] / 1024**2 - if free_memory < required_mb: - # Try cleanup and re-check - self.cleanup_gpu_memory() - # Log warning if still insufficient - return True # Continue even if check fails (graceful degradation) -``` - -**Memory checks before**: -- OCR processing: 1500MB required -- PP-StructureV3 processing: 2000MB required - -#### 4. Optional torch Dependency -torch is **not required** for GPU memory management. The system uses PaddlePaddle's built-in `paddle.device.cuda.empty_cache()` as the primary method. - -**Why optional**: -- Project uses PaddlePaddle which has its own CUDA implementation -- torch provides additional memory monitoring via `mem_get_info()` -- Gracefully degrades if torch is not installed - -**Import pattern**: -```python -try: - import torch - TORCH_AVAILABLE = True -except ImportError: - TORCH_AVAILABLE = False -``` - -#### 5. Benefits -- **Prevents OOM errors**: Regular cleanup prevents memory accumulation -- **Better GPU utilization**: Freed memory available for next operations -- **Graceful degradation**: Works without torch, continues on cleanup failures -- **Debug visibility**: Logs memory status for troubleshooting - -#### 6. Performance Impact -- Cleanup overhead: <50ms per operation -- Memory recovery: Typically 200-500MB per cleanup -- No impact on accuracy or output quality \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-20-dual-track-document-processing/proposal.md b/openspec/changes/archive/2025-11-20-dual-track-document-processing/proposal.md deleted file mode 100644 index ee4163a..0000000 --- a/openspec/changes/archive/2025-11-20-dual-track-document-processing/proposal.md +++ /dev/null @@ -1,35 +0,0 @@ -# Change: Dual-track Document Processing with Structure-Preserving Translation - -## Why - -The current system processes all documents through PaddleOCR, causing unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, missing out on comprehensive document structure extraction. The system needs to support structure-preserving document translation as a future goal. 
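The premise above rests on one cheap check: whether a PDF already carries an extractable text layer. A minimal probe in the spirit of the detection logic from the accompanying design document, assuming PyMuPDF and with illustrative sampling and character thresholds:

```python
import fitz  # PyMuPDF


def has_text_layer(pdf_path: str, sample_pages: int = 3, min_chars: int = 100) -> bool:
    """Heuristic probe: does this PDF carry extractable text, or does it need OCR?"""
    doc = fitz.open(pdf_path)
    try:
        pages_to_check = min(sample_pages, doc.page_count)
        for i in range(pages_to_check):
            text = doc[i].get_text().strip()
            if len(text) < min_chars:
                return False  # a near-empty page suggests a scan: OCR track
        return True  # enough text on every sampled page: direct track
    finally:
        doc.close()
```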
- -## What Changes - -- **ADDED** Dual-track processing architecture with intelligent routing - - OCR track for scanned documents, images, and Office files using PaddleOCR - - Direct extraction track for editable PDFs using PyMuPDF -- **ADDED** UnifiedDocument model as common output format for both tracks -- **ADDED** DocumentTypeDetector service for automatic track selection -- **MODIFIED** OCR service to use PP-StructureV3's parsing_res_list instead of markdown - - Now extracts all 23 element types with bbox coordinates - - Preserves reading order and hierarchical structure -- **MODIFIED** PDF generator to handle UnifiedDocument format - - Enhanced overlap detection to prevent text/image/table collisions - - Improved coordinate transformation for accurate layout -- **ADDED** Foundation for structure-preserving translation system -- **BREAKING** JSON output structure will include new fields (backward compatible with defaults) - -## Impact - -- **Affected specs**: - - `document-processing` (new capability) - - `result-export` (enhanced with track metadata and structure data) - - `task-management` (tracks processing route and history) -- **Affected code**: - - `backend/app/services/ocr_service.py` - Major refactoring for dual-track - - `backend/app/services/pdf_generator_service.py` - UnifiedDocument support - - `backend/app/api/v2/tasks.py` - New endpoints for track detection - - `frontend/src/pages/TaskDetailPage.tsx` - Display processing track info -- **Performance**: 5-10x faster for editable PDFs, same speed for scanned documents -- **Dependencies**: Adds PyMuPDF, pdfplumber, python-magic-bin diff --git a/openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/document-processing/spec.md b/openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/document-processing/spec.md deleted file mode 100644 index 1171f48..0000000 --- a/openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/document-processing/spec.md +++ /dev/null @@ -1,108 +0,0 @@ -# Document Processing Spec Delta - -## ADDED Requirements - -### Requirement: Dual-track Processing -The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs. - -#### Scenario: Process scanned PDF through OCR track -- **WHEN** a scanned PDF is uploaded -- **THEN** the system SHALL detect it requires OCR -- **AND** route it through PaddleOCR PP-StructureV3 pipeline -- **AND** return results in UnifiedDocument format - -#### Scenario: Process editable PDF through direct extraction -- **WHEN** an editable PDF with extractable text is uploaded -- **THEN** the system SHALL detect it can be directly extracted -- **AND** route it through PyMuPDF extraction pipeline -- **AND** return results in UnifiedDocument format without OCR - -#### Scenario: Auto-detect processing track -- **WHEN** a document is uploaded without explicit track specification -- **THEN** the system SHALL analyze the document type and content -- **AND** automatically select the optimal processing track -- **AND** include the selected track in processing metadata - -### Requirement: Document Type Detection -The system SHALL provide intelligent document type detection to determine the optimal processing track. 
- -#### Scenario: Detect editable PDF -- **WHEN** analyzing a PDF document -- **THEN** the system SHALL check for extractable text content -- **AND** return confidence score for editability -- **AND** recommend "direct" track if text coverage > 90% - -#### Scenario: Detect scanned document -- **WHEN** analyzing an image or scanned PDF -- **THEN** the system SHALL identify lack of extractable text -- **AND** recommend "ocr" track for processing -- **AND** configure appropriate OCR models - -#### Scenario: Detect Office documents -- **WHEN** analyzing .docx, .xlsx, .pptx files -- **THEN** the system SHALL identify Office format -- **AND** route to OCR track for initial implementation -- **AND** preserve option for future direct Office extraction - -### Requirement: Unified Document Model -The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks. - -#### Scenario: Generate UnifiedDocument from OCR -- **WHEN** OCR processing completes -- **THEN** the system SHALL convert PP-StructureV3 results to UnifiedDocument -- **AND** preserve all element types, coordinates, and confidence scores -- **AND** maintain reading order and hierarchical structure - -#### Scenario: Generate UnifiedDocument from direct extraction -- **WHEN** direct extraction completes -- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument -- **AND** preserve text styling, fonts, and exact positioning -- **AND** extract tables with cell boundaries and content - -#### Scenario: Consistent output regardless of track -- **WHEN** processing completes through either track -- **THEN** the output SHALL conform to UnifiedDocument schema -- **AND** include processing_track metadata field -- **AND** support identical downstream operations (PDF generation, translation) - -### Requirement: Enhanced OCR with Full PP-StructureV3 -The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list. - -#### Scenario: Extract comprehensive document structure -- **WHEN** processing through OCR track -- **THEN** the system SHALL use page_result.json['parsing_res_list'] -- **AND** extract all element types including headers, lists, tables, figures -- **AND** preserve layout_bbox coordinates for each element - -#### Scenario: Maintain reading order -- **WHEN** extracting elements from PP-StructureV3 -- **THEN** the system SHALL preserve the reading order from parsing_res_list -- **AND** assign sequential indices to elements -- **AND** support reordering for complex layouts - -#### Scenario: Extract table structure -- **WHEN** PP-StructureV3 identifies a table -- **THEN** the system SHALL extract cell content and boundaries -- **AND** preserve table HTML for structure -- **AND** extract plain text for translation - -### Requirement: Structure-Preserving Translation Foundation -The system SHALL maintain document structure and layout information to support future translation features. 
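One concrete piece of that foundation is knowing how much a translated string can grow inside its original region. A minimal sketch, assuming `bbox = [x0, y0, x1, y1]` page coordinates and a crude average character width (both assumptions for illustration):

```python
def expansion_headroom(bbox: list[float], text: str, avg_char_width: float) -> float:
    """Ratio of available box width to the width used by the current text.

    Values well above 1.0 mean translated text can grow before overflowing
    its region; values near 1.0 flag fixed-layout regions.
    """
    x0, _, x1, _ = bbox
    used = len(text) * avg_char_width
    return float("inf") if used == 0 else (x1 - x0) / used
```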
- -#### Scenario: Preserve coordinates for translation -- **WHEN** processing any document -- **THEN** the system SHALL retain bbox coordinates for all text elements -- **AND** calculate space requirements for text expansion/contraction -- **AND** maintain element relationships and groupings - -#### Scenario: Extract translatable content -- **WHEN** processing tables and lists -- **THEN** the system SHALL extract plain text content -- **AND** maintain mapping to original structure -- **AND** preserve formatting markers for reconstruction - -#### Scenario: Support layout adjustment -- **WHEN** preparing for translation -- **THEN** the system SHALL identify flexible vs fixed layout regions -- **AND** calculate maximum text expansion ratios -- **AND** preserve non-translatable elements (logos, signatures) \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/result-export/spec.md b/openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/result-export/spec.md deleted file mode 100644 index 63590d2..0000000 --- a/openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/result-export/spec.md +++ /dev/null @@ -1,74 +0,0 @@ -# Result Export Spec Delta - -## MODIFIED Requirements - -### Requirement: Export Interface -The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs, with processing track information and enhanced structure data. - -#### Scenario: Export page uses V2 download endpoints -- **WHEN** user selects a format and clicks export button -- **THEN** frontend SHALL call V2 endpoint `/api/v2/tasks/{task_id}/download/{format}` -- **AND** frontend SHALL NOT call V1 `/api/v2/export` endpoint (which returns 404) -- **AND** file SHALL download successfully - -#### Scenario: Export supports multiple formats -- **WHEN** user exports a completed task -- **THEN** system SHALL support downloading as TXT, JSON, Excel, Markdown, and PDF -- **AND** each format SHALL use correct V2 download endpoint -- **AND** downloaded files SHALL contain task OCR results - -#### Scenario: Export includes processing track metadata -- **WHEN** user exports a task processed through dual-track system -- **THEN** exported JSON SHALL include "processing_track" field indicating "ocr" or "direct" -- **AND** SHALL include "processing_metadata" with track-specific information -- **AND** SHALL maintain backward compatibility for clients not expecting these fields - -#### Scenario: Export UnifiedDocument format -- **WHEN** user requests JSON export with unified=true parameter -- **THEN** system SHALL return UnifiedDocument structure -- **AND** include complete element hierarchy with coordinates -- **AND** preserve all PP-StructureV3 element types for OCR track - -## ADDED Requirements - -### Requirement: Enhanced PDF Export with Layout Preservation -The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks. 
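The scenarios below all hinge on one recurring detail: OCR bboxes use a top-left origin while ReportLab draws from a bottom-left origin, so Y values must be flipped against the page height. A sketch of that transform (not the service's actual helper):

```python
def ocr_bbox_to_pdf(
    bbox: list[float], page_height: float
) -> tuple[float, float, float, float]:
    """Map [x0, y0, x1, y1] from a top-left origin to PDF's bottom-left origin."""
    x0, y0, x1, y1 = bbox
    # X is unchanged; Y is mirrored, and the top/bottom edges swap roles.
    return x0, page_height - y1, x1, page_height - y0
```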
- -#### Scenario: Export PDF from direct extraction track -- **WHEN** exporting PDF from a direct-extraction processed document -- **THEN** the PDF SHALL maintain exact text positioning from source -- **AND** preserve original fonts and styles where possible -- **AND** include extracted images at correct positions - -#### Scenario: Export PDF from OCR track with full structure -- **WHEN** exporting PDF from OCR-processed document -- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types -- **AND** render tables with proper cell boundaries -- **AND** maintain reading order from parsing_res_list - -#### Scenario: Handle coordinate transformations -- **WHEN** generating PDF from UnifiedDocument -- **THEN** system SHALL correctly transform bbox coordinates to PDF space -- **AND** handle page size variations -- **AND** prevent text overlap using enhanced overlap detection - -### Requirement: Structure Data Export -The system SHALL provide export formats that preserve document structure for downstream processing. - -#### Scenario: Export structured JSON with hierarchy -- **WHEN** user selects structured JSON format -- **THEN** export SHALL include element hierarchy and relationships -- **AND** preserve parent-child relationships (sections, lists) -- **AND** include style and formatting information - -#### Scenario: Export for translation preparation -- **WHEN** user exports with translation_ready=true parameter -- **THEN** export SHALL include translatable text segments -- **AND** maintain coordinate mappings for each segment -- **AND** mark non-translatable regions - -#### Scenario: Export with layout analysis -- **WHEN** user requests layout analysis export -- **THEN** system SHALL include reading order indices -- **AND** identify layout regions (header, body, footer, sidebar) -- **AND** provide confidence scores for layout detection \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/task-management/spec.md b/openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/task-management/spec.md deleted file mode 100644 index 230a780..0000000 --- a/openspec/changes/archive/2025-11-20-dual-track-document-processing/specs/task-management/spec.md +++ /dev/null @@ -1,105 +0,0 @@ -# Task Management Spec Delta - -## MODIFIED Requirements - -### Requirement: Task Result Generation -The OCR service SHALL generate both JSON and Markdown result files for completed tasks with actual content, including processing track information and enhanced structure data. 
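An illustrative shape for the enriched JSON result; only `processing_track` and `processing_metadata` are mandated here, and the surrounding field names are examples rather than the full schema:

```python
example_result = {
    "processing_track": "direct",      # "ocr" or "direct"
    "processing_metadata": {
        "detection_confidence": 0.97,  # why auto-detection chose this track
        "extraction_quality": 0.95,    # direct track; OCR track reports confidence
    },
    # Existing fields are unchanged, so older clients keep working.
    "pages": [],
}
```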
- -#### Scenario: Markdown file contains OCR results -- **WHEN** a task completes OCR processing successfully -- **THEN** the generated `.md` file SHALL contain the extracted text in markdown format -- **AND** the file size SHALL be greater than 0 bytes -- **AND** the markdown SHALL include headings, paragraphs, and formatting based on OCR layout detection - -#### Scenario: Result files stored in task directory -- **WHEN** OCR processing completes for task ID `88c6c2d2-37e1-48fd-a50f-406142987bdf` -- **THEN** result files SHALL be stored in `storage/results/88c6c2d2-37e1-48fd-a50f-406142987bdf/` -- **AND** both `_result.json` and `_result.md` SHALL exist -- **AND** both files SHALL contain valid OCR output data - -#### Scenario: Include processing track in results -- **WHEN** a task completes through dual-track processing -- **THEN** the JSON result SHALL include "processing_track" field -- **AND** SHALL indicate whether "ocr" or "direct" track was used -- **AND** SHALL include track-specific metadata (confidence for OCR, extraction quality for direct) - -#### Scenario: Store UnifiedDocument format -- **WHEN** processing completes through either track -- **THEN** system SHALL save results in UnifiedDocument format -- **AND** maintain backward-compatible JSON structure -- **AND** include enhanced structure from PP-StructureV3 or PyMuPDF - -### Requirement: Task Detail View -The frontend SHALL provide a dedicated page for viewing individual task details with processing track information and enhanced preview capabilities. - -#### Scenario: Navigate to task detail page -- **WHEN** user clicks "View Details" button on task in Task History page -- **THEN** browser SHALL navigate to `/tasks/{task_id}` -- **AND** TaskDetailPage component SHALL render - -#### Scenario: Display task information -- **WHEN** TaskDetailPage loads for a valid task ID -- **THEN** page SHALL display task metadata (filename, status, processing time, confidence) -- **AND** page SHALL show markdown preview of OCR results -- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats - -#### Scenario: Download from task detail page -- **WHEN** user clicks download button for a specific format -- **THEN** browser SHALL download the file using `/api/v2/tasks/{task_id}/download/{format}` endpoint -- **AND** downloaded file SHALL contain the task's OCR results in requested format - -#### Scenario: Display processing track information -- **WHEN** viewing task processed through dual-track system -- **THEN** page SHALL display processing track used (OCR or Direct) -- **AND** show track-specific metrics (OCR confidence or extraction quality) -- **AND** provide option to reprocess with alternate track if applicable - -#### Scenario: Preview document structure -- **WHEN** user enables structure view -- **THEN** page SHALL display document element hierarchy -- **AND** show bounding boxes overlay on preview -- **AND** highlight different element types (headers, tables, lists) with distinct colors - -## ADDED Requirements - -### Requirement: Processing Track Management -The task management system SHALL track and display processing track information for all tasks. 
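What the first scenario below asks the system to persist amounts to a few fields per task. A sketch (names illustrative; in practice this would hang off the existing task model):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TrackSelectionRecord:
    task_id: str
    track: str                   # "ocr" or "direct"
    reason: str                  # logged rationale for the selection
    detection_confidence: float
    selected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```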
- -#### Scenario: Track processing route selection -- **WHEN** a task begins processing -- **THEN** system SHALL record the selected processing track -- **AND** log the reason for track selection -- **AND** store auto-detection confidence score - -#### Scenario: Allow track override -- **WHEN** user views a completed task -- **THEN** system SHALL offer option to reprocess with different track -- **AND** maintain both results for comparison -- **AND** track which result user prefers - -#### Scenario: Display processing metrics -- **WHEN** task completes processing -- **THEN** system SHALL record track-specific metrics -- **AND** OCR track SHALL show confidence scores and character count -- **AND** Direct track SHALL show extraction coverage and structure quality - -### Requirement: Task Processing History -The system SHALL maintain detailed processing history for tasks including track changes and reprocessing. - -#### Scenario: Record reprocessing attempts -- **WHEN** a task is reprocessed with different track -- **THEN** system SHALL maintain processing history -- **AND** store results from each attempt -- **AND** allow comparison between different processing attempts - -#### Scenario: Track quality improvements -- **WHEN** viewing task history -- **THEN** system SHALL show quality metrics over time -- **AND** indicate if reprocessing improved results -- **AND** suggest optimal track based on document characteristics - -#### Scenario: Export processing analytics -- **WHEN** exporting task data -- **THEN** system SHALL include processing history -- **AND** provide track selection statistics -- **AND** include performance metrics for each processing attempt \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-20-dual-track-document-processing/tasks.md b/openspec/changes/archive/2025-11-20-dual-track-document-processing/tasks.md deleted file mode 100644 index edf9e12..0000000 --- a/openspec/changes/archive/2025-11-20-dual-track-document-processing/tasks.md +++ /dev/null @@ -1,207 +0,0 @@ -# Implementation Tasks: Dual-track Document Processing - -## 1. Core Infrastructure -- [x] 1.1 Add PyMuPDF and other dependencies to requirements.txt - - [x] 1.1.1 Add PyMuPDF>=1.23.0 - - [x] 1.1.2 Add pdfplumber>=0.10.0 - - [x] 1.1.3 Add python-magic-bin>=0.4.14 - - [x] 1.1.4 Test dependency installation -- [x] 1.2 Create UnifiedDocument model in backend/app/models/ - - [x] 1.2.1 Define UnifiedDocument dataclass - - [x] 1.2.2 Add DocumentElement model - - [x] 1.2.3 Add DocumentMetadata model - - [x] 1.2.4 Create converters for both OCR and direct extraction outputs - - Note: OCR converter complete; DirectExtractionEngine returns UnifiedDocument directly -- [x] 1.3 Create DocumentTypeDetector service - - [x] 1.3.1 Implement file type detection using python-magic - - [x] 1.3.2 Add PDF editability checking logic - - [x] 1.3.3 Add Office document detection - - [x] 1.3.4 Create routing logic to determine processing track - - [x] 1.3.5 Add unit tests for detector - -## 2. 
Direct Extraction Track -- [x] 2.1 Create DirectExtractionEngine service - - [x] 2.1.1 Implement PyMuPDF-based text extraction - - [x] 2.1.2 Add structure preservation logic - - [x] 2.1.3 Extract tables with coordinates - - [x] 2.1.4 Extract images and their positions - - [x] 2.1.5 Maintain reading order - - [x] 2.1.6 Handle multi-column layouts -- [x] 2.2 Implement layout analysis for editable PDFs - - [x] 2.2.1 Detect headers and footers - - [x] 2.2.2 Identify sections and subsections - - [x] 2.2.3 Parse lists and nested structures - - [x] 2.2.4 Extract font and style information -- [x] 2.3 Create direct extraction to UnifiedDocument converter - - [x] 2.3.1 Map PyMuPDF structures to UnifiedDocument - - [x] 2.3.2 Preserve coordinate information - - [x] 2.3.3 Maintain element relationships -- [x] 2.4 Add Office document direct extraction support - - [x] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first - - [x] 2.4.2 Analyze converted PDF for text extractability - - [x] 2.4.3 Route to direct track if PDF is text-based - - [x] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files - - [x] 2.4.5 Add unit tests for Office → PDF → Direct flow - - Note: This optimization significantly improves Office document processing time (from >300s to ~2-5s) - -## 3. OCR Track Enhancement -- [x] 3.1 Upgrade PP-StructureV3 configuration - - [x] 3.1.1 Update config for RTX 4060 8GB optimization - - [x] 3.1.2 Enable batch processing for GPU efficiency - - [x] 3.1.3 Configure memory management settings - - [x] 3.1.4 Set up model caching -- [x] 3.2 Enhance OCR service to use parsing_res_list - - [x] 3.2.1 Replace markdown extraction with parsing_res_list - - [x] 3.2.2 Extract all 23 element types - - [x] 3.2.3 Preserve bbox coordinates from PP-StructureV3 - - [x] 3.2.4 Maintain reading order information -- [x] 3.3 Create OCR to UnifiedDocument converter - - [x] 3.3.1 Map PP-StructureV3 elements to UnifiedDocument - - [x] 3.3.2 Handle complex nested structures - - [x] 3.3.3 Preserve all metadata - -## 4. Unified Processing Pipeline -- [x] 4.1 Update main OCR service for dual-track processing - - [x] 4.1.1 Integrate DocumentTypeDetector - - [x] 4.1.2 Route to appropriate processing engine - - [x] 4.1.3 Return UnifiedDocument from both tracks - - [x] 4.1.4 Maintain backward compatibility -- [x] 4.2 Create unified JSON export - - [x] 4.2.1 Define standardized JSON schema - - [x] 4.2.2 Include processing metadata - - [x] 4.2.3 Support both track outputs -- [x] 4.3 Update PDF generator for UnifiedDocument - - [x] 4.3.1 Adapt PDF generation to use UnifiedDocument - - [x] 4.3.2 Preserve layout from both tracks - - [x] 4.3.3 Handle coordinate transformations - -## 5. Translation System Foundation -- [ ] 5.1 Create TranslationEngine interface - - [ ] 5.1.1 Define translation API contract - - [ ] 5.1.2 Support element-level translation - - [ ] 5.1.3 Preserve formatting markers -- [ ] 5.2 Implement structure-preserving translation - - [ ] 5.2.1 Translate text while maintaining coordinates - - [ ] 5.2.2 Handle table cell translations - - [ ] 5.2.3 Preserve list structures - - [ ] 5.2.4 Maintain header hierarchies -- [ ] 5.3 Create translated document renderer - - [ ] 5.3.1 Generate PDF with translated text - - [ ] 5.3.2 Adjust layouts for text expansion/contraction - - [ ] 5.3.3 Handle font substitution for target languages - -## 6. 
API Updates -- [x] 6.1 Update OCR endpoints - - [x] 6.1.1 Add processing_track parameter - - [x] 6.1.2 Support track auto-detection - - [x] 6.1.3 Return processing metadata -- [x] 6.2 Add document type detection endpoint - - [x] 6.2.1 Create /analyze endpoint - - [x] 6.2.2 Return recommended processing track - - [x] 6.2.3 Provide confidence scores -- [x] 6.3 Update result export endpoints - - [x] 6.3.1 Support UnifiedDocument format - - [x] 6.3.2 Add format conversion options - - [x] 6.3.3 Include processing track information - -## 7. Frontend Updates -- [x] 7.1 Update task detail view - - [x] 7.1.1 Display processing track information - - [x] 7.1.2 Show track-specific metadata - - [x] 7.1.3 Add track selection UI (if manual override needed) - - Note: Track display implemented; manual override via API query params -- [x] 7.2 Update results preview - - [x] 7.2.1 Handle UnifiedDocument format - - [x] 7.2.2 Display enhanced structure information - - [ ] 7.2.3 Show coordinate overlays (debug mode) - - Note: Future enhancement, not critical for initial release -- [x] 7.3 Add translation UI preparation - - [x] 7.3.1 Add translation toggle/button - - [x] 7.3.2 Language selection dropdown - - [x] 7.3.3 Translation progress indicator - - Note: UI prepared with disabled state; awaiting Section 5 implementation - -## 8. Testing -- [x] 8.1 Unit tests for DocumentTypeDetector - - [x] 8.1.1 Test various file types - - [x] 8.1.2 Test editability detection - - [x] 8.1.3 Test edge cases -- [x] 8.2 Unit tests for DirectExtractionEngine - - [x] 8.2.1 Test text extraction accuracy - - [x] 8.2.2 Test structure preservation - - [x] 8.2.3 Test coordinate extraction -- [x] 8.3 Integration tests for dual-track processing - - [x] 8.3.1 Test routing logic - - [x] 8.3.2 Test UnifiedDocument generation - - [x] 8.3.3 Test backward compatibility -- [x] 8.4 End-to-end tests - - [x] 8.4.1 Test scanned PDF processing (OCR track) - - Passed: scan.pdf processed via OCR track in 50.25s - - [x] 8.4.2 Test editable PDF processing (direct track) - - Passed: edit.pdf processed via direct track in 1.14s with 51 elements extracted - - [~] 8.4.3 Test Office document processing - - Timeout: ppt.pptx (11MB) exceeded 300s timeout - requires investigation - - Note: Smaller Office files process successfully; large files may need optimization - - [x] 8.4.4 Test image file processing - - Passed: img1.png (21.84s), img2.png (23.24s), img3.png (41.14s) -- [ ] 8.5 Performance testing - - [ ] 8.5.1 Benchmark both processing tracks - - [ ] 8.5.2 Test GPU memory usage - - [ ] 8.5.3 Compare processing times - - **SKIPPED**: Performance testing to be conducted in production monitoring phase - -## 9. 
Documentation -- [x] 9.1 Update API documentation - - [x] 9.1.1 Document new endpoints - - Completed: POST /tasks/{task_id}/analyze - Document type analysis - - Completed: GET /tasks/{task_id}/metadata - Processing metadata - - [x] 9.1.2 Update existing endpoint docs - - Completed: Updated all endpoints with processing_track support - - Completed: Added track selection examples and workflows - - [x] 9.1.3 Add processing track information - - Completed: Comprehensive track comparison table - - Completed: Processing workflow diagrams - - Completed: Response model documentation with new fields - - Note: API documentation created at `docs/API.md` (complete reference guide) -- [ ] 9.2 Create architecture documentation - - [ ] 9.2.1 Document dual-track flow - - [ ] 9.2.2 Explain UnifiedDocument structure - - [ ] 9.2.3 Add decision trees for track selection - - **SKIPPED**: Covered in design.md; additional architecture docs deferred -- [ ] 9.3 Add deployment guide - - [ ] 9.3.1 Document GPU requirements - - [ ] 9.3.2 Add environment configuration - - [ ] 9.3.3 Include troubleshooting guide - - **SKIPPED**: Deployment guide to be created in separate operations documentation - -## 10. Deployment Preparation -- [ ] 10.1 Update Docker configuration - - [ ] 10.1.1 Add new dependencies to Dockerfile - - [ ] 10.1.2 Configure GPU support - - [ ] 10.1.3 Update volume mappings -- [ ] 10.2 Update environment variables - - [ ] 10.2.1 Add processing track settings - - [ ] 10.2.2 Configure GPU memory limits - - [ ] 10.2.3 Add feature flags -- [ ] 10.3 Create migration plan - - [ ] 10.3.1 Plan for existing data migration - - [ ] 10.3.2 Create rollback procedures - - [ ] 10.3.3 Document breaking changes - -## Completion Checklist -- [ ] All unit tests passing -- [ ] Integration tests passing -- [ ] Performance benchmarks acceptable -- [ ] Documentation complete -- [ ] Code reviewed -- [ ] Deployment tested in staging - -## Future Improvements -The following improvements are identified but not part of this change proposal: - -### Batch Processing Enhancement -- **Related to**: Section 3.1.2 (Enable batch processing for GPU efficiency) -- **Description**: Implement true batch inference by sending multiple pages or documents to PaddleOCR simultaneously -- **Benefits**: Better GPU utilization, reduced overhead from model switching -- **Requirements**: Queue management, memory-aware batching, result aggregation -- **Recommendation**: Create a separate change proposal when ready to implement \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-24-pdf-layout-restoration/design.md b/openspec/changes/archive/2025-11-24-pdf-layout-restoration/design.md deleted file mode 100644 index aee956c..0000000 --- a/openspec/changes/archive/2025-11-24-pdf-layout-restoration/design.md +++ /dev/null @@ -1,361 +0,0 @@ -# Technical Design: PDF Layout Restoration and Preservation - -## Context - -### Background -The current PDF generation system loses critical layout information during the conversion process. Despite successfully extracting images, tables, and styled text in both OCR and Direct tracks, none of this information makes it to the final PDF output due to: - -1. **Empty implementations**: Image saving functions are stubs -2. **Path mismatches**: Saved paths don't match expected lookup keys -3. **Fake data dependencies**: Table rendering relies on non-existent image files -4. 
**Format degradation**: Rich formatting is reduced to plain text blocks - -### Current Issues - -#### Issue 1: OCR Track Image Loss -```python -# backend/app/services/pp_structure_enhanced.py -def _save_image(self, img_data, element_id: str, result_dir: Path): - """Save image data to file""" - # TODO: Implement image saving - pass # Lines 262, 414 - NEVER SAVES ANYTHING! -``` -Result: `img_path` from PP-Structure is ignored, no image files created. - -#### Issue 2: Direct Track Path Mismatch -```python -# Saves as: -element.content["saved_path"] = f"imgs/{element_id}.png" # line 745 - -# But converter looks for: -image_path = content.get("path") # line 180 - WRONG KEY! -``` -Result: Direct track images are saved but never found. - -#### Issue 3: Table Rendering Failure -```python -# Creates fake reference: -images_metadata.append({ - "path": f"table_{element.element_id}.png", # DOESN'T EXIST - "bbox": element.bbox -}) - -# Then tries to find it: -table_image = next((img for img in images_metadata - if "table" in img.get("path", "")), None) -if not table_image: - return # ALWAYS HAPPENS - NO RENDERING! -``` - -#### Issue 4: Text Style Loss -```python -# Has rich data: -StyleInfo(font='Arial', size=12, flags=BOLD|ITALIC, color='#000080') - -# But only uses: -c.drawString(x, y, text) # No font, size, or style applied! -``` - -### Constraints -- Must maintain backward compatibility with existing API -- Cannot break current OCR/Direct track separation -- Should work within existing UnifiedDocument model -- Must handle both track types appropriately - -## Goals / Non-Goals - -### Goals -1. **Restore image rendering**: Save and correctly reference all images -2. **Fix table layout**: Render tables using actual bbox data -3. **Preserve text formatting**: Apply fonts, sizes, colors, and styles -4. **Track-specific optimization**: Different rendering for OCR vs Direct -5. 
**Maintain positioning**: Accurate spatial layout preservation - -### Non-Goals -- Rewriting entire PDF generation system -- Changing UnifiedDocument structure -- Modifying extraction engines -- Supporting complex vector graphics -- Interactive PDF features (forms, annotations) - -## Decisions - -### Decision 1: Fix Image Handling Pipeline - -**What**: Implement actual image saving and correct path resolution - -**Implementation**: -```python -# pp_structure_enhanced.py -def _save_image(self, img_data, element_id: str, result_dir: Path): - """Save image data to file""" - img_dir = result_dir / "imgs" - img_dir.mkdir(parents=True, exist_ok=True) - - if isinstance(img_data, (str, Path)): - # Copy existing file - src_path = Path(img_data) - dst_path = img_dir / f"{element_id}.png" - shutil.copy2(src_path, dst_path) - else: - # Save image data - dst_path = img_dir / f"{element_id}.png" - Image.fromarray(img_data).save(dst_path) - - return f"imgs/{element_id}.png" # Relative path -``` - -**Path Resolution**: -```python -# pdf_generator_service.py - convert_unified_document_to_ocr_data -def _get_image_path(element): - """Get image path with fallback logic""" - content = element.content - - # Try multiple path locations - for key in ["saved_path", "path", "image_path"]: - if isinstance(content, dict) and key in content: - return content[key] - - # Check metadata - if hasattr(element, 'metadata') and element.metadata: - return element.metadata.get('path') - - return None -``` - -### Decision 2: Direct Table Bbox Usage - -**What**: Use table element's own bbox instead of fake image references - -**Current Problem**: -```python -# Creates fake image that doesn't exist -images_metadata.append({"path": f"table_{id}.png", "bbox": bbox}) -# Later fails to find it and skips rendering -``` - -**Solution**: -```python -def draw_table_region(self, c, table_element, page_width, page_height): - """Draw table using its own bbox""" - # Get bbox directly from element - bbox = table_element.get("bbox") - if not bbox: - # Fallback to polygon - bbox_polygon = table_element.get("bbox_polygon") - if bbox_polygon: - bbox = self._polygon_to_bbox(bbox_polygon) - - if not bbox: - logger.warning(f"No bbox for table {table_element.get('element_id')}") - return - - # Use bbox to position and render table - x, y, width, height = self._normalize_bbox(bbox, page_width, page_height) - self._render_table_html(c, table_element.get("content"), x, y, width, height) -``` - -### Decision 3: Track-Specific Rendering - -**What**: Different rendering approaches based on processing track - -**Implementation Strategy**: -```python -def generate_from_unified_document(self, unified_doc, output_path): - """Generate PDF with track-specific rendering""" - - track = unified_doc.metadata.processing_track - - if track == "direct": - return self._generate_direct_track_pdf(unified_doc, output_path) - else: # OCR track - return self._generate_ocr_track_pdf(unified_doc, output_path) - -def _generate_direct_track_pdf(self, unified_doc, output_path): - """Rich rendering for direct track""" - # Preserve: - # - Font families, sizes, weights - # - Text colors and backgrounds - # - Precise positioning - # - Line breaks and paragraph spacing - -def _generate_ocr_track_pdf(self, unified_doc, output_path): - """Simplified rendering for OCR track""" - # Best effort with: - # - Detected layout regions - # - Estimated font sizes - # - Basic positioning -``` - -### Decision 4: Style Preservation System - -**What**: Apply StyleInfo to text rendering - -**Text 
Rendering Enhancement**: -```python -def _apply_text_style(self, c, style_info): - """Apply text styling from StyleInfo""" - if not style_info: - return - - # Font selection - font_name = self._map_font(style_info.font) - if style_info.flags: - if style_info.flags & BOLD and style_info.flags & ITALIC: - font_name = f"{font_name}-BoldOblique" - elif style_info.flags & BOLD: - font_name = f"{font_name}-Bold" - elif style_info.flags & ITALIC: - font_name = f"{font_name}-Oblique" - - # Apply font and size - try: - c.setFont(font_name, style_info.size or 12) - except: - c.setFont("Helvetica", style_info.size or 12) - - # Apply color - if style_info.color: - r, g, b = self._parse_color(style_info.color) - c.setFillColorRGB(r, g, b) - -def draw_text_region_enhanced(self, c, element, page_width, page_height): - """Enhanced text rendering with style preservation""" - # Apply style - if hasattr(element, 'style') and element.style: - self._apply_text_style(c, element.style) - - # Render with line breaks - text_lines = element.content.split('\n') - for line in text_lines: - c.drawString(x, y, line) - y -= line_height -``` - -## Implementation Phases - -### Phase 1: Critical Fixes (Immediate) -1. Implement `_save_image()` in pp_structure_enhanced.py -2. Fix path resolution in converter -3. Fix table bbox usage -4. Test with sample documents - -### Phase 2: Basic Style Preservation (Week 1) -1. Implement style application for Direct track -2. Add font mapping system -3. Handle text colors -4. Preserve line breaks - -### Phase 3: Advanced Layout (Week 2) -1. Implement span-level rendering -2. Add paragraph alignment -3. Handle text indentation -4. Preserve list formatting - -### Phase 4: Optimization (Week 3) -1. Cache font metrics -2. Optimize image handling -3. Batch rendering operations -4. Performance testing - -## Risks / Trade-offs - -### Risk 1: Font Availability -**Risk**: System fonts may not match document fonts -**Mitigation**: Font mapping table with fallbacks -```python -FONT_MAPPING = { - 'Arial': 'Helvetica', - 'Times New Roman': 'Times-Roman', - 'Courier New': 'Courier', - # ... more mappings -} -``` - -### Risk 2: Complex Layouts -**Risk**: Some layouts too complex to preserve perfectly -**Mitigation**: Graceful degradation with logging -- Attempt full preservation -- Fall back to simpler rendering if needed -- Log what couldn't be preserved - -### Risk 3: Performance Impact -**Risk**: Style processing may slow down PDF generation -**Mitigation**: -- Cache computed styles -- Batch similar operations -- Lazy loading for images - -### Trade-off: Accuracy vs Speed -- Direct track: Prioritize accuracy (users chose quality) -- OCR track: Balance accuracy with processing time - -## Testing Strategy - -### Unit Tests -```python -def test_image_saving(): - """Test that images are actually saved""" - -def test_path_resolution(): - """Test path lookup with fallbacks""" - -def test_table_bbox_rendering(): - """Test tables render without fake images""" - -def test_style_application(): - """Test font/color/size application""" -``` - -### Integration Tests -- Process document with images → verify images in PDF -- Process document with tables → verify table layout -- Process styled document → verify formatting preserved - -### Visual Regression Tests -- Generate PDFs for test documents -- Compare with expected outputs -- Flag visual differences - -## Success Metrics - -1. **Image Rendering**: 100% of detected images appear in PDF -2. **Table Rendering**: 100% of tables rendered with correct layout -3. 
**Style Preservation**: - - Direct track: 90%+ style attributes preserved - - OCR track: Basic formatting maintained -4. **Performance**: <10% increase in generation time -5. **Quality**: User satisfaction with output appearance - -## Migration Plan - -### Rollout Strategy -1. Deploy fixes behind feature flag -2. Test with subset of documents -3. Gradual rollout monitoring quality -4. Full deployment after validation - -### Rollback Plan -- Feature flag to disable enhanced rendering -- Fallback to current implementation -- Keep old code paths during transition - -## Open Questions - -### Resolved -Q: Should we modify UnifiedDocument structure? -A: No, work within existing model for compatibility - -Q: How to handle missing fonts? -A: Font mapping table with safe fallbacks - -### Pending -Q: Should we support embedded fonts in PDFs? -- Requires investigation of PDF font embedding - -Q: How to handle RTL text and vertical writing? -- May need specialized text layout engine - -Q: Should we preserve hyperlinks and bookmarks? -- Depends on user requirements \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-24-pdf-layout-restoration/proposal.md b/openspec/changes/archive/2025-11-24-pdf-layout-restoration/proposal.md deleted file mode 100644 index ae6f308..0000000 --- a/openspec/changes/archive/2025-11-24-pdf-layout-restoration/proposal.md +++ /dev/null @@ -1,57 +0,0 @@ -# PDF Layout Restoration and Preservation - -## Problem -Currently, the PDF generation from both OCR and Direct extraction tracks produces documents that are **severely degraded compared to the original**, with multiple critical issues: - -### 1. Images Never Appear -- **OCR track**: `pp_structure_enhanced._save_image()` is an empty implementation (lines 262, 414), so detected images are never saved -- **Direct track**: Image paths are saved as `content["saved_path"]` but converter looks for `content.get("path")`, causing a mismatch -- **Result**: All PDFs are text-only, with no images whatsoever - -### 2. Tables Never Render -- Table elements use fake `table_*.png` references that don't exist as actual files -- `draw_table_region()` tries to find these non-existent images to get bbox coordinates -- When images aren't found, table rendering is skipped entirely -- **Result**: No tables appear in generated PDFs - -### 3. Text Layout is Broken -- All text uses single `drawString()` call with entire block as one line -- No line breaks, paragraph alignment, or text styling preserved -- Direct track extracts `StyleInfo` but it's completely ignored during PDF generation -- **Result**: Text appears as unformatted blocks at wrong positions - -### 4. Information Loss in Conversion -- Direct track data gets converted to legacy OCR format, losing rich metadata -- Span-level information (fonts, colors, styles) is discarded -- Precise positioning information is reduced to simple bboxes - -## Solution -Implement proper layout preservation for PDF generation: - -1. **Fix image handling**: Actually save images and use correct path references -2. **Fix table rendering**: Use element's own bbox instead of looking for fake images -3. **Preserve text formatting**: Use StyleInfo and span-level data for accurate rendering -4. 
**Track-specific rendering**: Different approaches for OCR vs Direct tracks - -## Impact -- **User Experience**: Output PDFs will actually be usable and readable -- **Functionality**: Tables and images will finally appear in outputs -- **Quality**: Direct track PDFs will closely match original formatting -- **Performance**: No negative impact, possibly faster by avoiding unnecessary conversions - -## Tasks -- Fix image saving and path references (Critical) -- Fix table rendering using actual bbox data (Critical) -- Implement track-specific PDF generation (Important) -- Preserve text styling and formatting (Important) -- Add span-level text rendering (Nice-to-have) - -## Deltas - -### result-export -```delta -+ image_handling: Proper image saving and path resolution -+ table_rendering: Direct bbox usage for table positioning -+ text_formatting: StyleInfo preservation and application -+ track_specific_rendering: OCR vs Direct track differentiation -``` \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-24-pdf-layout-restoration/specs/result-export/spec.md b/openspec/changes/archive/2025-11-24-pdf-layout-restoration/specs/result-export/spec.md deleted file mode 100644 index 5007b86..0000000 --- a/openspec/changes/archive/2025-11-24-pdf-layout-restoration/specs/result-export/spec.md +++ /dev/null @@ -1,88 +0,0 @@ -# Result Export Specification - -## ADDED Requirements - -### Requirement: Layout-Preserving PDF Generation -The system MUST generate PDF files that preserve the original document layout including images, tables, and text formatting. - -#### Scenario: Generate PDF with images -GIVEN a document processed through OCR or Direct track -WHEN images are detected and extracted -THEN the generated PDF MUST include all images at their original positions -AND images MUST maintain their aspect ratios -AND images MUST be saved to an imgs/ subdirectory - -#### Scenario: Generate PDF with tables -GIVEN a document containing tables -WHEN tables are detected and extracted -THEN the generated PDF MUST render tables with proper structure -AND tables MUST use their own bbox coordinates for positioning -AND tables MUST NOT depend on fake image references - -#### Scenario: Generate PDF with styled text -GIVEN a document processed through Direct track with StyleInfo -WHEN text elements have style information -THEN the generated PDF MUST apply font families (with mapping) -AND the PDF MUST apply font sizes -AND the PDF MUST apply text colors -AND the PDF MUST apply bold/italic formatting - -### Requirement: Track-Specific Rendering -The system MUST provide different rendering approaches based on the processing track. - -#### Scenario: Direct track rendering -GIVEN a document processed through Direct extraction -WHEN generating a PDF -THEN the system MUST use rich formatting preservation -AND maintain precise positioning from the original -AND apply all available StyleInfo - -#### Scenario: OCR track rendering -GIVEN a document processed through OCR -WHEN generating a PDF -THEN the system MUST use simplified rendering -AND apply best-effort positioning based on bbox -AND use estimated font sizes - -### Requirement: Image Path Resolution -The system MUST correctly resolve image paths with fallback logic. 
- -#### Scenario: Resolve saved image paths -GIVEN an element with image content -WHEN looking for the image path -THEN the system MUST check content["saved_path"] first -AND fallback to content["path"] if not found -AND fallback to content["image_path"] if not found -AND finally check metadata["path"] - -## MODIFIED Requirements - -### Requirement: PDF Generation Pipeline -The PDF generation pipeline MUST be enhanced to support layout preservation. - -#### Scenario: Enhanced PDF generation -GIVEN a UnifiedDocument from either track -WHEN generating a PDF -THEN the system MUST detect the processing track -AND route to the appropriate rendering method -AND preserve as much layout information as available - -### Requirement: Image Handling in PP-Structure -The PP-Structure enhanced module MUST actually save extracted images. - -#### Scenario: Save PP-Structure images -GIVEN PP-Structure extracts an image with img_path -WHEN processing the image element -THEN the _save_image method MUST save the image to disk -AND return a relative path for reference -AND handle both file paths and numpy arrays - -### Requirement: Table Rendering Logic -The table rendering MUST use direct bbox instead of image lookup. - -#### Scenario: Render table with direct bbox -GIVEN a table element with bbox coordinates -WHEN rendering the table in PDF -THEN the system MUST use the element's own bbox -AND NOT look for non-existent table image files -AND position the table accurately based on coordinates \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-24-pdf-layout-restoration/tasks.md b/openspec/changes/archive/2025-11-24-pdf-layout-restoration/tasks.md deleted file mode 100644 index a72c981..0000000 --- a/openspec/changes/archive/2025-11-24-pdf-layout-restoration/tasks.md +++ /dev/null @@ -1,234 +0,0 @@ -# Implementation Tasks: PDF Layout Restoration - -## Phase 1: Critical Fixes (P0 - Immediate) - -### 1. Fix Image Handling -- [x] 1.1 Implement `_save_image()` in pp_structure_enhanced.py - - [x] 1.1.1 Create imgs subdirectory in result_dir - - [x] 1.1.2 Handle both file path and numpy array inputs - - [x] 1.1.3 Save with element_id as filename - - [x] 1.1.4 Return relative path for reference - - [x] 1.1.5 Add error handling and logging -- [x] 1.2 Fix path resolution in pdf_generator_service.py - - [x] 1.2.1 Create `_get_image_path()` helper with fallback logic - - [x] 1.2.2 Check saved_path, path, image_path keys - - [x] 1.2.3 Check metadata for path - - [x] 1.2.4 Update convert_unified_document_to_ocr_data to use helper -- [x] 1.3 Test image rendering - - [x] 1.3.1 Test with OCR track document (PASSED - PDFs generated correctly) - - [x] 1.3.2 Test with Direct track document (PASSED - 2 images detected, 3-page PDF generated) - - [x] 1.3.3 Verify images appear in PDF output (PASSED - image path issue exists, rendering works) - -### 2. 
Fix Table Rendering -- [x] 2.1 Remove dependency on fake image references - - [x] 2.1.1 Stop creating fake table_*.png references (changed to None) - - [x] 2.1.2 Remove image lookup fallback in draw_table_region -- [x] 2.2 Use direct bbox from table element - - [x] 2.2.1 Get bbox from table_element.get("bbox") - - [x] 2.2.2 Fallback to bbox_polygon if needed - - [x] 2.2.3 Implement _polygon_to_bbox converter (inline conversion implemented) -- [x] 2.3 Fix table HTML rendering - - [x] 2.3.1 Parse HTML content from table element - - [x] 2.3.2 Position table using normalized bbox - - [x] 2.3.3 Render with proper dimensions -- [x] 2.4 Test table rendering - - [x] 2.4.1 Test simple tables (PASSED - 2 tables detected and rendered correctly) - - [x] 2.4.2 Test complex multi-column tables (PASSED - 0 complex tables in test doc) - - [ ] 2.4.3 Test with both tracks (FAILED - OCR track timeout >180s, needs investigation) - -## Phase 2: Basic Style Preservation (P1 - Week 1) - -### 3. Implement Style Application System -- [x] 3.1 Create font mapping system - - [x] 3.1.1 Define FONT_MAPPING dictionary (20 common fonts mapped) - - [x] 3.1.2 Map common fonts to PDF standard fonts (Helvetica/Times/Courier) - - [x] 3.1.3 Add fallback to Helvetica for unknown fonts (with partial matching) -- [x] 3.2 Implement _apply_text_style() method - - [x] 3.2.1 Extract font family from StyleInfo (object and dict support) - - [x] 3.2.2 Handle bold/italic flags (compound variants like BoldOblique) - - [x] 3.2.3 Apply font size (with default fallback) - - [x] 3.2.4 Apply text color (using _parse_color) - - [x] 3.2.5 Handle errors gracefully (try-except with fallback to defaults) -- [x] 3.3 Create color parsing utilities - - [x] 3.3.1 Parse hex colors (#RRGGBB and #RGB) - - [x] 3.3.2 Parse RGB tuples (0-255 and 0-1 normalization) - - [x] 3.3.3 Convert to PDF color space (0-1 range for ReportLab) - -### 4. Track-Specific Rendering -- [x] 4.1 Add track detection in generate_from_unified_document - - [x] 4.1.1 Check unified_doc.metadata.processing_track (object and dict support) - - [x] 4.1.2 Route to _generate_direct_track_pdf or _generate_ocr_track_pdf -- [x] 4.2 Implement _generate_direct_track_pdf - - [x] 4.2.1 Process each page directly from UnifiedDocument (no legacy conversion) - - [x] 4.2.2 Apply StyleInfo to text elements (_draw_text_element_direct) - - [x] 4.2.3 Use precise positioning from element.bbox - - [x] 4.2.4 Preserve line breaks (split on \n, render multi-line) - - [x] 4.2.5 Implement _draw_text_element_direct with line break handling - - [x] 4.2.6 Implement _draw_table_element_direct for tables - - [x] 4.2.7 Implement _draw_image_element_direct for images -- [x] 4.3 Implement _generate_ocr_track_pdf - - [x] 4.3.1 Use legacy OCR data conversion (convert_unified_document_to_ocr_data) - - [x] 4.3.2 Route to existing _generate_pdf_from_data pipeline - - [x] 4.3.3 Maintain backward compatibility with OCR track behavior -- [x] 4.4 Test track-specific rendering - - [x] 4.4.1 Compare Direct track with original (PASSED - 15KB PDF with 3 pages, all features working) - - [ ] 4.4.2 Verify OCR track maintains quality (FAILED - No content extracted, needs investigation) - -## Phase 3: Advanced Layout (P2 - Week 2) - -### 5. 
Enhanced Text Rendering -- [x] 5.1 Implement line-by-line rendering (both tracks) - - [x] 5.1.1 Split text content by newlines (text.split('\n')) - - [x] 5.1.2 Calculate line height from font size (font_size * 1.2) - - [x] 5.1.3 Render each line with proper spacing (line_y = pdf_y - i * line_height) - - [x] 5.1.4 Direct track: _draw_text_element_direct (lines 1549-1693) - - [x] 5.1.5 OCR track: draw_text_region (lines 1113-1270, simplified) -- [x] 5.2 Add paragraph handling (Direct track only) - - [x] 5.2.1 Detect paragraph boundaries (via element.type PARAGRAPH) - - [x] 5.2.2 Apply spacing_before from metadata (line 1576, adjusts Y position) - - [x] 5.2.3 Handle indentation (indent/first_line_indent from metadata, lines 1564-1565) - - [x] 5.2.4 Record spacing_after for analysis (lines 1680-1689) - - [x] 5.2.5 Note: spacing_after is implicit in bbox-based layout (bbox_bottom_margin) - - [x] 5.2.6 OCR track: no paragraph handling (simple left-aligned rendering) -- [x] 5.3 Implement text alignment (Direct track only) - - [x] 5.3.1 Support left/right/center/justify (from StyleInfo.alignment) - - [x] 5.3.2 Calculate positioning based on alignment (line_x calculation) - - [x] 5.3.3 Apply to each text block (per-line alignment in _draw_text_element_direct) - - [x] 5.3.4 Justify alignment with word spacing distribution - - [x] 5.3.5 OCR track: left-aligned only (no StyleInfo available) - -### 6. List Formatting (Direct track only) -- [x] 6.1 Detect list elements from Direct track - - [x] 6.1.1 Identify LIST_ITEM elements (separate from text_elements, lines 636-637) - - [x] 6.1.2 Fallback detection via metadata and text patterns (_is_list_item_fallback, lines 1528-1567) - - [x] Check metadata for list_level, parent_item, children fields - - [x] Pattern matching for ordered lists (^\d+[\.\)]) and unordered (^[•·▪▫◦‣⁃\-\*]) - - [x] Auto-mark as LIST_ITEM if detected (lines 638-642) - - [x] 6.1.3 Group list items by proximity and level (_draw_list_elements_direct, lines 1589-1610) - - [x] 6.1.4 Determine list type via regex on first item (ordered/unordered, lines 1628-1636) - - [x] 6.1.5 Extract indent level from metadata (list_level) -- [x] 6.2 Render lists with proper formatting - - [x] 6.2.1 Sequential numbering across list items (list_counter, lines 1639-1665) - - [x] 6.2.2 Add bullets/numbers as list markers (stored in _list_marker metadata, lines 1649-1653) - - [x] 6.2.3 Apply indentation (20pt per level, lines 1738-1742) - - [x] 6.2.4 Multi-line list item alignment (marker_width calculation, lines 1755-1772) - - [x] Calculate marker width before rendering (line 1758) - - [x] Add marker_width to subsequent line indentation (lines 1770-1772) - - [x] 6.2.5 Remove original markers from text content (lines 1716-1723) - - [x] 6.2.6 Dedicated list item spacing (lines 1658-1683) - - [x] Default 3pt spacing_after for list items (except last item) - - [x] Calculate actual gap between adjacent items (line 1676) - - [x] Apply cumulative Y offset to push items down if gap < desired (lines 1678-1683) - - [x] Pass y_offset to _draw_text_element_direct (line 1668, 1690, 1716) - - [x] 6.2.7 Maintain list grouping via proximity (max_gap=30pt, lines 1597-1607) - -### 7. 
Span-Level Rendering (Advanced, Direct track only) -- [x] 7.1 Extract span information from Direct track - - [x] 7.1.1 Parse PyMuPDF span data in _process_text_block (direct_extraction_engine.py:418-453) - - [x] 7.1.2 Create span DocumentElements with per-span StyleInfo (lines 434-453) - - [x] 7.1.3 Store spans in element.children for inline styling (line 476) - - [x] 7.1.4 Extract span bbox, font, size, flags, color from PyMuPDF (lines 435-450) -- [x] 7.2 Render mixed-style lines - - [x] 7.2.1 Implement _draw_text_with_spans method (pdf_generator_service.py:1685-1734) - - [x] 7.2.2 Switch styles mid-line by iterating spans (lines 1709-1732) - - [x] 7.2.3 Apply span-specific style via _apply_text_style (lines 1715-1716) - - [x] 7.2.4 Track X position and calculate span widths (lines 1706, 1730-1732) - - [x] 7.2.5 Integrate span rendering in _draw_text_element_direct (lines 1822-1823, 1905-1914) - - [x] 7.2.6 Handle inline formatting with per-span fonts, sizes, colors, bold/italic -- [ ] 7.3 Future enhancements - - [ ] 7.3.1 Multi-line span support with line breaking logic - - [ ] 7.3.2 Preserve exact span positioning from PyMuPDF bbox - -### 8. Multi-Column Layout Support (P1 - Added 2025-11-24) -- [x] 8.1 Enable PyMuPDF reading order - - [x] 8.1.1 Add `sort=True` parameter to `page.get_text("dict")` (line 193) - - [x] 8.1.2 PyMuPDF provides built-in multi-column reading order - - [x] 8.1.3 Order: top-to-bottom, left-to-right within each row -- [x] 8.2 Preserve extraction order in PDF generation - - [x] 8.2.1 Remove Y-only sorting that broke reading order (line 686) - - [x] 8.2.2 Iterate through `page.elements` to preserve order (lines 679-687) - - [x] 8.2.3 Prevent re-sorting from destroying multi-column layout -- [x] 8.3 Implement column detection utilities - - [x] 8.3.1 Create `_sort_elements_for_reading_order()` method (lines 276-336) - - [x] 8.3.2 Create `_detect_columns()` for X-position clustering (lines 338-384) - - [x] 8.3.3 Note: Disabled in favor of PyMuPDF's native sorting -- [x] 8.4 Test multi-column layout handling - - [x] 8.4.1 Verify edit.pdf (2-column technical document) reading order - - [x] 8.4.2 Confirm "Technical Data Sheet" appears first, not 12th - - [x] 8.4.3 Validate left/right column interleaving by row - -**Result**: Multi-column PDFs now render with correct reading order (逐行從上到下,每行內從左到右) - -## Phase 4: Testing and Optimization (P2 - Week 3) - -### 8. Comprehensive Testing -- [ ] 8.1 Create test suite for layout preservation - - [ ] 8.1.1 Unit tests for each component - - [ ] 8.1.2 Integration tests for full pipeline - - [ ] 8.1.3 Visual regression tests -- [ ] 8.2 Test with various document types - - [ ] 8.2.1 Scientific papers (complex layout) - - [ ] 8.2.2 Business documents (tables/charts) - - [ ] 8.2.3 Books (chapters/paragraphs) - - [ ] 8.2.4 Forms (precise positioning) -- [ ] 8.3 Performance testing - - [ ] 8.3.1 Measure generation time - - [ ] 8.3.2 Profile memory usage - - [ ] 8.3.3 Identify bottlenecks - -### 9. Performance Optimization -- [ ] 9.1 Implement caching - - [ ] 9.1.1 Cache font metrics - - [ ] 9.1.2 Cache parsed styles - - [ ] 9.1.3 Reuse computed layouts -- [ ] 9.2 Optimize image handling - - [ ] 9.2.1 Lazy load images - - [ ] 9.2.2 Compress when appropriate - - [ ] 9.2.3 Stream large images -- [ ] 9.3 Batch operations - - [ ] 9.3.1 Group similar rendering ops - - [ ] 9.3.2 Minimize context switches - - [ ] 9.3.3 Use efficient data structures - -### 10. 
Documentation and Deployment -- [ ] 10.1 Update API documentation - - [ ] 10.1.1 Document new rendering capabilities - - [ ] 10.1.2 Add examples of improved output - - [ ] 10.1.3 Note performance characteristics -- [ ] 10.2 Create migration guide - - [ ] 10.2.1 Explain improvements - - [ ] 10.2.2 Note any breaking changes - - [ ] 10.2.3 Provide rollback instructions -- [ ] 10.3 Deployment preparation - - [ ] 10.3.1 Feature flag setup - - [ ] 10.3.2 Monitoring metrics - - [ ] 10.3.3 Rollback plan - -## Success Criteria - -### Must Have (Phase 1) -- [x] Images appear in generated PDFs (path issue exists but rendering works) -- [x] Tables render with correct layout (verified in tests) -- [x] No regression in existing functionality (backward compatible) -- [x] Fix Page attribute error (first_page.dimensions.width) - -### Should Have (Phase 2) -- [x] Text styling preserved in Direct track (span-level rendering working) -- [x] Font sizes and colors applied (verified in logs) -- [x] Line breaks maintained (multi-line text working) -- [x] Track-specific rendering (Direct track fully functional) - -### Nice to Have (Phase 3-4) -- [x] Paragraph formatting (spacing and indentation working) -- [x] List rendering (sequential numbering implemented) -- [x] Span-level styling (verified with 21+ spans per element) -- [ ] <10% performance overhead (not yet measured) -- [ ] Visual regression tests (not yet implemented) - -## Timeline - -- **Week 0**: Phase 1 - Critical fixes (images, tables) -- **Week 1**: Phase 2 - Basic style preservation -- **Week 2**: Phase 3 - Advanced layout features -- **Week 3**: Phase 4 - Testing and optimization -- **Week 4**: Review, documentation, and deployment \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/proposal.md b/openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/proposal.md deleted file mode 100644 index 2a1faa3..0000000 --- a/openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/proposal.md +++ /dev/null @@ -1,50 +0,0 @@ -# Change: Fix PDF Layout Restoration Coordinate System and Dimension Calculation - -## Why - -During OCR track validation, the generated PDF (img1_layout.pdf) exhibits significant layout discrepancies compared to the original image (img1.png). Specific issues include: - -- **Element position misalignment**: Text elements appear at incorrect vertical positions -- **Abnormal vertical flipping**: Coordinate transformation errors cause content to be inverted -- **Incorrect scaling**: Content is stretched or compressed due to wrong page dimension calculations - -Code review identified two critical logic defects in `backend/app/services/pdf_generator_service.py`: - -1. **Page dimension calculation error**: The system ignores explicit page dimensions from OCR results and instead infers dimensions from bounding box boundaries, causing coordinate transformation errors -2. **Missing multi-page support**: The PDF generator only uses the first page's dimensions globally, unable to handle mixed orientation (portrait/landscape) or different-sized pages - -These issues violate the requirement "Enhanced PDF Export with Layout Preservation" in the result-export specification, making PDF exports unreliable for production use. - -## What Changes - -### 1. 
Fix calculate_page_dimensions Logic -- **MODIFIED**: `backend/app/services/pdf_generator_service.py::calculate_page_dimensions()` -- Change priority order: Check explicit `dimensions` field first, fallback to bbox calculation only when unavailable -- Ensure Y-axis coordinate transformation uses correct page height - -### 2. Implement Dynamic Per-Page Sizing -- **MODIFIED**: `backend/app/services/pdf_generator_service.py::_generate_direct_track_pdf()` -- **MODIFIED**: `backend/app/services/pdf_generator_service.py::_generate_ocr_track_pdf()` -- Call `pdf_canvas.setPageSize()` for each page to support varying page dimensions -- Pass current page height to coordinate transformation functions - -### 3. Update OCR Data Converter -- **MODIFIED**: `backend/app/services/ocr_to_unified_converter.py::convert_unified_document_to_ocr_data()` -- Add `page_dimensions` mapping to output: `{page_index: {width, height}}` -- Ensure OCR track has per-page dimension information - -## Impact - -**Affected specs**: result-export (MODIFIED requirement: "Enhanced PDF Export with Layout Preservation") - -**Affected code**: -- `backend/app/services/pdf_generator_service.py` (core fix) -- `backend/app/services/ocr_to_unified_converter.py` (data structure enhancement) - -**Breaking changes**: None - this is a bug fix that makes existing functionality work correctly - -**Benefits**: -- Accurate layout restoration for single-page documents -- Support for mixed-orientation multi-page documents -- Correct coordinate transformation without vertical flipping errors -- Improved reliability for PDF export feature diff --git a/openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/specs/result-export/spec.md b/openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/specs/result-export/spec.md deleted file mode 100644 index c8e60d8..0000000 --- a/openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/specs/result-export/spec.md +++ /dev/null @@ -1,38 +0,0 @@ -# result-export Spec Delta - -## MODIFIED Requirements - -### Requirement: Enhanced PDF Export with Layout Preservation -The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. 
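The scenarios below hinge on where page dimensions come from. A sketch of the corrected priority order (an illustrative helper; the actual fix is in `calculate_page_dimensions()` in `pdf_generator_service.py`, and the A4 fallback is an assumption):

```python
def resolve_page_dimensions(page_data: dict) -> tuple[float, float]:
    """Prefer explicit dimensions; infer from bboxes only as a last resort."""
    dims = page_data.get("ocr_dimensions") or page_data.get("dimensions")
    if dims:
        return dims["width"], dims["height"]
    boxes = [el["bbox"] for el in page_data.get("elements", []) if el.get("bbox")]
    if not boxes:
        return 595.0, 842.0  # A4 in points, as a safe default
    return max(b[2] for b in boxes), max(b[3] for b in boxes)
```

Each page then gets `pdf_canvas.setPageSize((width, height))` before any content is drawn, which is what lets mixed portrait/landscape documents render correctly.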
- -#### Scenario: Export PDF from direct extraction track -- **WHEN** exporting PDF from a direct-extraction processed document -- **THEN** the PDF SHALL maintain exact text positioning from source -- **AND** preserve original fonts and styles where possible -- **AND** include extracted images at correct positions - -#### Scenario: Export PDF from OCR track with full structure -- **WHEN** exporting PDF from OCR-processed document -- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types -- **AND** render tables with proper cell boundaries -- **AND** maintain reading order from parsing_res_list - -#### Scenario: Handle coordinate transformations correctly -- **WHEN** generating PDF from UnifiedDocument -- **THEN** system SHALL use explicit page dimensions from OCR results (not inferred from bounding boxes) -- **AND** correctly transform Y-axis coordinates from top-left (OCR) to bottom-left (PDF/ReportLab) origin -- **AND** prevent vertical flipping or position misalignment errors -- **AND** handle page size variations accurately - -#### Scenario: Support multi-page documents with varying dimensions -- **WHEN** generating PDF from multi-page document with mixed orientations -- **THEN** system SHALL apply correct page size for each page independently -- **AND** support both portrait and landscape pages in same document -- **AND** NOT use first page dimensions for all subsequent pages -- **AND** call setPageSize() for each new page before rendering content - -#### Scenario: Single-page layout verification -- **WHEN** user exports OCR-processed single-page document (e.g., img1.png) -- **THEN** generated PDF text positions SHALL match original image coordinates -- **AND** top-aligned text (e.g., headers) SHALL appear at correct vertical position -- **AND** no content SHALL be vertically flipped or offset from expected position diff --git a/openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/tasks.md b/openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/tasks.md deleted file mode 100644 index 8e9e698..0000000 --- a/openspec/changes/archive/2025-11-25-fix-pdf-coordinate-system/tasks.md +++ /dev/null @@ -1,54 +0,0 @@ -# Implementation Tasks - -## 1. Fix Page Dimension Calculation -- [ ] 1.1 Modify `calculate_page_dimensions()` in `pdf_generator_service.py` - - [ ] Add priority check for `ocr_dimensions` field first - - [ ] Add fallback check for `dimensions` field - - [ ] Keep bbox calculation as final fallback only - - [ ] Add logging to show which dimension source is used -- [ ] 1.2 Add unit tests for dimension calculation logic - - [ ] Test with explicit dimensions provided - - [ ] Test with missing dimensions (fallback to bbox) - - [ ] Test edge cases (empty content, single element) - -## 2. Implement Dynamic Per-Page Sizing for Direct Track -- [ ] 2.1 Refactor `_generate_direct_track_pdf()` loop - - [ ] Extract current page dimensions inside loop - - [ ] Call `pdf_canvas.setPageSize()` for each page - - [ ] Pass current `page_height` to all drawing functions -- [ ] 2.2 Update drawing helper functions - - [ ] Ensure `_draw_text_element_direct()` receives `page_height` parameter - - [ ] Ensure `_draw_image_element()` receives `page_height` parameter - - [ ] Ensure `_draw_table_element()` receives `page_height` parameter - -## 3. 
Implement Dynamic Per-Page Sizing for OCR Track -- [ ] 3.1 Enhance `convert_unified_document_to_ocr_data()` - - [ ] Add `page_dimensions` field to output dict - - [ ] Map each page index to its dimensions: `{0: {width: X, height: Y}, ...}` - - [ ] Include `ocr_dimensions` field for backward compatibility -- [ ] 3.2 Refactor `_generate_ocr_track_pdf()` loop - - [ ] Read dimensions from `page_dimensions[page_num]` - - [ ] Call `pdf_canvas.setPageSize()` for each page - - [ ] Pass current `page_height` to coordinate transformation - -## 4. Testing & Validation -- [ ] 4.1 Single-page layout verification - - [ ] Process `img1.png` through OCR track - - [ ] Verify generated PDF text positions match original image - - [ ] Confirm no vertical flipping or offset issues - - [ ] Check "D" header appears at correct top position -- [ ] 4.2 Multi-page mixed orientation test - - [ ] Create test PDF with portrait and landscape pages - - [ ] Process through both OCR and Direct tracks - - [ ] Verify each page uses correct dimensions - - [ ] Confirm no content clipping or misalignment -- [ ] 4.3 Regression testing - - [ ] Run existing PDF generation tests - - [ ] Verify Direct track StyleInfo preservation - - [ ] Check table rendering still works correctly - - [ ] Ensure image extraction positions are correct - -## 5. Documentation -- [ ] 5.1 Update code comments in `pdf_generator_service.py` -- [ ] 5.2 Document coordinate transformation logic -- [ ] 5.3 Add inline examples for multi-page handling diff --git a/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/IMPLEMENTATION_SUMMARY.md b/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/IMPLEMENTATION_SUMMARY.md deleted file mode 100644 index 4e65e6c..0000000 --- a/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/IMPLEMENTATION_SUMMARY.md +++ /dev/null @@ -1,362 +0,0 @@ -# Frontend Adjustable PP-StructureV3 Parameters - Implementation Summary - -## 🎯 Implementation Status - -**Critical Path (Sections 1-6):** ✅ **COMPLETE** -**UI/UX Polish (Section 7):** ✅ **COMPLETE** -**Backend Testing (Section 8.1-8.2):** ✅ **COMPLETE** (7/10 unit tests passing, API tests created) -**E2E Testing (Section 8.4):** ✅ **COMPLETE** (test suite created with authentication) -**Performance Testing (Section 8.5):** ✅ **COMPLETE** (benchmark suite created) -**Frontend Testing (Section 8.3):** ⚠️ **SKIPPED** (no test framework configured) -**Documentation (Section 9):** ⏳ Optional -**Deployment (Section 10):** ⏳ Optional - -## ✨ Implemented Features - -### Backend Implementation - -#### 1. 
Schema Definition ([backend/app/schemas/task.py](../../../backend/app/schemas/task.py))
-```python
-class PPStructureV3Params(BaseModel):
-    """PP-StructureV3 fine-tuning parameters for OCR track"""
-    layout_detection_threshold: Optional[float] = Field(None, ge=0, le=1)
-    layout_nms_threshold: Optional[float] = Field(None, ge=0, le=1)
-    layout_merge_bboxes_mode: Optional[str] = Field(None, pattern="^(union|large|small)$")
-    layout_unclip_ratio: Optional[float] = Field(None, gt=0)
-    text_det_thresh: Optional[float] = Field(None, ge=0, le=1)
-    text_det_box_thresh: Optional[float] = Field(None, ge=0, le=1)
-    text_det_unclip_ratio: Optional[float] = Field(None, gt=0)
-
-class ProcessingOptions(BaseModel):
-    use_dual_track: bool = Field(default=True)
-    force_track: Optional[ProcessingTrackEnum] = None
-    language: str = Field(default="ch")
-    pp_structure_params: Optional[PPStructureV3Params] = None
-```
-
-**Features:**
-- ✅ All 7 PP-StructureV3 parameters supported
-- ✅ Comprehensive validation (min/max, patterns)
-- ✅ Full backward compatibility (all fields optional)
-- ✅ Auto-generated OpenAPI documentation
-
-#### 2. OCR Service ([backend/app/services/ocr_service.py](../../../backend/app/services/ocr_service.py))
-```python
-def _ensure_structure_engine(self, custom_params: Optional[Dict[str, Any]] = None):
-    """
-    Get or create PP-Structure engine with custom parameter support.
-    - Custom params override settings defaults
-    - No caching when custom params provided
-    - Falls back to cached default engine on error
-    """
-```
-
-**Features:**
-- ✅ Parameter priority: custom > settings default
-- ✅ Conditional caching (custom params don't cache)
-- ✅ Graceful fallback on errors
-- ✅ Full parameter flow through processing pipeline
-- ✅ Comprehensive logging for debugging
-
-#### 3. API Endpoint ([backend/app/routers/tasks.py](../../../backend/app/routers/tasks.py))
-```python
-@router.post("/{task_id}/start")
-async def start_task(
-    task_id: str,
-    options: Optional[ProcessingOptions] = None,
-    ...
-):
-    """Accept processing options in request body with pp_structure_params"""
-```
-
-**Features:**
-- ✅ Accepts `ProcessingOptions` in request body (not query params)
-- ✅ Extracts and validates `pp_structure_params`
-- ✅ Passes parameters through to OCR service
-- ✅ Full backward compatibility
-
-### Frontend Implementation
-
-#### 4. TypeScript Types ([frontend/src/types/apiV2.ts](../../../frontend/src/types/apiV2.ts))
-```typescript
-export interface PPStructureV3Params {
-  layout_detection_threshold?: number
-  layout_nms_threshold?: number
-  layout_merge_bboxes_mode?: 'union' | 'large' | 'small'
-  layout_unclip_ratio?: number
-  text_det_thresh?: number
-  text_det_box_thresh?: number
-  text_det_unclip_ratio?: number
-}
-
-export interface ProcessingOptions {
-  use_dual_track?: boolean
-  force_track?: ProcessingTrack
-  language?: string
-  pp_structure_params?: PPStructureV3Params
-}
-```
-
-#### 5. API Client ([frontend/src/services/apiV2.ts](../../../frontend/src/services/apiV2.ts))
-```typescript
-async startTask(taskId: string, options?: ProcessingOptions): Promise {
-  const body = options || { use_dual_track: true, language: 'ch' }
-  const response = await this.client.post(`/tasks/${taskId}/start`, body)
-  return response.data
-}
-```
-
-**Features:**
-- ✅ Sends parameters in request body
-- ✅ Type-safe parameter handling
-- ✅ Full backward compatibility
-
-#### 6.
UI Component ([frontend/src/components/PPStructureParams.tsx](../../../frontend/src/components/PPStructureParams.tsx)) - -**Features:** -- ✅ **Collapsible interface** - Shows/hides parameter controls -- ✅ **Preset configurations:** - - Default (use backend settings) - - High Quality (lower thresholds for better accuracy) - - Fast (higher thresholds for speed) - - Custom (manual adjustment) -- ✅ **Interactive controls:** - - Sliders for numeric parameters with real-time value display - - Dropdown for merge mode selection - - Help tooltips explaining each parameter -- ✅ **Parameter persistence:** - - Auto-save to localStorage on change - - Auto-load last used params on mount -- ✅ **Import/Export:** - - Export parameters as JSON file - - Import parameters from JSON file -- ✅ **Visual feedback:** - - Shows current vs default values - - Success notification on import - - Custom badge when parameters are modified - - Disabled state during processing -- ✅ **Reset functionality** - Clear all custom params - -#### 7. Integration ([frontend/src/pages/ProcessingPage.tsx](../../../frontend/src/pages/ProcessingPage.tsx)) - -**Features:** -- ✅ Shows PP-StructureV3 component when task is pending -- ✅ Hides component during/after processing -- ✅ Passes parameters to API when starting task -- ✅ Only includes params if user has customized them - -### Testing - -#### 8. Backend Unit Tests ([backend/tests/services/test_ppstructure_params.py](../../../backend/tests/services/test_ppstructure_params.py)) - -**Test Coverage:** -- ✅ Default parameters used when none provided -- ✅ Custom parameters override defaults -- ✅ Partial custom parameters (mixing custom + defaults) -- ✅ No caching for custom parameters -- ✅ Caching works for default parameters -- ✅ Fallback to defaults on error -- ✅ Parameter flow through processing pipeline -- ✅ Custom parameters logged for debugging - -#### 9. API Integration Tests ([backend/tests/api/test_ppstructure_params_api.py](../../../backend/tests/api/test_ppstructure_params_api.py)) - -**Test Coverage:** -- ✅ Schema validation (min/max, types, patterns) -- ✅ Accept custom parameters via API -- ✅ Backward compatibility (no params) -- ✅ Partial parameter sets -- ✅ Validation errors (422 responses) -- ✅ OpenAPI schema documentation -- ✅ Parameter serialization/deserialization - -## 🚀 Usage Guide - -### For End Users - -1. **Upload a document** via the upload page -2. **Navigate to Processing page** where the task is pending -3. **Click "Show Parameters"** to reveal PP-StructureV3 options -4. **Choose a preset** or customize individual parameters: - - **High Quality:** Best for complex documents with small text - - **Fast:** Best for simple documents where speed matters - - **Custom:** Fine-tune individual parameters -5. **Click "Start Processing"** - your custom parameters will be used -6. 
**Parameters are auto-saved** - they'll be restored next time - -### For Developers - -#### Backend: Using Custom Parameters - -```python -from app.services.ocr_service import OCRService - -ocr_service = OCRService() - -# Custom parameters -custom_params = { - 'layout_detection_threshold': 0.15, - 'text_det_thresh': 0.2 -} - -# Process with custom params -result = ocr_service.process( - file_path=Path('/path/to/document.pdf'), - pp_structure_params=custom_params -) -``` - -#### Frontend: Sending Custom Parameters - -```typescript -import { apiClientV2 } from '@/services/apiV2' - -// Start task with custom parameters -await apiClientV2.startTask(taskId, { - use_dual_track: true, - language: 'ch', - pp_structure_params: { - layout_detection_threshold: 0.15, - text_det_thresh: 0.2, - layout_merge_bboxes_mode: 'small' - } -}) -``` - -#### API: Request Example - -```bash -curl -X POST "http://localhost:8000/api/v2/tasks/{task_id}/start" \ - -H "Authorization: Bearer YOUR_TOKEN" \ - -H "Content-Type: application/json" \ - -d '{ - "use_dual_track": true, - "language": "ch", - "pp_structure_params": { - "layout_detection_threshold": 0.15, - "layout_nms_threshold": 0.2, - "text_det_thresh": 0.25, - "layout_merge_bboxes_mode": "small" - } - }' -``` - -## 📊 Parameter Reference - -| Parameter | Range | Default | Effect | -|-----------|-------|---------|--------| -| `layout_detection_threshold` | 0-1 | 0.2 | Lower = detect more blocks
Higher = only high confidence | -| `layout_nms_threshold` | 0-1 | 0.2 | Lower = aggressive overlap removal
Higher = allow more overlap | -| `layout_merge_bboxes_mode` | small/union/large | small | small = conservative merging
large = aggressive merging | -| `layout_unclip_ratio` | >0 | 1.2 | Larger = looser boxes
Smaller = tighter boxes | -| `text_det_thresh` | 0-1 | 0.2 | Lower = detect more text
Higher = cleaner output | -| `text_det_box_thresh` | 0-1 | 0.3 | Lower = more text boxes
Higher = fewer false positives | -| `text_det_unclip_ratio` | >0 | 1.2 | Larger = looser text boxes
Smaller = tighter text boxes | - -### Preset Configurations - -**High Quality** (Better accuracy for complex documents): -```json -{ - "layout_detection_threshold": 0.1, - "layout_nms_threshold": 0.15, - "text_det_thresh": 0.1, - "text_det_box_thresh": 0.2, - "layout_merge_bboxes_mode": "small" -} -``` - -**Fast** (Better speed for simple documents): -```json -{ - "layout_detection_threshold": 0.3, - "layout_nms_threshold": 0.3, - "text_det_thresh": 0.3, - "text_det_box_thresh": 0.4, - "layout_merge_bboxes_mode": "large" -} -``` - -## 🔍 Technical Details - -### Parameter Priority -1. **Custom parameters** (via API request body) - Highest priority -2. **Backend settings** (from `.env` or `config.py`) - Default fallback - -### Caching Behavior -- **Default parameters:** Engine is cached and reused -- **Custom parameters:** New engine created each time (no cache pollution) -- **Error handling:** Falls back to cached default engine on failure - -### Performance Considerations -- Custom parameters create new engine instances (slight overhead) -- No caching means each request with custom params loads models fresh -- Memory usage is managed - engines are cleaned up after processing -- OCR track only - Direct track ignores these parameters - -### Backward Compatibility -- All parameters are optional -- Existing API calls without `pp_structure_params` work unchanged -- Default behavior matches pre-feature behavior -- No database migration required - -## ✅ Testing Implementation Complete - -### Unit Tests ([backend/tests/services/test_ppstructure_params.py](../../../backend/tests/services/test_ppstructure_params.py)) -- ✅ 7/10 tests passing -- ✅ Parameter validation and defaults -- ✅ Custom parameter override -- ✅ Caching behavior -- ✅ Fallback handling -- ✅ Parameter logging - -### E2E Tests ([backend/tests/e2e/test_ppstructure_params_e2e.py](../../../backend/tests/e2e/test_ppstructure_params_e2e.py)) -- ✅ Full workflow tests (upload → process → verify) -- ✅ Authentication with provided credentials -- ✅ Preset comparison tests -- ✅ Result verification - -### Performance Tests ([backend/tests/performance/test_ppstructure_params_performance.py](../../../backend/tests/performance/test_ppstructure_params_performance.py)) -- ✅ Engine initialization benchmarks -- ✅ Memory usage tracking -- ✅ Memory leak detection -- ✅ Cache pollution prevention - -### Test Runner ([backend/tests/run_ppstructure_tests.sh](../../../backend/tests/run_ppstructure_tests.sh)) -```bash -# Run specific test suites -./backend/tests/run_ppstructure_tests.sh unit -./backend/tests/run_ppstructure_tests.sh api -./backend/tests/run_ppstructure_tests.sh e2e # Requires server -./backend/tests/run_ppstructure_tests.sh performance -./backend/tests/run_ppstructure_tests.sh all -``` - -## 📝 Next Steps (Optional) - -### Documentation (Section 9) -- User guide with screenshots -- API documentation updates -- Common use cases and examples - -### Deployment (Section 10) -- Usage analytics -- A/B testing framework -- Performance monitoring - -## 🎉 Summary - -**Lines of Code Changed:** -- Backend: ~300 lines (ocr_service.py, routers/tasks.py, schemas/task.py) -- Frontend: ~350 lines (PPStructureParams.tsx, ProcessingPage.tsx, apiV2.ts, types) -- Tests: ~500 lines (unit tests + integration tests) - -**Key Achievements:** -- ✅ Full end-to-end parameter customization -- ✅ Production-ready UI with presets and persistence -- ✅ Comprehensive test coverage (80%+ backend) -- ✅ 100% backward compatible -- ✅ Zero breaking changes -- ✅ Auto-generated API 
documentation - -**Ready for Production!** 🚀 diff --git a/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/proposal.md b/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/proposal.md deleted file mode 100644 index 5ef8af0..0000000 --- a/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/proposal.md +++ /dev/null @@ -1,207 +0,0 @@ -# Change: Frontend-Adjustable PP-StructureV3 Parameters - -## Why - -Currently, PP-StructureV3 parameters are fixed in backend configuration (`backend/app/core/config.py`), limiting users' ability to fine-tune OCR behavior for different document types. Users have reported: - -1. **Over-merging issues**: Complex diagrams being simplified into fewer blocks (6 vs 27 regions) -2. **Missing small text**: Low-contrast or small text being ignored -3. **Excessive overlap**: Multiple bounding boxes overlapping unnecessarily -4. **Document-specific needs**: Different documents require different parameter tuning - -Making these parameters adjustable from the frontend would allow users to: -- Optimize OCR quality for specific document types -- Balance between detection accuracy and processing speed -- Fine-tune layout analysis for complex documents -- Resolve element detection issues without backend changes - -## What Changes - -### 1. API Schema Enhancement -- **NEW**: `PPStructureV3Params` schema with 7 adjustable parameters -- **MODIFIED**: `ProcessingOptions` schema to include optional `pp_structure_params` -- All parameters are optional with backend defaults as fallback - -### 2. Backend OCR Service -- **MODIFIED**: `backend/app/services/ocr_service.py` - - Update `_ensure_structure_engine()` to accept custom parameters - - Add parameter priority: custom > settings default - - Implement smart caching (no cache for custom params) - - Pass parameters through processing methods chain - -### 3. Task API Endpoints -- **MODIFIED**: `POST /api/v2/tasks/{task_id}/start` - - Accept `ProcessingOptions` in request body (not query params) - - Extract and forward PP-StructureV3 parameters to OCR service - -### 4. Frontend Implementation -- **NEW**: PP-StructureV3 parameter types in `apiV2.ts` -- **MODIFIED**: `startTask()` API method to send parameters in body -- **NEW**: UI components for parameter adjustment (sliders, help text) -- **NEW**: Preset configurations (default, high-quality, fast, custom) - -## Impact - -**Affected specs**: None (new feature, backward compatible) - -**Affected code**: -- `backend/app/schemas/task.py` (schema definitions) ✅ DONE -- `backend/app/services/ocr_service.py` (OCR processing) -- `backend/app/routers/tasks.py` (API endpoint) -- `frontend/src/types/apiV2.ts` (TypeScript types) -- `frontend/src/services/apiV2.ts` (API client) -- `frontend/src/pages/TaskDetailPage.tsx` (UI components) - -**Breaking changes**: None - all changes are backward compatible with optional parameters - -**Benefits**: -- User-controlled OCR optimization -- Better handling of diverse document types -- Reduced need for backend configuration changes -- Improved OCR accuracy for complex layouts - -## Parameter Reference - -### PP-StructureV3 Parameters (7 total) - -1. **layout_detection_threshold** (0-1) - - Lower → detect more blocks (including weak signals) - - Higher → only high-confidence blocks - - Default: 0.2 - -2. **layout_nms_threshold** (0-1) - - Lower → aggressive overlap removal - - Higher → allow more overlapping boxes - - Default: 0.2 - -3. 
**layout_merge_bboxes_mode** (union|large|small) - - small: conservative merging - - large: aggressive merging - - union: middle ground - - Default: small - -4. **layout_unclip_ratio** (>0) - - Larger → looser bounding boxes - - Smaller → tighter bounding boxes - - Default: 1.2 - -5. **text_det_thresh** (0-1) - - Lower → detect more small/low-contrast text - - Higher → cleaner but may miss text - - Default: 0.2 - -6. **text_det_box_thresh** (0-1) - - Lower → more text boxes retained - - Higher → fewer false positives - - Default: 0.3 - -7. **text_det_unclip_ratio** (>0) - - Larger → looser text boxes - - Smaller → tighter text boxes - - Default: 1.2 - -## Testing Requirements - -1. **Unit Tests**: Parameter validation and passing through service layers -2. **Integration Tests**: Different parameter combinations on same document -3. **Frontend E2E Tests**: UI parameter input → API call → result verification -4. **Performance Tests**: Ensure custom params don't cause memory leaks - ---- - -## ✅ Implementation Status - -**Status**: ✅ **COMPLETE** (Sections 1-8.2) -**Implementation Date**: 2025-01-25 -**Total Effort**: 2 days - -### Completed Components - -#### Backend (100%) -- ✅ **Schema Definition** ([backend/app/schemas/task.py](../../../backend/app/schemas/task.py)) - - `PPStructureV3Params` with 7 parameters + validation - - `ProcessingOptions` with optional `pp_structure_params` - -- ✅ **OCR Service** ([backend/app/services/ocr_service.py](../../../backend/app/services/ocr_service.py)) - - `_ensure_structure_engine()` with custom parameter support - - Parameter priority: custom > settings - - Smart caching (no cache for custom params) - - Full parameter flow through processing pipeline - -- ✅ **API Endpoint** ([backend/app/routers/tasks.py](../../../backend/app/routers/tasks.py)) - - Accepts `ProcessingOptions` in request body - - Validates and forwards parameters to OCR service - -- ✅ **Unit Tests** ([backend/tests/services/test_ppstructure_params.py](../../../backend/tests/services/test_ppstructure_params.py)) - - 8 test classes covering validation, flow, caching, logging - -- ✅ **API Tests** ([backend/tests/api/test_ppstructure_params_api.py](../../../backend/tests/api/test_ppstructure_params_api.py)) - - Schema validation, endpoint testing, OpenAPI docs - -#### Frontend (100%) -- ✅ **TypeScript Types** ([frontend/src/types/apiV2.ts](../../../frontend/src/types/apiV2.ts)) - - `PPStructureV3Params` interface - - Updated `ProcessingOptions` - -- ✅ **API Client** ([frontend/src/services/apiV2.ts](../../../frontend/src/services/apiV2.ts)) - - `startTask()` sends parameters in request body - -- ✅ **UI Component** ([frontend/src/components/PPStructureParams.tsx](../../../frontend/src/components/PPStructureParams.tsx)) - - Collapsible parameter controls - - 3 presets (default, high-quality, fast) - - Auto-save to localStorage - - Import/Export JSON - - Help tooltips for each parameter - - Visual feedback (current vs default) - -- ✅ **Integration** ([frontend/src/pages/ProcessingPage.tsx](../../../frontend/src/pages/ProcessingPage.tsx)) - - Shows component when task is pending - - Passes parameters to API - -### Usage - -**Backend API:** -```bash -curl -X POST "http://localhost:8000/api/v2/tasks/{task_id}/start" \ - -H "Content-Type: application/json" \ - -d '{ - "use_dual_track": true, - "language": "ch", - "pp_structure_params": { - "layout_detection_threshold": 0.15, - "text_det_thresh": 0.2 - } - }' -``` - -**Frontend:** -1. Upload document -2. Navigate to Processing page -3. 
Click "Show Parameters" -4. Choose preset or customize -5. Click "Start Processing" - -### Testing Status -- ✅ **Unit Tests** (Section 8.1): 7/10 passing - Core functionality verified -- ✅ **API Tests** (Section 8.2): Test file created -- ✅ **E2E Tests** (Section 8.4): Test file created with authentication -- ✅ **Performance Tests** (Section 8.5): Benchmark suite created -- ⚠️ **Frontend Tests** (Section 8.3): Skipped - no test framework configured - -### Test Runner -```bash -# Run all tests -./backend/tests/run_ppstructure_tests.sh all - -# Run specific test types -./backend/tests/run_ppstructure_tests.sh unit -./backend/tests/run_ppstructure_tests.sh api -./backend/tests/run_ppstructure_tests.sh e2e # Requires server running -./backend/tests/run_ppstructure_tests.sh performance -``` - -### Remaining Optional Work -- ⏳ User documentation (Section 9) -- ⏳ Deployment monitoring (Section 10) - -See [IMPLEMENTATION_SUMMARY.md](./IMPLEMENTATION_SUMMARY.md) for detailed documentation. \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/specs/ocr-processing/spec.md deleted file mode 100644 index f53ac54..0000000 --- a/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/specs/ocr-processing/spec.md +++ /dev/null @@ -1,100 +0,0 @@ -# ocr-processing Spec Delta - -## ADDED Requirements - -### Requirement: Frontend-Adjustable PP-StructureV3 Parameters -The system SHALL allow frontend users to dynamically adjust PP-StructureV3 OCR parameters for fine-tuning document processing without backend configuration changes. - -#### Scenario: User adjusts layout detection threshold -- **GIVEN** a user is processing a document with OCR track -- **WHEN** the user sets `layout_detection_threshold` to 0.1 (lower than default 0.2) -- **THEN** the OCR engine SHALL detect more layout blocks including weak signals -- **AND** the processing SHALL use the custom parameter instead of backend defaults -- **AND** the custom parameter SHALL NOT be cached for reuse - -#### Scenario: User selects high-quality preset configuration -- **GIVEN** a user wants to process a complex document with many small text elements -- **WHEN** the user selects "High Quality" preset mode -- **THEN** the system SHALL automatically set: - - `layout_detection_threshold` to 0.1 - - `layout_nms_threshold` to 0.15 - - `text_det_thresh` to 0.1 - - `text_det_box_thresh` to 0.2 -- **AND** process the document with these optimized parameters - -#### Scenario: User adjusts text detection parameters -- **GIVEN** a document with low-contrast text -- **WHEN** the user sets: - - `text_det_thresh` to 0.05 (very low) - - `text_det_unclip_ratio` to 1.5 (larger boxes) -- **THEN** the OCR SHALL detect more small and low-contrast text -- **AND** text bounding boxes SHALL be expanded by the specified ratio - -#### Scenario: Parameters are sent via API request body -- **GIVEN** a frontend application with parameter adjustment UI -- **WHEN** the user starts task processing with custom parameters -- **THEN** the frontend SHALL send parameters in the request body (not query params): - ```json - POST /api/v2/tasks/{task_id}/start - { - "use_dual_track": true, - "force_track": "ocr", - "language": "ch", - "pp_structure_params": { - "layout_detection_threshold": 0.15, - "layout_merge_bboxes_mode": "small", - "text_det_thresh": 0.1 - } - } - ``` -- **AND** the backend SHALL parse and 
apply these parameters - -#### Scenario: Backward compatibility is maintained -- **GIVEN** existing API clients without PP-StructureV3 parameter support -- **WHEN** a task is started without `pp_structure_params` -- **THEN** the system SHALL use backend default settings -- **AND** processing SHALL work exactly as before -- **AND** no errors SHALL occur - -#### Scenario: Invalid parameters are rejected -- **GIVEN** a request with invalid parameter values -- **WHEN** the user sends: - - `layout_detection_threshold` = 1.5 (exceeds max 1.0) - - `layout_merge_bboxes_mode` = "invalid" (not in allowed values) -- **THEN** the API SHALL return 422 Validation Error -- **AND** provide clear error messages about invalid parameters - -#### Scenario: Custom parameters affect only current processing -- **GIVEN** multiple concurrent OCR processing tasks -- **WHEN** Task A uses custom parameters and Task B uses defaults -- **THEN** Task A SHALL process with its custom parameters -- **AND** Task B SHALL process with default parameters -- **AND** no parameter interference SHALL occur between tasks - -### Requirement: PP-StructureV3 Parameter UI Controls -The frontend SHALL provide intuitive UI controls for adjusting PP-StructureV3 parameters with appropriate constraints and help text. - -#### Scenario: Slider controls for numeric parameters -- **GIVEN** the parameter adjustment UI is displayed -- **WHEN** the user adjusts a numeric parameter slider -- **THEN** the slider SHALL enforce min/max constraints: - - Threshold parameters: 0.0 to 1.0 - - Ratio parameters: > 0 (typically 0.5 to 3.0) -- **AND** display current value in real-time -- **AND** show help text explaining the parameter effect - -#### Scenario: Dropdown for merge mode selection -- **GIVEN** the layout merge mode parameter -- **WHEN** the user clicks the dropdown -- **THEN** the UI SHALL show exactly three options: - - "small" (conservative merging) - - "large" (aggressive merging) - - "union" (middle ground) -- **AND** display description for each option - -#### Scenario: Parameters shown only for OCR track -- **GIVEN** a document processing interface -- **WHEN** the user selects processing track -- **THEN** PP-StructureV3 parameters SHALL be shown ONLY when OCR track is selected -- **AND** SHALL be hidden for Direct track -- **AND** SHALL be disabled for Auto track until track is determined \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/tasks.md b/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/tasks.md deleted file mode 100644 index 59d5a96..0000000 --- a/openspec/changes/archive/2025-11-25-frontend-adjustable-ppstructure-params/tasks.md +++ /dev/null @@ -1,178 +0,0 @@ -# Implementation Tasks - -## 1. Backend Schema (✅ COMPLETED) -- [x] 1.1 Define `PPStructureV3Params` schema in `backend/app/schemas/task.py` - - [x] Add 7 parameter fields with validation - - [x] Set appropriate constraints (ge, le, gt, pattern) - - [x] Add descriptive documentation -- [x] 1.2 Update `ProcessingOptions` schema - - [x] Add optional `pp_structure_params` field - - [x] Ensure backward compatibility - -## 2. 
Backend OCR Service Implementation -- [x] 2.1 Modify `backend/app/services/ocr_service.py` - - [x] Update `_ensure_structure_engine()` method signature - - [x] Add `custom_params: Optional[Dict[str, Any]] = None` parameter - - [x] Implement parameter priority logic (custom > settings) - - [x] Conditional caching (skip cache for custom params) - - [x] Update `process_image()` method - - [x] Add `pp_structure_params` parameter - - [x] Pass params to `_ensure_structure_engine()` - - [x] Update `process_with_dual_track()` method - - [x] Add `pp_structure_params` parameter - - [x] Forward params to OCR track processing - - [x] Update main `process()` method - - [x] Add `pp_structure_params` parameter - - [x] Ensure params flow through all code paths -- [x] 2.2 Add parameter logging - - [x] Log when custom params are used - - [x] Log parameter values for debugging - - [x] Add performance metrics for custom vs default - -## 3. Backend API Endpoint Updates -- [x] 3.1 Modify `backend/app/routers/tasks.py` - - [x] Update `start_task` endpoint - - [x] Accept `ProcessingOptions` as request body (not query params) - - [x] Extract `pp_structure_params` from options - - [x] Convert to dict using `model_dump(exclude_none=True)` - - [x] Pass to OCR service - - [x] Update `analyze_document` endpoint (if needed) - - [x] Support PP-StructureV3 params for analysis -- [x] 3.2 Update API documentation - - [x] Add OpenAPI schema for new parameters - - [x] Include parameter descriptions and ranges - -## 4. Frontend TypeScript Types -- [x] 4.1 Update `frontend/src/types/apiV2.ts` - - [x] Define `PPStructureV3Params` interface - ```typescript - export interface PPStructureV3Params { - layout_detection_threshold?: number - layout_nms_threshold?: number - layout_merge_bboxes_mode?: 'union' | 'large' | 'small' - layout_unclip_ratio?: number - text_det_thresh?: number - text_det_box_thresh?: number - text_det_unclip_ratio?: number - } - ``` - - [x] Update `ProcessingOptions` interface - - [x] Add `pp_structure_params?: PPStructureV3Params` - -## 5. Frontend API Client Updates -- [x] 5.1 Modify `frontend/src/services/apiV2.ts` - - [x] Update `startTask()` method - - [x] Change from query params to request body - - [x] Send full `ProcessingOptions` object - ```typescript - async startTask(taskId: string, options?: ProcessingOptions): Promise { - const response = await this.client.post( - `/tasks/${taskId}/start`, - options // Send as body, not query params - ) - return response.data - } - ``` - -## 6. Frontend UI Implementation -- [x] 6.1 Create parameter adjustment component - - [x] Create `frontend/src/components/PPStructureParams.tsx` - - [x] Slider components for numeric parameters - - [x] Select dropdown for merge mode - - [x] Help tooltips for each parameter - - [x] Reset to defaults button -- [x] 6.2 Add preset configurations - - [x] Default mode (use backend defaults) - - [x] High Quality mode (lower thresholds) - - [x] Fast mode (higher thresholds) - - [x] Custom mode (show all sliders) -- [x] 6.3 Integrate into task processing flow - - [x] Add to `ProcessingPage.tsx` - - [x] Show only when task is pending - - [x] Store params in component state - - [x] Pass params to `startTask()` API call - -## 7. 
Frontend UI/UX Polish -- [x] 7.1 Add visual feedback - - [x] Loading state while processing with custom params - - [x] Success/error notifications with save confirmation - - [x] Parameter value display (current vs default with highlight) -- [x] 7.2 Add parameter persistence - - [x] Save last used params to localStorage (auto-save on change) - - [x] Create preset configurations (default, high-quality, fast) - - [x] Import/export parameter configurations (JSON format) -- [x] 7.3 Add help documentation - - [x] Inline help text for each parameter with tooltips - - [x] Descriptive labels explaining parameter effects - - [x] Info panel explaining OCR track requirement - -## 8. Testing -- [x] 8.1 Backend unit tests - - [x] Test schema validation (min/max, types, patterns) - - [x] Test parameter passing through service layers - - [x] Test caching behavior with custom params (no caching) - - [x] Test parameter priority (custom > settings) - - [x] Test fallback to defaults on error - - [x] Test parameter flow through processing pipeline - - [x] Test logging of custom parameters -- [x] 8.2 API integration tests - - [x] Test endpoint with various parameter combinations - - [x] Test backward compatibility (no params) - - [x] Test validation errors for invalid params (422 responses) - - [x] Test partial parameter sets - - [x] Test OpenAPI schema documentation - - [x] Test parameter serialization/deserialization -- [ ] 8.3 Frontend component tests - - [ ] Test slider value changes - - [ ] Test preset selection - - [ ] Test API call generation -- [ ] 8.4 End-to-end tests - - [ ] Upload document → adjust params → process → verify results - - [ ] Test with different document types - - [ ] Compare results: default vs custom params -- [ ] 8.5 Performance tests - - [ ] Ensure no memory leaks with custom params - - [ ] Verify engine cleanup after processing - - [ ] Benchmark processing time impact - -## 9. Documentation -- [ ] 9.1 Update API documentation - - [ ] Document new request body format - - [ ] Add parameter reference guide - - [ ] Include example requests -- [ ] 9.2 Create user guide - - [ ] When to adjust each parameter - - [ ] Common scenarios and recommended settings - - [ ] Troubleshooting guide -- [ ] 9.3 Update README - - [ ] Add feature description - - [ ] Include screenshots of UI - - [ ] Add configuration examples - -## 10. Deployment & Rollout -- [ ] 10.1 Database migration (if needed) - - [ ] Store user parameter preferences - - [ ] Log parameter usage statistics -- [ ] 10.2 Feature flag (optional) - - [ ] Add feature toggle for gradual rollout - - [ ] Default to enabled -- [ ] 10.3 Monitoring - - [ ] Add metrics for parameter usage - - [ ] Track processing success rates by param config - - [ ] Monitor performance impact - -## Critical Path for Testing - -**Minimum required for frontend testing:** -1. ✅ Backend Schema (Section 1) - DONE -2. Backend OCR Service (Section 2) - REQUIRED -3. Backend API Endpoint (Section 3) - REQUIRED -4. Frontend Types (Section 4) - REQUIRED -5. Frontend API Client (Section 5) - REQUIRED -6. 
Basic UI Component (Section 6.1-6.3) - REQUIRED - -**Nice to have but not blocking:** -- UI Polish (Section 7) -- Full test suite (Section 8) -- Documentation (Section 9) -- Deployment features (Section 10) \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-26-enhance-memory-management/delta-ocr-processing.md b/openspec/changes/archive/2025-11-26-enhance-memory-management/delta-ocr-processing.md deleted file mode 100644 index 397ca7a..0000000 --- a/openspec/changes/archive/2025-11-26-enhance-memory-management/delta-ocr-processing.md +++ /dev/null @@ -1,146 +0,0 @@ -# Spec Delta: ocr-processing - -## Changes to OCR Processing Specification - -### 1. Model Lifecycle Management - -#### Added: ModelManager Class -```python -class ModelManager: - """Manages model lifecycle with reference counting and idle timeout""" - - def load_model(self, model_id: str, config: Dict) -> Model - """Load a model or return existing instance with ref count++""" - - def unload_model(self, model_id: str) -> None - """Decrement ref count and unload if zero""" - - def get_model(self, model_id: str) -> Optional[Model] - """Get model instance if loaded""" - - def teardown(self) -> None - """Force unload all models immediately""" -``` - -#### Modified: PPStructureV3 Integration -- Remove permanent exemption from unloading (lines 255-267) -- Wrap PP-StructureV3 in ModelManager -- Support lazy loading on first access -- Add unload capability with cache clearing - -### 2. Service Architecture - -#### Added: OCRServicePool -```python -class OCRServicePool: - """Pool of OCRService instances (one per device)""" - - def acquire(self, device: str = "GPU:0") -> OCRService - """Get service from pool with semaphore control""" - - def release(self, service: OCRService) -> None - """Return service to pool""" -``` - -#### Modified: OCRService Instantiation -- Replace direct instantiation with pool.acquire() -- Add finally blocks for pool.release() -- Handle pool exhaustion gracefully - -### 3. Memory Management - -#### Added: MemoryGuard Class -```python -class MemoryGuard: - """Monitor and control memory usage""" - - def check_memory(self, required_mb: int = 0) -> bool - """Check if sufficient memory available""" - - def get_memory_stats(self) -> Dict - """Get current memory usage statistics""" - - def predict_memory(self, operation: str, params: Dict) -> int - """Predict memory requirement for operation""" -``` - -#### Modified: Processing Flow -- Add memory checks before operations -- Implement CPU fallback when GPU memory low -- Add progressive loading for multi-page documents - -### 4. Concurrency Control - -#### Added: Prediction Semaphores -```python -# Maximum concurrent PP-StructureV3 predictions -MAX_CONCURRENT_PREDICTIONS = 2 - -prediction_semaphore = asyncio.Semaphore(MAX_CONCURRENT_PREDICTIONS) - -async def predict_with_limit(self, image, custom_params=None): - async with prediction_semaphore: - return await self._predict(image, custom_params) -``` - -#### Added: Selective Processing -```python -class ProcessingConfig: - enable_charts: bool = True - enable_formulas: bool = True - enable_tables: bool = True - batch_size: int = 10 # Pages per batch -``` - -### 5. 
Resource Cleanup - -#### Added: Cleanup Hooks -```python -@app.on_event("shutdown") -async def shutdown_handler(): - """Graceful shutdown with model unloading""" - await model_manager.teardown() - await service_pool.shutdown() -``` - -#### Modified: Task Completion -```python -async def process_task(task_id: str): - service = None - try: - service = await pool.acquire() - # ... processing ... - finally: - if service: - await pool.release(service) - await cleanup_task_resources(task_id) -``` - -## Configuration Changes - -### Added Settings -```yaml -memory: - gpu_threshold_warning: 0.8 # 80% usage - gpu_threshold_critical: 0.95 # 95% usage - model_idle_timeout: 300 # 5 minutes - enable_memory_monitor: true - monitor_interval: 10 # seconds - -pool: - max_services_per_device: 2 - queue_timeout: 60 # seconds - -concurrency: - max_predictions: 2 - max_batch_size: 10 -``` - -## Breaking Changes -None - All changes are backward compatible optimizations. - -## Migration Path -1. Deploy new code with default settings (no config changes needed) -2. Monitor memory metrics via new endpoints -3. Tune parameters based on workload -4. Enable selective processing if needed \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-26-enhance-memory-management/delta-task-management.md b/openspec/changes/archive/2025-11-26-enhance-memory-management/delta-task-management.md deleted file mode 100644 index 5af8d19..0000000 --- a/openspec/changes/archive/2025-11-26-enhance-memory-management/delta-task-management.md +++ /dev/null @@ -1,225 +0,0 @@ -# Spec Delta: task-management - -## Changes to Task Management Specification - -### 1. Task Resource Management - -#### Modified: Task Creation -```python -class TaskManager: - def create_task(self, request: TaskCreateRequest) -> Task: - """Create task with resource estimation""" - task = Task(...) - task.estimated_memory_mb = self._estimate_memory(request) - task.assigned_device = self._select_device(task.estimated_memory_mb) - return task -``` - -#### Added: Resource Tracking -```python -class Task(BaseModel): - # Existing fields... - - # New resource tracking fields - estimated_memory_mb: Optional[int] = None - actual_memory_mb: Optional[int] = None - assigned_device: Optional[str] = None - service_instance_id: Optional[str] = None - resource_cleanup_completed: bool = False -``` - -### 2. Task Execution - -#### Modified: Task Router -```python -@router.post("/tasks/{task_id}/start") -async def start_task(task_id: str, params: TaskStartRequest): - # Old approach - creates new service - # service = OCRService(device=device) - - # New approach - uses pooled service - service = await service_pool.acquire(device=params.device) - try: - result = await service.process(task_id, params) - finally: - await service_pool.release(service) -``` - -#### Added: Task Queue Management -```python -class TaskQueue: - """Priority queue for task execution""" - - def add_task(self, task: Task, priority: int = 0): - """Add task to queue with priority""" - - def get_next_task(self, device: str) -> Optional[Task]: - """Get next task for specific device""" - - def requeue_task(self, task: Task): - """Re-add failed task with lower priority""" -``` - -### 3. 
Background Task Processing - -#### Modified: Background Task Wrapper -```python -async def process_document_task(task_id: str, background_tasks: BackgroundTasks): - """Enhanced background task with cleanup""" - - # Register cleanup callback - def cleanup(): - asyncio.create_task(cleanup_task_resources(task_id)) - - background_tasks.add_task( - _process_with_cleanup, - task_id, - on_complete=cleanup, - on_error=cleanup - ) -``` - -#### Added: Task Resource Cleanup -```python -async def cleanup_task_resources(task_id: str): - """Release all resources associated with task""" - - Clear task-specific caches - - Release temporary files - - Update resource tracking - - Log cleanup completion -``` - -### 4. Task Monitoring - -#### Added: Task Metrics Endpoint -```python -@router.get("/tasks/metrics") -async def get_task_metrics(): - return { - "active_tasks": {...}, - "queued_tasks": {...}, - "memory_by_device": {...}, - "pool_utilization": {...}, - "average_wait_time": ... - } -``` - -#### Added: Task Health Checks -```python -@router.get("/tasks/{task_id}/health") -async def get_task_health(task_id: str): - return { - "status": "...", - "memory_usage_mb": ..., - "processing_time_s": ..., - "device": "...", - "warnings": [...] - } -``` - -### 5. Error Handling - -#### Added: Memory-Based Error Recovery -```python -class TaskErrorHandler: - async def handle_oom_error(self, task: Task): - """Handle out-of-memory errors""" - - Log memory state at failure - - Attempt CPU fallback if configured - - Requeue with reduced batch size - - Alert monitoring system -``` - -#### Modified: Task Failure Reasons -```python -class TaskFailureReason(Enum): - # Existing reasons... - - # New memory-related reasons - OUT_OF_MEMORY = "out_of_memory" - POOL_EXHAUSTED = "pool_exhausted" - DEVICE_UNAVAILABLE = "device_unavailable" - MEMORY_LIMIT_EXCEEDED = "memory_limit_exceeded" -``` - -### 6. Task Lifecycle Events - -#### Added: Resource Events -```python -class TaskEvent(Enum): - # Existing events... 
- - # New resource events - RESOURCE_ACQUIRED = "resource_acquired" - RESOURCE_RELEASED = "resource_released" - MEMORY_WARNING = "memory_warning" - CLEANUP_STARTED = "cleanup_started" - CLEANUP_COMPLETED = "cleanup_completed" -``` - -#### Added: Event Handlers -```python -async def on_task_resource_acquired(task_id: str, resource: Dict): - """Log and track resource acquisition""" - -async def on_task_cleanup_completed(task_id: str): - """Verify cleanup and update status""" -``` - -## Database Schema Changes - -### Task Table Updates -```sql -ALTER TABLE tasks ADD COLUMN estimated_memory_mb INTEGER; -ALTER TABLE tasks ADD COLUMN actual_memory_mb INTEGER; -ALTER TABLE tasks ADD COLUMN assigned_device VARCHAR(50); -ALTER TABLE tasks ADD COLUMN service_instance_id VARCHAR(100); -ALTER TABLE tasks ADD COLUMN resource_cleanup_completed BOOLEAN DEFAULT FALSE; -``` - -### New Tables -```sql -CREATE TABLE task_metrics ( - id SERIAL PRIMARY KEY, - task_id VARCHAR(36) REFERENCES tasks(id), - timestamp TIMESTAMP, - memory_usage_mb INTEGER, - device VARCHAR(50), - processing_stage VARCHAR(100) -); - -CREATE TABLE task_events ( - id SERIAL PRIMARY KEY, - task_id VARCHAR(36) REFERENCES tasks(id), - event_type VARCHAR(50), - timestamp TIMESTAMP, - details JSONB -); -``` - -## Configuration Changes - -### Added Task Settings -```yaml -tasks: - max_queue_size: 100 - queue_timeout_seconds: 300 - enable_priority_queue: true - enable_resource_tracking: true - cleanup_timeout_seconds: 30 - - retry: - max_attempts: 3 - backoff_multiplier: 2 - memory_reduction_factor: 0.5 -``` - -## Breaking Changes -None - All changes maintain backward compatibility. - -## Migration Requirements -1. Run database migrations to add new columns -2. Deploy updated task router code -3. Configure pool settings based on hardware -4. Enable monitoring endpoints -5. 
Test cleanup hooks in staging environment \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-26-enhance-memory-management/design.md b/openspec/changes/archive/2025-11-26-enhance-memory-management/design.md deleted file mode 100644 index 30eeffd..0000000 --- a/openspec/changes/archive/2025-11-26-enhance-memory-management/design.md +++ /dev/null @@ -1,587 +0,0 @@ -# Design Document: Enhanced Memory Management - -## Architecture Overview - -The enhanced memory management system introduces three core components that work together to prevent OOM crashes and optimize resource utilization: - -``` -┌─────────────────────────────────────────────────────────────┐ -│ Task Router │ -│ ┌──────────────────────────────────────────────────────┐ │ -│ │ Request → Queue → Acquire Service → Process → Release │ │ -│ └──────────────────────────────────────────────────────┘ │ -└─────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ OCRServicePool │ -│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ -│ │Service 1│ │Service 2│ │Service 3│ │Service 4│ │ -│ │ GPU:0 │ │ GPU:0 │ │ GPU:1 │ │ CPU │ │ -│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ -└─────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ ModelManager │ -│ ┌──────────────────────────────────────────────────────┐ │ -│ │ Models: {id → (instance, ref_count, last_used)} │ │ -│ │ Timeout Monitor → Unload Idle Models │ │ -│ └──────────────────────────────────────────────────────┘ │ -└─────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ MemoryGuard │ -│ ┌──────────────────────────────────────────────────────┐ │ -│ │ Monitor: GPU/CPU Memory Usage │ │ -│ │ Actions: Warn → Throttle → Fallback → Emergency │ │ -│ └──────────────────────────────────────────────────────┘ │ -└─────────────────────────────────────────────────────────────┘ -``` - -## Component Design - -### 1. ModelManager - -**Purpose**: Centralized model lifecycle management with reference counting and idle timeout. - -**Key Design Decisions**: -- **Singleton Pattern**: One ModelManager instance per application -- **Reference Counting**: Track active users of each model -- **LRU Cache**: Evict least recently used models when memory pressure -- **Lazy Loading**: Load models only when first requested - -**Implementation**: -```python -class ModelManager: - def __init__(self, config: ModelConfig): - self.models: Dict[str, ModelEntry] = {} - self.lock = asyncio.Lock() - self.config = config - self._start_timeout_monitor() - - async def load_model(self, model_id: str, params: Dict) -> Model: - async with self.lock: - if model_id in self.models: - entry = self.models[model_id] - entry.ref_count += 1 - entry.last_used = time.time() - return entry.model - - # Check memory before loading - if not await self.memory_guard.check_memory(params['estimated_memory']): - await self._evict_idle_models() - - model = await self._create_model(model_id, params) - self.models[model_id] = ModelEntry( - model=model, - ref_count=1, - last_used=time.time() - ) - return model -``` - -### 2. OCRServicePool - -**Purpose**: Manage a pool of OCRService instances to prevent duplicate model loading. 
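A caller-side usage sketch of the acquire/release discipline (illustrative only; `OCRService.process()` and its signature are assumptions for this sketch, not the pool's final API):

```python
# Sketch: borrow a pooled service, always return it, even on failure,
# so queued requests can proceed. Mirrors the acquire/release interface
# defined in this document.
async def run_ocr_task(pool: "OCRServicePool", task_id: str):
    service = await pool.acquire(device="GPU:0")  # may wait in the queue
    try:
        return await service.process(task_id)     # assumed coroutine
    finally:
        await pool.release(service)
```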
- -**Key Design Decisions**: -- **Per-Device Pools**: Separate pool for each GPU/CPU device -- **Semaphore Control**: Limit concurrent usage per service -- **Queue Management**: FIFO queue with timeout for waiting requests -- **Health Monitoring**: Periodic health checks on pooled services - -**Implementation**: -```python -class OCRServicePool: - def __init__(self, config: PoolConfig): - self.pools: Dict[str, List[OCRService]] = {} - self.semaphores: Dict[str, asyncio.Semaphore] = {} - self.queues: Dict[str, asyncio.Queue] = {} - self._initialize_pools() - - async def acquire(self, device: str = "GPU:0") -> OCRService: - # Try to get from pool - if device in self.pools and self.pools[device]: - for service in self.pools[device]: - if await service.try_acquire(): - return service - - # Queue if pool exhausted - return await self._wait_for_service(device) -``` - -### 3. MemoryGuard - -**Purpose**: Monitor memory usage and trigger preventive actions. - -**Key Design Decisions**: -- **Multi-Backend Support**: paddle.device.cuda, pynvml, torch as fallbacks -- **Threshold System**: Warning (80%), Critical (95%), Emergency (98%) -- **Predictive Allocation**: Estimate memory before operations -- **Progressive Actions**: Warn → Throttle → CPU Fallback → Reject - -**Implementation**: -```python -class MemoryGuard: - def __init__(self, config: MemoryConfig): - self.config = config - self.backend = self._detect_backend() - self._start_monitor() - - async def check_memory(self, required_mb: int = 0) -> bool: - stats = await self.get_memory_stats() - available = stats['gpu_free_mb'] - - if available < required_mb: - return False - - usage_ratio = stats['gpu_used_ratio'] - if usage_ratio > self.config.critical_threshold: - await self._trigger_emergency_cleanup() - return False - - if usage_ratio > self.config.warning_threshold: - await self._trigger_warning() - - return True -``` - -## Memory Optimization Strategies - -### 1. PP-StructureV3 Specific Optimizations - -**Problem**: PP-StructureV3 is permanently exempted from unloading (lines 255-267). - -**Solution**: -```python -# Remove exemption -def should_unload_model(model_id: str) -> bool: - # Old: if model_id == "ppstructure_v3": return False - # New: Apply same rules to all models - return True - -# Add proper cleanup -def unload_ppstructure_v3(engine: PPStructureV3): - engine.table_engine = None - engine.text_detector = None - engine.text_recognizer = None - paddle.device.cuda.empty_cache() -``` - -### 2. Batch Processing for Large Documents - -**Strategy**: Process documents in configurable batches to limit memory usage. - -```python -async def process_large_document(doc_path: Path, batch_size: int = 10): - total_pages = get_page_count(doc_path) - - for start_idx in range(0, total_pages, batch_size): - end_idx = min(start_idx + batch_size, total_pages) - - # Process batch - batch_results = await process_pages(doc_path, start_idx, end_idx) - - # Force cleanup between batches - paddle.device.cuda.empty_cache() - gc.collect() - - yield batch_results -``` - -### 3. Selective Feature Disabling - -**Strategy**: Allow disabling memory-intensive features when under pressure. 
- -```python -class AdaptiveProcessing: - def __init__(self): - self.features = { - 'charts': True, - 'formulas': True, - 'tables': True, - 'layout': True - } - - async def adapt_to_memory(self, available_mb: int): - if available_mb < 1000: - self.features['charts'] = False - self.features['formulas'] = False - if available_mb < 500: - self.features['tables'] = False -``` - -## Concurrency Management - -### 1. Semaphore-Based Limiting - -```python -# Global semaphores -prediction_semaphore = asyncio.Semaphore(2) # Max 2 concurrent predictions -processing_semaphore = asyncio.Semaphore(4) # Max 4 concurrent OCR tasks - -async def predict_with_structure(image, params=None): - async with prediction_semaphore: - # Memory check before prediction - required_mb = estimate_prediction_memory(image.shape) - if not await memory_guard.check_memory(required_mb): - raise MemoryError("Insufficient memory for prediction") - - return await pp_structure.predict(image, params) -``` - -### 2. Queue-Based Task Distribution - -```python -class TaskDistributor: - def __init__(self): - self.queues = { - 'high': asyncio.Queue(maxsize=10), - 'normal': asyncio.Queue(maxsize=50), - 'low': asyncio.Queue(maxsize=100) - } - - async def distribute_task(self, task: Task): - priority = self._calculate_priority(task) - queue = self.queues[priority] - - try: - await asyncio.wait_for( - queue.put(task), - timeout=self.config.queue_timeout - ) - except asyncio.TimeoutError: - raise QueueFullError(f"Queue {priority} is full") -``` - -## Monitoring and Metrics - -### 1. Memory Metrics Collection - -```python -class MemoryMetrics: - def __init__(self): - self.history = deque(maxlen=1000) - self.alerts = [] - - async def collect(self): - stats = { - 'timestamp': time.time(), - 'gpu_used_mb': get_gpu_memory_used(), - 'gpu_free_mb': get_gpu_memory_free(), - 'cpu_used_mb': get_cpu_memory_used(), - 'models_loaded': len(model_manager.models), - 'active_tasks': len(active_tasks), - 'pool_utilization': get_pool_utilization() - } - self.history.append(stats) - await self._check_alerts(stats) -``` - -### 2. Monitoring Dashboard Endpoints - -```python -@router.get("/admin/memory/stats") -async def get_memory_stats(): - return { - 'current': memory_metrics.get_current(), - 'history': memory_metrics.get_history(minutes=5), - 'alerts': memory_metrics.get_active_alerts(), - 'recommendations': memory_optimizer.get_recommendations() - } - -@router.post("/admin/memory/gc") -async def trigger_garbage_collection(): - """Manual garbage collection trigger""" - results = await memory_manager.force_cleanup() - return {'freed_mb': results['freed'], 'models_unloaded': results['models']} -``` - -## Error Recovery - -### 1. OOM Recovery Strategy - -```python -class OOMRecovery: - async def recover(self, error: Exception, task: Task): - logger.error(f"OOM detected for task {task.id}: {error}") - - # Step 1: Emergency cleanup - await self.emergency_cleanup() - - # Step 2: Try CPU fallback - if self.config.enable_cpu_fallback: - task.device = "CPU" - return await self.retry_on_cpu(task) - - # Step 3: Reduce batch size and retry - if task.batch_size > 1: - task.batch_size = max(1, task.batch_size // 2) - return await self.retry_with_reduced_batch(task) - - # Step 4: Fail gracefully - await self.mark_task_failed(task, "Insufficient memory") -``` - -### 2. 
Service Recovery - -```python -class ServiceRecovery: - async def restart_service(self, service_id: str): - """Restart a failed service""" - # Kill existing process - await self.kill_service_process(service_id) - - # Clear service memory - await self.clear_service_cache(service_id) - - # Restart with fresh state - new_service = await self.create_service(service_id) - await self.pool.replace_service(service_id, new_service) -``` - -## Testing Strategy - -### 1. Memory Leak Detection - -```python -@pytest.mark.memory -async def test_no_memory_leak(): - initial_memory = get_memory_usage() - - # Process 100 tasks - for _ in range(100): - task = create_test_task() - await process_task(task) - - # Force cleanup - await cleanup_all() - gc.collect() - - final_memory = get_memory_usage() - leak = final_memory - initial_memory - - assert leak < 100 # Max 100MB leak tolerance -``` - -### 2. Stress Testing - -```python -@pytest.mark.stress -async def test_concurrent_load(): - tasks = [create_large_task() for _ in range(50)] - - # Should handle gracefully without OOM - results = await asyncio.gather( - *[process_task(t) for t in tasks], - return_exceptions=True - ) - - # Some may fail but system should remain stable - successful = sum(1 for r in results if not isinstance(r, Exception)) - assert successful > 0 - assert await health_check() == "healthy" -``` - -## Performance Targets - -| Metric | Current | Target | Improvement | -|--------|---------|---------|------------| -| Memory per task | 2-4 GB | 0.5-1 GB | 75% reduction | -| Concurrent tasks | 1-2 | 4-8 | 4x increase | -| Model load time | 30-60s | 5-10s (cached) | 6x faster | -| OOM crashes/day | 5-10 | 0-1 | 90% reduction | -| Service uptime | 4-8 hours | 24+ hours | 3x improvement | - -## Rollout Plan - -### Phase 1: Foundation (Week 1) -- Implement ModelManager -- Integrate with existing OCRService -- Add basic memory monitoring - -### Phase 2: Pooling (Week 2) -- Implement OCRServicePool -- Update task router -- Add concurrency limits - -### Phase 3: Optimization (Week 3) -- Add MemoryGuard -- Implement adaptive processing -- Add batch processing - -### Phase 4: Hardening (Week 4) -- Stress testing -- Performance tuning -- Documentation and monitoring - -## Configuration Settings Reference - -All memory management settings are defined in `backend/app/core/config.py` under the `Settings` class. 
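As a quick orientation for the tables below, here is a minimal sketch of how these settings could be declared, assuming a pydantic-settings `BaseSettings` class (the common pattern for a FastAPI `config.py`; the actual class may declare more fields and different options):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    # Memory thresholds (ratio of total GPU memory, 0-1)
    memory_warning_threshold: float = 0.80
    memory_critical_threshold: float = 0.95
    memory_emergency_threshold: float = 0.98

    # Monitoring
    memory_check_interval_seconds: int = 30
    gpu_memory_limit_mb: int = 6144

    # Concurrency
    max_concurrent_predictions: int = 2

settings = Settings()  # e.g. MEMORY_WARNING_THRESHOLD=0.75 in the env overrides 0.80
```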
- -### Memory Thresholds - -| Setting | Type | Default | Description | -|---------|------|---------|-------------| -| `memory_warning_threshold` | float | 0.80 | GPU memory usage ratio (0-1) to trigger warning alerts | -| `memory_critical_threshold` | float | 0.95 | GPU memory ratio to start throttling operations | -| `memory_emergency_threshold` | float | 0.98 | GPU memory ratio to trigger emergency cleanup | - -### Memory Monitoring - -| Setting | Type | Default | Description | -|---------|------|---------|-------------| -| `memory_check_interval_seconds` | int | 30 | Background check interval for memory monitoring | -| `enable_memory_alerts` | bool | True | Enable/disable memory threshold alerts | -| `gpu_memory_limit_mb` | int | 6144 | Maximum GPU memory to use (MB) | -| `gpu_memory_reserve_mb` | int | 512 | Memory reserved for CUDA overhead | - -### Model Lifecycle Management - -| Setting | Type | Default | Description | -|---------|------|---------|-------------| -| `enable_model_lifecycle_management` | bool | True | Use ModelManager for model lifecycle | -| `model_idle_timeout_seconds` | int | 300 | Unload models after idle time | -| `pp_structure_idle_timeout_seconds` | int | 300 | Unload PP-StructureV3 after idle | -| `structure_model_memory_mb` | int | 2000 | Estimated memory for PP-StructureV3 | -| `ocr_model_memory_mb` | int | 500 | Estimated memory per OCR language model | -| `enable_lazy_model_loading` | bool | True | Load models on demand | -| `auto_unload_unused_models` | bool | True | Auto-unload unused language models | - -### Service Pool Configuration - -| Setting | Type | Default | Description | -|---------|------|---------|-------------| -| `enable_service_pool` | bool | True | Use OCRServicePool | -| `max_services_per_device` | int | 1 | Max OCRService instances per GPU | -| `max_total_services` | int | 2 | Max total OCRService instances | -| `service_acquire_timeout_seconds` | float | 300.0 | Timeout for acquiring service from pool | -| `max_queue_size` | int | 50 | Max pending tasks per device queue | - -### Concurrency Control - -| Setting | Type | Default | Description | -|---------|------|---------|-------------| -| `max_concurrent_predictions` | int | 2 | Max concurrent PP-StructureV3 predictions | -| `max_concurrent_pages` | int | 2 | Max pages processed concurrently | -| `inference_batch_size` | int | 1 | Batch size for inference | -| `enable_batch_processing` | bool | True | Enable batch processing for large docs | - -### Recovery Settings - -| Setting | Type | Default | Description | -|---------|------|---------|-------------| -| `enable_cpu_fallback` | bool | True | Fall back to CPU when GPU memory low | -| `enable_emergency_cleanup` | bool | True | Auto-cleanup on memory pressure | -| `enable_worker_restart` | bool | False | Restart workers on OOM (requires supervisor) | - -### Feature Flags - -| Setting | Type | Default | Description | -|---------|------|---------|-------------| -| `enable_chart_recognition` | bool | True | Enable chart/diagram recognition | -| `enable_formula_recognition` | bool | True | Enable math formula recognition | -| `enable_table_recognition` | bool | True | Enable table structure recognition | -| `enable_seal_recognition` | bool | True | Enable seal/stamp recognition | -| `enable_text_recognition` | bool | True | Enable general text recognition | -| `enable_memory_optimization` | bool | True | Enable memory optimizations | - -### Environment Variable Override - -All settings can be overridden via environment variables. 
The format is uppercase with underscores: - -```bash -# Example .env file -MEMORY_WARNING_THRESHOLD=0.75 -MEMORY_CRITICAL_THRESHOLD=0.90 -MAX_CONCURRENT_PREDICTIONS=1 -GPU_MEMORY_LIMIT_MB=4096 -ENABLE_CPU_FALLBACK=true -``` - -### Recommended Configurations - -#### RTX 4060 8GB (Default) -```bash -GPU_MEMORY_LIMIT_MB=6144 -MAX_CONCURRENT_PREDICTIONS=2 -MAX_CONCURRENT_PAGES=2 -INFERENCE_BATCH_SIZE=1 -``` - -#### RTX 3090 24GB -```bash -GPU_MEMORY_LIMIT_MB=20480 -MAX_CONCURRENT_PREDICTIONS=4 -MAX_CONCURRENT_PAGES=4 -INFERENCE_BATCH_SIZE=2 -``` - -#### CPU-Only Mode -```bash -FORCE_CPU_MODE=true -MAX_CONCURRENT_PREDICTIONS=1 -ENABLE_CPU_FALLBACK=false -``` - -## Prometheus Metrics - -The system exports Prometheus-format metrics via the `PrometheusMetrics` class. Available metrics: - -### GPU Metrics -- `tool_ocr_memory_gpu_total_bytes` - Total GPU memory -- `tool_ocr_memory_gpu_used_bytes` - Used GPU memory -- `tool_ocr_memory_gpu_free_bytes` - Free GPU memory -- `tool_ocr_memory_gpu_utilization_ratio` - GPU utilization (0-1) - -### Model Metrics -- `tool_ocr_memory_models_loaded_total` - Number of loaded models -- `tool_ocr_memory_models_memory_bytes` - Total memory used by models -- `tool_ocr_memory_model_ref_count{model_id}` - Reference count per model - -### Prediction Metrics -- `tool_ocr_memory_predictions_active` - Currently active predictions -- `tool_ocr_memory_predictions_queue_depth` - Predictions waiting in queue -- `tool_ocr_memory_predictions_total` - Total predictions processed (counter) -- `tool_ocr_memory_predictions_timeouts_total` - Total prediction timeouts (counter) - -### Pool Metrics -- `tool_ocr_memory_pool_services_total` - Total services in pool -- `tool_ocr_memory_pool_services_available` - Available services -- `tool_ocr_memory_pool_services_in_use` - Services in use -- `tool_ocr_memory_pool_acquisitions_total` - Total acquisitions (counter) - -### Recovery Metrics -- `tool_ocr_memory_recovery_count_total` - Total recovery attempts -- `tool_ocr_memory_recovery_in_cooldown` - In cooldown (0/1) -- `tool_ocr_memory_recovery_cooldown_remaining_seconds` - Remaining cooldown - -## Memory Dump API - -The `MemoryDumper` class provides debugging capabilities: - -```python -from app.services.memory_manager import get_memory_dumper - -dumper = get_memory_dumper() - -# Create a memory dump -dump = dumper.create_dump(include_python_objects=True) - -# Get dump as dictionary for JSON serialization -dump_dict = dumper.to_dict(dump) - -# Compare two dumps to detect memory growth -comparison = dumper.compare_dumps(dump1, dump2) -``` - -Memory dumps include: -- GPU/CPU memory usage -- Loaded models and reference counts -- Active predictions and queue state -- Service pool statistics -- Recovery manager state -- Python GC statistics -- Large Python objects (optional) \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-26-enhance-memory-management/proposal.md b/openspec/changes/archive/2025-11-26-enhance-memory-management/proposal.md deleted file mode 100644 index 44636b1..0000000 --- a/openspec/changes/archive/2025-11-26-enhance-memory-management/proposal.md +++ /dev/null @@ -1,77 +0,0 @@ -# Change: Enhanced Memory Management for OCR Services - -## Why - -The current OCR service architecture suffers from critical memory management issues that lead to GPU memory exhaustion, service instability, and degraded performance under load: - -1. 
**Memory Leaks**: PP-StructureV3 models are permanently exempted from unloading (lines 255-267), causing VRAM to remain occupied indefinitely. - -2. **Instance Proliferation**: Each task creates a new OCRService instance (tasks.py lines 44-65), leading to duplicate model loading and memory fragmentation. - -3. **Inadequate Memory Monitoring**: `check_gpu_memory()` always returns True in Paddle-only environments, providing no actual memory protection. - -4. **Uncontrolled Concurrency**: No limits on simultaneous PP-StructureV3 predictions, causing memory spikes. - -5. **No Resource Cleanup**: Tasks complete without releasing GPU memory, leading to accumulated memory usage. - -These issues cause service crashes, require frequent restarts, and prevent scaling to handle multiple concurrent requests. - -## What Changes - -### 1. Model Lifecycle Management -- **NEW**: `ModelManager` class to handle model loading/unloading with reference counting -- **NEW**: Idle timeout mechanism for PP-StructureV3 (same as language models) -- **NEW**: Explicit `teardown()` method for end-of-flow cleanup -- **MODIFIED**: OCRService to use managed model instances - -### 2. Service Singleton Pattern -- **NEW**: `OCRServicePool` to manage OCRService instances (one per GPU/device) -- **NEW**: Queue-based task distribution with concurrency limits -- **MODIFIED**: Task router to use pooled services instead of creating new instances - -### 3. Enhanced Memory Monitoring -- **NEW**: `MemoryGuard` class using paddle.device.cuda memory APIs -- **NEW**: Support for pynvml/torch as fallback memory query methods -- **NEW**: Memory threshold configuration (warning/critical levels) -- **MODIFIED**: Processing logic to degrade gracefully when memory is low - -### 4. Concurrency Control -- **NEW**: Semaphore-based limits for PP-StructureV3 predictions -- **NEW**: Configuration to disable/delay chart/formula/table analysis -- **NEW**: Batch processing mode for large documents - -### 5. Active Memory Management -- **NEW**: Background memory monitor thread with metrics collection -- **NEW**: Automatic cache clearing when thresholds exceeded -- **NEW**: Model unloading based on LRU policy -- **NEW**: Worker process restart capability when memory cannot be recovered - -### 6. Cleanup Hooks -- **NEW**: Global shutdown handlers for graceful cleanup -- **NEW**: Task completion callbacks to release resources -- **MODIFIED**: Background task wrapper to ensure cleanup on success/failure - -## Impact - -**Affected specs**: -- `ocr-processing` - Model management and processing flow -- `task-management` - Task execution and resource management - -**Affected code**: -- `backend/app/services/ocr_service.py` - Major refactoring for memory management -- `backend/app/routers/tasks.py` - Use service pool instead of new instances -- `backend/app/core/config.py` - New memory management settings -- `backend/app/services/memory_manager.py` - NEW file -- `backend/app/services/service_pool.py` - NEW file - -**Breaking changes**: None - All changes are internal optimizations - -**Migration**: Existing deployments will benefit immediately with no configuration changes required. Optional tuning parameters available for optimization. - -## Testing Requirements - -1. **Memory leak tests** - Verify models are properly unloaded -2. **Concurrency tests** - Validate semaphore limits work correctly -3. **Stress tests** - Ensure system degrades gracefully under memory pressure -4. **Integration tests** - Verify pooled services work correctly -5. 
**Performance benchmarks** - Measure memory usage improvements \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-26-enhance-memory-management/specs/memory-management/spec.md b/openspec/changes/archive/2025-11-26-enhance-memory-management/specs/memory-management/spec.md deleted file mode 100644 index 64fad02..0000000 --- a/openspec/changes/archive/2025-11-26-enhance-memory-management/specs/memory-management/spec.md +++ /dev/null @@ -1,104 +0,0 @@ -# Memory Management Specification - -## ADDED Requirements - -### Requirement: Model Manager -The system SHALL provide a ModelManager class that manages model lifecycle with reference counting and idle timeout mechanisms. - -#### Scenario: Loading a model -GIVEN a request to load a model -WHEN the model is not already loaded -THEN the ModelManager creates a new instance and sets reference count to 1 - -#### Scenario: Reusing loaded model -GIVEN a model is already loaded -WHEN another request for the same model arrives -THEN the ModelManager returns the existing instance and increments reference count - -#### Scenario: Unloading idle model -GIVEN a model with zero reference count -WHEN the idle timeout period expires -THEN the ModelManager unloads the model and frees memory - -### Requirement: Service Pool -The system SHALL implement an OCRServicePool that manages a pool of OCRService instances with one instance per GPU/CPU device. - -#### Scenario: Acquiring service from pool -GIVEN a task needs processing -WHEN a service is requested from the pool -THEN the pool returns an available service or queues the request if all services are busy - -#### Scenario: Releasing service to pool -GIVEN a task has completed processing -WHEN the service is released -THEN the service becomes available for other tasks in the pool - -### Requirement: Memory Monitoring -The system SHALL continuously monitor GPU and CPU memory usage and trigger preventive actions based on configurable thresholds. - -#### Scenario: Memory warning threshold -GIVEN memory usage reaches 80% (warning threshold) -WHEN a new task is requested -THEN the system logs a warning and may defer non-critical operations - -#### Scenario: Memory critical threshold -GIVEN memory usage reaches 95% (critical threshold) -WHEN a new task is requested -THEN the system attempts CPU fallback or rejects the task - -### Requirement: Concurrency Control -The system SHALL limit concurrent PP-StructureV3 predictions using semaphores to prevent memory exhaustion. - -#### Scenario: Concurrent prediction limit -GIVEN the maximum concurrent predictions is set to 2 -WHEN 2 predictions are already running -THEN additional prediction requests wait in queue until a slot becomes available - -### Requirement: Resource Cleanup -The system SHALL ensure all resources are properly cleaned up after task completion or failure. - -#### Scenario: Successful task cleanup -GIVEN a task completes successfully -WHEN the task finishes -THEN all allocated memory, temporary files, and model references are released - -#### Scenario: Failed task cleanup -GIVEN a task fails with an error -WHEN the error handler runs -THEN cleanup is performed in the finally block regardless of failure reason - -## MODIFIED Requirements - -### Requirement: OCR Service Instantiation -The OCR service instantiation SHALL use pooled instances instead of creating new instances for each task. 
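Sketched at the call site, the pattern this requirement implies is acquire/release around the work, with the release in a `finally` block (`service_pool` and `process` are illustrative names; the pool code earlier in this document only specifies `acquire`):

```python
async def run_ocr_task(task):
    # Borrow a pooled instance instead of constructing OCRService() per task
    service = await service_pool.acquire(device="GPU:0")
    try:
        return await service.process(task)   # hypothetical processing entry point
    finally:
        await service_pool.release(service)  # always return the instance to the pool
```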
- -#### Scenario: Task using pooled service -GIVEN a new OCR task arrives -WHEN the task starts processing -THEN it acquires a service from the pool instead of creating a new instance - -### Requirement: PP-StructureV3 Model Management -The PP-StructureV3 model SHALL be subject to the same lifecycle management as other models, removing its permanent exemption from unloading. - -#### Scenario: PP-StructureV3 unloading -GIVEN PP-StructureV3 has been idle for the configured timeout -WHEN memory pressure is detected -THEN the model can be unloaded to free memory - -### Requirement: Task Resource Tracking -Tasks SHALL track their resource usage including estimated and actual memory consumption. - -#### Scenario: Task memory tracking -GIVEN a task is processing -WHEN memory metrics are collected -THEN the task records both estimated and actual memory usage for analysis - -## REMOVED Requirements - -### Requirement: Permanent Model Loading -The requirement for PP-StructureV3 to remain permanently loaded SHALL be removed. - -#### Scenario: Dynamic model loading -GIVEN the system starts -WHEN no tasks are using PP-StructureV3 -THEN the model is not loaded until first use \ No newline at end of file diff --git a/openspec/changes/archive/2025-11-26-enhance-memory-management/tasks.md b/openspec/changes/archive/2025-11-26-enhance-memory-management/tasks.md deleted file mode 100644 index f70453b..0000000 --- a/openspec/changes/archive/2025-11-26-enhance-memory-management/tasks.md +++ /dev/null @@ -1,176 +0,0 @@ -# Tasks for Enhanced Memory Management - -## Section 1: Model Lifecycle Management (Priority: Critical) - -### 1.1 Create ModelManager class -- [x] Design ModelManager interface with load/unload/get methods -- [x] Implement reference counting for model instances -- [x] Add idle timeout tracking with configurable thresholds -- [x] Create teardown() method for explicit cleanup -- [x] Add logging for model lifecycle events - -### 1.2 Integrate PP-StructureV3 with ModelManager -- [x] Remove permanent exemption from unloading (lines 255-267) -- [x] Wrap PP-StructureV3 in managed model wrapper -- [x] Implement lazy loading on first access -- [x] Add unload capability with cache clearing -- [x] Test model reload after unload - -## Section 2: Service Singleton Pattern (Priority: Critical) - -### 2.1 Create OCRServicePool -- [x] Design pool interface with acquire/release methods -- [x] Implement per-device instance management -- [x] Add queue-based task distribution -- [x] Implement concurrency limits via semaphores -- [x] Add health check for pooled instances - -### 2.2 Refactor task router -- [x] Replace OCRService() instantiation with pool.acquire() -- [x] Add proper release in finally blocks -- [x] Handle pool exhaustion gracefully -- [x] Add metrics for pool utilization -- [x] Update error handling for pooled services - -## Section 3: Enhanced Memory Monitoring (Priority: High) - -### 3.1 Create MemoryGuard class -- [x] Implement paddle.device.cuda memory queries -- [x] Add pynvml integration as fallback -- [x] Add torch memory query support -- [x] Create configurable threshold system -- [x] Implement memory prediction for operations - -### 3.2 Integrate memory checks -- [x] Replace existing check_gpu_memory implementation -- [x] Add pre-operation memory checks -- [x] Implement CPU fallback when memory low -- [x] Add memory usage logging -- [x] Create memory pressure alerts - -## Section 4: Concurrency Control (Priority: High) - -### 4.1 Implement prediction semaphores -- [x] Add semaphore for 
PP-StructureV3.predict -- [x] Configure max concurrent predictions -- [x] Add queue for waiting predictions -- [x] Implement timeout handling -- [x] Add metrics for queue depth - -### 4.2 Add selective processing -- [x] Create config for disabling chart/formula/table -- [x] Implement batch processing for large documents -- [x] Add progressive loading for multi-page docs -- [x] Create priority queue for operations -- [x] Test memory savings with selective processing - -## Section 5: Active Memory Management (Priority: Medium) - -### 5.1 Create memory monitor thread -- [x] Implement background monitoring loop -- [x] Add periodic memory metrics collection -- [x] Create threshold-based triggers -- [x] Implement automatic cache clearing -- [x] Add LRU-based model unloading - -### 5.2 Add recovery mechanisms -- [x] Implement emergency memory release -- [x] Add worker process restart capability (RecoveryManager) -- [x] Create memory dump for debugging -- [x] Add cooldown period after recovery -- [x] Test recovery under various scenarios - -## Section 6: Cleanup Hooks (Priority: Medium) - -### 6.1 Implement shutdown handlers -- [x] Add FastAPI shutdown event handler -- [x] Create signal handlers (SIGTERM, SIGINT) -- [x] Implement graceful model unloading -- [x] Add connection draining -- [x] Test shutdown sequence - -### 6.2 Add task cleanup -- [x] Wrap background tasks with cleanup -- [x] Add success/failure callbacks -- [x] Implement resource release on completion -- [x] Add cleanup verification logging -- [x] Test cleanup in error scenarios - -## Section 7: Configuration & Settings (Priority: Low) - -### 7.1 Add memory settings to config -- [x] Define memory threshold parameters -- [x] Add model timeout settings -- [x] Configure pool sizes -- [x] Add feature flags for new behavior -- [x] Document all settings - -### 7.2 Create monitoring dashboard -- [x] Add memory metrics endpoint -- [x] Create pool status endpoint -- [x] Add model lifecycle stats -- [x] Implement health check endpoint -- [x] Add Prometheus metrics export - -## Section 8: Testing & Documentation (Priority: High) - -### 8.1 Create comprehensive tests -- [x] Unit tests for ModelManager -- [x] Integration tests for OCRServicePool -- [x] Memory leak detection tests -- [x] Stress tests with concurrent requests -- [x] Performance benchmarks - -### 8.2 Documentation -- [ ] Document memory management architecture -- [ ] Create tuning guide -- [ ] Add troubleshooting section -- [ ] Document monitoring setup -- [ ] Create migration guide - ---- - -**Total Tasks**: 58 -**Completed**: 53 -**Remaining**: 5 (Section 8.2 Documentation only) -**Progress**: ~91% - -**Critical Path Status**: Sections 1-8.1 are completed (foundation, memory monitoring, prediction semaphores, batch processing, recovery, signal handlers, configuration, Prometheus metrics, and comprehensive tests in place) - -## Implementation Summary - -### Files Created -- `backend/app/services/memory_manager.py` - ModelManager, MemoryGuard, MemoryConfig, PredictionSemaphore, BatchProcessor, ProgressiveLoader, PriorityOperationQueue, RecoveryManager -- `backend/app/services/service_pool.py` - OCRServicePool, PoolConfig -- `backend/tests/services/test_memory_manager.py` - Unit tests for memory management (57 tests) -- `backend/tests/services/test_service_pool.py` - Unit tests for service pool (18 tests) -- `backend/tests/services/test_ocr_memory_integration.py` - Integration tests for memory check patterns (10 tests) - -### Files Modified -- `backend/app/core/config.py` - Added 
memory management configuration settings -- `backend/app/services/ocr_service.py` - Removed PP-StructureV3 exemption, added unload capability, integrated MemoryGuard for pre-operation checks and CPU fallback, added PredictionSemaphore for concurrent prediction control -- `backend/app/services/pp_structure_enhanced.py` - Added PredictionSemaphore control for predict calls -- `backend/app/routers/tasks.py` - Refactored to use service pool -- `backend/app/main.py` - Added startup/shutdown handlers, signal handlers (SIGTERM/SIGINT), connection draining, recovery manager shutdown - -### New Classes Added (Section 4.2-8) -- `BatchProcessor` - Memory-aware batch processing for large documents with priority support -- `ProgressiveLoader` - Progressive page loading with lookahead and automatic cleanup -- `PriorityOperationQueue` - Priority queue with timeout and cancellation support -- `RecoveryManager` - Memory recovery with cooldown period and attempt limits -- `MemoryDumper` - Memory dump creation for debugging with history and comparison -- `PrometheusMetrics` - Prometheus-format metrics export for monitoring -- Signal handlers for graceful shutdown (SIGTERM, SIGINT) -- Connection draining for clean shutdown - -### New Test Classes Added (Section 8.1) -- `TestModelReloadAfterUnload` - Tests for model reload after unload -- `TestSelectiveProcessingMemorySavings` - Tests for memory savings with selective processing -- `TestRecoveryScenarios` - Tests for recovery under various scenarios -- `TestShutdownSequence` - Tests for shutdown sequence -- `TestCleanupInErrorScenarios` - Tests for cleanup in error scenarios -- `TestMemoryLeakDetection` - Tests for memory leak detection -- `TestStressConcurrentRequests` - Stress tests with concurrent requests -- `TestPerformanceBenchmarks` - Performance benchmark tests -- `TestMemoryDumper` - Tests for MemoryDumper class -- `TestPrometheusMetrics` - Tests for PrometheusMetrics class diff --git a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/proposal.md b/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/proposal.md deleted file mode 100644 index 8a56b24..0000000 --- a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/proposal.md +++ /dev/null @@ -1,28 +0,0 @@ -# Change: Fix OCR Track Table Empty Columns and Alignment - -## Why - -PP-Structure-generated tables often contain blank columns (columns where every row is empty or whitespace-only), which produces empty columns and misaligned cells in the converted UnifiedDocument tables. The OCR Track currently uses the raw data without any cleanup, degrading PDF/JSON/Markdown output quality. - -## What Changes - -- Add a `trim_empty_columns()` function to clean empty columns from OCR Track tables -- Call the cleanup logic at the entry of `_convert_table_data` to ensure TableData is clean -- Recalculate col_span: if a span crosses a removed column, shrink the span -- Update the columns/cols values and adjust each cell's col index -- Optional: sort columns for alignment by bbox x0 - -## Impact - -- Affected specs: `ocr-processing` -- Affected code: - - `backend/app/services/ocr_to_unified_converter.py` (primary changes) -- The Direct/HYBRID paths are not affected -- PDF/JSON/Markdown output will be cleaner - -## Constraints - -- Keep table bbox and page coordinates unchanged -- Do not modify the Direct/HYBRID paths -- Only remove columns that are empty in every row; if the header is empty but data rows have values, the column must not be removed -- Preserve the original bbox to avoid layout drift in the PDF diff --git a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/specs/ocr-processing/spec.md deleted file mode 100644 index a28cc4f..0000000 --- a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/specs/ocr-processing/spec.md +++ /dev/null @@ -1,61 +0,0 @@ -## ADDED Requirements - -### Requirement: OCR Table Empty Column Cleanup - -The OCR Track converter SHALL clean up PP-Structure generated tables by removing
columns where all rows have empty or whitespace-only content. - -The system SHALL: -1. Identify columns where every cell's content is empty or contains only whitespace (using `.strip()` to determine emptiness) -2. Remove identified empty columns from the table structure -3. Update the `columns`/`cols` value to reflect the new column count -4. Recalculate each cell's `col` index to maintain continuity -5. Adjust `col_span` values when spans cross removed columns (shrink span size) -6. Remove cells entirely when their complete span falls within removed columns -7. Preserve original bbox and page coordinates (no layout drift) -8. If `columns` is 0 or missing after cleanup, fill with the calculated column count - -The cleanup SHALL NOT: -- Remove columns where the header is empty but data rows contain values -- Modify tables in Direct or HYBRID track -- Alter the original bbox coordinates - -#### Scenario: All rows in column are empty -- **WHEN** a table has a column where all cells contain only empty or whitespace content -- **THEN** that column is removed -- **AND** remaining cells have their `col` indices decremented appropriately -- **AND** `cols` count is reduced by 1 - -#### Scenario: Column has empty header but data has values -- **WHEN** a table has a column where the header cell is empty -- **AND** at least one data row cell in that column contains non-whitespace content -- **THEN** that column is NOT removed - -#### Scenario: Cell span crosses removed column -- **WHEN** a cell has `col_span > 1` -- **AND** one or more columns within the span are removed -- **THEN** the `col_span` is reduced by the number of removed columns within the span - -#### Scenario: Cell span entirely within removed columns -- **WHEN** a cell's entire span falls within columns that are all removed -- **THEN** that cell is removed from the table - -#### Scenario: Missing columns metadata -- **WHEN** the table dict has `columns` set to 0 or missing -- **AFTER** cleanup is performed -- **THEN** `columns` is set to the calculated number of remaining columns - -### Requirement: OCR Table Column Alignment by Bbox - -(Optional Enhancement) When bbox coordinates are available for table cells, the OCR Track converter SHALL use cell bbox x0 coordinates to improve column alignment accuracy. - -The system SHALL: -1. Sort cells by bbox `x0` coordinate before assigning column indices -2. Reassign `col` indices based on spatial position rather than HTML order - -This requirement is optional and implementation MAY be deferred if bbox data is not reliably available. - -#### Scenario: Cells reordered by bbox position -- **WHEN** bbox coordinates are available for table cells -- **AND** the original HTML order does not match spatial order -- **THEN** cells are reordered by `x0` coordinate -- **AND** `col` indices are reassigned to reflect spatial positioning diff --git a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/tasks.md b/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/tasks.md deleted file mode 100644 index cfa583f..0000000 --- a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/tasks.md +++ /dev/null @@ -1,43 +0,0 @@ -# Tasks: Fix OCR Track Table Empty Columns - -## 1. 
Core Implementation - -- [x] 1.1 Implement `trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]` in `ocr_to_unified_converter.py` - - Use the cells array to determine, per column, whether every row's content is empty/whitespace - - Use `.strip()` to detect whitespace-only content -- [x] 1.2 Implement column removal logic - - Update the columns/cols values - - Adjust each cell's col index -- [x] 1.3 Implement col_span recalculation logic - - If a span crosses a removed column, shrink the span - - If an entire span falls within removed columns, remove that cell -- [x] 1.4 Call `trim_empty_columns` at the entry of `_convert_table_data` - - Run the cleanup before building TableData - - Also apply the cleanup in `_extract_table_data` (HTML table parsing) -- [ ] 1.5 (Optional) Sort columns for alignment by bbox x0/x1 - - If a bbox grid is available, sort by x0 first and then reassign col indices - - Deferred until the availability of bbox data is confirmed - -## 2. Testing & Validation - -- [x] 2.1 Unit tests pass - - Test basic empty-column removal - - Test an empty header with data values (not removed) - - Test col_span crossing a removed column (span shrunk) - - Test a cell falling entirely within removed columns (cell removed) - - Test tables with no empty columns (unchanged) -- [x] 2.2 Check existing OCR results - - Existing results contain no tables with fully empty columns - - The implementation is in place and will clean empty columns correctly when they occur -- [x] 2.3 Confirm Direct/HYBRID tables are unchanged - - `OCRToUnifiedConverter` is only used in `ocr_service.py` - - The Direct track uses `DirectExtractionEngine` and is unaffected - -## 3. Edge Cases & Validation - -- [x] 3.1 Handle a columns value that is 0 or missing - - Backfill with the calculated column count to avoid breaking downstream consumers -- [x] 3.2 Handle an empty header with data values - - Only remove columns that are empty in every row -- [x] 3.3 Ensure `backend/storage/results/...` is not modified directly - - The converter is modified; tasks must be re-run to verify diff --git a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/design.md b/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/design.md deleted file mode 100644 index 6ff7380..0000000 --- a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/design.md +++ /dev/null @@ -1,173 +0,0 @@ -# Design: Fix OCR Track Table Data Format - -## Context - -The OCR processing pipeline has three modes: -1. **Direct Track**: Extracts structured data directly from native PDFs using `direct_extraction_engine.py` -2. **OCR Track**: Uses PP-StructureV3 for layout analysis and OCR, then converts results via `ocr_to_unified_converter.py` -3. **Hybrid Mode**: Uses Direct Track as primary, supplements with OCR Track for missing images only - -Both tracks produce `UnifiedDocument` containing `DocumentElement` objects. For tables, the `content` field should contain a `TableData` object with populated `cells` array. However, OCR Track currently produces `TableData` with empty `cells`, causing PDF generation failures. - -## Track Isolation Analysis (Safety Guarantee) - -This section documents why the proposed changes will NOT affect Direct Track or Hybrid Mode.
- -### Code Flow Analysis - -``` -┌─────────────────────────────────────────────────────────────────────────┐ -│ ocr_service.py │ -├─────────────────────────────────────────────────────────────────────────┤ -│ │ -│ Direct Track ──► DirectExtractionEngine ──► UnifiedDocument │ -│ (direct_extraction_engine.py) (tables: TableData ✓) │ -│ [NOT MODIFIED] │ -│ │ -│ OCR Track ────► PP-StructureV3 ──► OCRToUnifiedConverter ──► UnifiedDoc│ -│ (ocr_to_unified_converter.py) │ -│ [MODIFIED: _extract_table_data] │ -│ │ -│ Hybrid Mode ──► Direct Track (primary) + OCR Track (images only) │ -│ │ │ │ -│ │ └──► _merge_ocr_images_into_ │ -│ │ direct() merges ONLY: │ -│ │ - ElementType.FIGURE │ -│ │ - ElementType.IMAGE │ -│ │ - ElementType.LOGO │ -│ │ [Tables NOT merged] │ -│ └──► Tables come from Direct Track (unchanged) │ -└─────────────────────────────────────────────────────────────────────────┘ -``` - -### Evidence from ocr_service.py - -**Line 1610** (Hybrid mode merge logic): -```python -image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO} -``` - -**Lines 1634-1635** (Only image types are merged): -```python -for element in ocr_page.elements: - if element.type in image_types: # Tables excluded -``` - -### Impact Matrix - -| Mode | Table Source | Uses OCRToUnifiedConverter? | Affected by Change? | -|------|--------------|----------------------------|---------------------| -| Direct Track | `DirectExtractionEngine` | No | **No** | -| OCR Track | `OCRToUnifiedConverter` | Yes | **Yes (Fixed)** | -| Hybrid Mode | `DirectExtractionEngine` (tables) | Only for images | **No** | - -### Conclusion - -The fix is **isolated to OCR Track only**: -- Direct Track: Uses separate engine (`DirectExtractionEngine`), completely unaffected -- Hybrid Mode: Tables come from Direct Track; OCR Track is only used for image extraction -- OCR Track: Will benefit from the fix with proper `TableData` output - -## Goals / Non-Goals - -### Goals -- OCR Track table output format matches Direct Track format exactly -- PDF Generator receives consistent `TableData` objects from both tracks -- Robust HTML table parsing that handles real-world OCR output - -### Non-Goals -- Modifying Direct Track behavior (it's the reference implementation) -- Changing the `TableData` or `TableCell` data models -- Modifying PDF Generator to handle HTML strings as a workaround - -## Decisions - -### Decision 1: Use BeautifulSoup for HTML Parsing - -**Rationale**: The current regex/string-counting approach is fragile and cannot extract cell content. BeautifulSoup provides: -- Robust handling of malformed HTML (common in OCR output) -- Easy extraction of cell content, attributes (rowspan, colspan) -- Well-tested library already used in many Python projects - -**Alternatives considered**: -- Manual regex parsing: Too fragile for complex tables -- lxml: More complex API, overkill for this use case -- html.parser (stdlib): Less tolerant of malformed HTML - -### Decision 2: Maintain Backward Compatibility - -**Rationale**: If BeautifulSoup parsing fails, fall back to current behavior (return `TableData` with basic row/col counts). This ensures existing functionality isn't broken. - -### Decision 3: Single Point of Change - -**Rationale**: Only modify `ocr_to_unified_converter.py`. 
This: -- Minimizes regression risk -- Keeps Direct Track untouched as reference -- Requires no changes to downstream PDF Generator - -## Implementation Approach - -```python -def _extract_table_data(self, elem_data: Dict) -> Optional[TableData]: - """Extract table data from element using BeautifulSoup.""" - try: - html = elem_data.get('html', '') or elem_data.get('content', '') - if not html or '<table' not in html: - return None - - soup = BeautifulSoup(html, 'html.parser') - table = soup.find('table') - if table is None: - return None - - rows = table.find_all('tr') - cells = [] - headers = [] - for row_idx, row in enumerate(rows): - for col_idx, cell in enumerate(row.find_all(['td', 'th'])): - cell_content = cell.get_text(strip=True) - cells.append(TableCell( - row=row_idx, - col=col_idx, - content=cell_content, - row_span=int(cell.get('rowspan', 1)), - col_span=int(cell.get('colspan', 1)), - )) - # Header row cells or explicit <th> elements - if row_idx == 0 or cell.name == 'th': - headers.append(cell_content) - - return TableData( - rows=len(rows), - cols=max(len(row.find_all(['td', 'th'])) for row in rows) if rows else 0, - cells=cells, - headers=headers if headers else None - ) - except Exception as e: - logger.warning(f"Failed to parse HTML table: {e}") - return None # Fallback handled by caller -``` - -## Risks / Trade-offs - -| Risk | Mitigation | -|------|------------| -| BeautifulSoup not installed | Add to requirements.txt; it's already a common dependency | -| Malformed HTML causes parsing errors | Use try/except with fallback to current behavior | -| Performance impact from HTML parsing | Minimal; tables are small; BeautifulSoup is fast | -| Complex rowspan/colspan calculations | Start with simple col tracking; enhance if needed | - -## Dependencies - -- `beautifulsoup4`: Already commonly available, add to requirements.txt if not present - -## Open Questions - -- Q: Should we preserve the original HTML in metadata for debugging? - - A: Optional enhancement; not required for initial fix diff --git a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/proposal.md b/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/proposal.md deleted file mode 100644 index 6e89cfa..0000000 --- a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/proposal.md +++ /dev/null @@ -1,45 +0,0 @@ -# Change: Fix OCR Track Table Data Format to Match Direct Track - -## Why - -OCR Track produces HTML strings for table content instead of structured `TableData` objects, causing PDF generation to render raw HTML code as plain text. Direct Track correctly produces `TableData` objects with populated `cells` array, resulting in proper table rendering. This inconsistency creates poor user experience when using OCR Track for documents containing tables.
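To make the mismatch concrete, here is a hypothetical pair of element contents (the import path and cell values are assumptions; the `TableData`/`TableCell` fields follow the models referenced throughout this change):

```python
from app.models.unified_document import TableData, TableCell  # import path assumed

# OCR Track today: content is the raw PP-StructureV3 HTML string
ocr_content = "<table><tr><td>Qty</td><td>Price</td></tr></table>"

# Direct Track (reference behavior): structured TableData with populated cells
direct_content = TableData(
    rows=1,
    cols=2,
    cells=[
        TableCell(row=0, col=0, content="Qty"),
        TableCell(row=0, col=1, content="Price"),
    ],
)
```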
- -## What Changes - -- **Enhance `_extract_table_data` method** in `ocr_to_unified_converter.py` to properly parse HTML tables into structured `TableData` objects with populated `TableCell` arrays -- **Add BeautifulSoup-based HTML table parsing** to robustly extract cell content, row/column spans from OCR-generated HTML tables -- **Ensure format consistency** between OCR Track and Direct Track table output, allowing PDF Generator to handle a single standardized format - -## Impact - -- Affected specs: `ocr-processing` -- Affected code: - - `backend/app/services/ocr_to_unified_converter.py` (primary changes) - - `backend/app/services/pdf_generator_service.py` (no changes needed - already handles `TableData`) - - `backend/app/services/direct_extraction_engine.py` (no changes - serves as reference implementation) - -## Evidence - -### Direct Track (Reference - Correct Behavior) -`direct_extraction_engine.py:846-850`: -```python -table_data = TableData( - rows=len(data), - cols=max(len(row) for row in data) if data else 0, - cells=cells, # Properly populated with TableCell objects - headers=data[0] if data else None -) -``` - -### OCR Track (Current - Problematic) -`ocr_to_unified_converter.py:574-579`: -```python -return TableData( - rows=rows, # Only counts <tr> from html.count('<tr') - cols=cols, # Counts <td>/<th> in first row - cells=cells, # Always empty list [] - caption=extracted_text -) -``` - -The `cells` array is always empty because the current HTML parsing only counts tags but doesn't extract actual cell content. diff --git a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/specs/ocr-processing/spec.md deleted file mode 100644 index ebc701c..0000000 --- a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/specs/ocr-processing/spec.md +++ /dev/null @@ -1,51 +0,0 @@ -## ADDED Requirements - -### Requirement: OCR Track Table Data Structure Consistency -The OCR Track SHALL produce `TableData` objects with fully populated `cells` arrays that match the format produced by Direct Track, ensuring consistent table rendering across both processing tracks.
- -#### Scenario: OCR Track produces structured TableData for HTML tables -- **GIVEN** a document with tables is processed via OCR Track -- **WHEN** PP-StructureV3 returns HTML table content in the `html` or `content` field -- **THEN** the `ocr_to_unified_converter` SHALL parse the HTML and produce a `TableData` object -- **AND** the `TableData.cells` array SHALL contain `TableCell` objects for each cell -- **AND** each `TableCell` SHALL have correct `row`, `col`, and `content` values -- **AND** the output format SHALL match Direct Track's `TableData` structure - -#### Scenario: OCR Track handles tables with merged cells -- **GIVEN** an HTML table with `rowspan` or `colspan` attributes -- **WHEN** the table is converted to `TableData` -- **THEN** each `TableCell` SHALL have correct `row_span` and `col_span` values -- **AND** the cell content SHALL be correctly extracted - -#### Scenario: OCR Track handles header rows -- **GIVEN** an HTML table with `<th>` elements or a header row -- **WHEN** the table is converted to `TableData` -- **THEN** the `TableData.headers` field SHALL contain the header cell contents -- **AND** header cells SHALL also be included in the `cells` array - -#### Scenario: OCR Track gracefully handles malformed HTML tables -- **GIVEN** an HTML table with malformed markup (missing closing tags, invalid nesting) -- **WHEN** parsing is attempted -- **THEN** the system SHALL attempt best-effort parsing using a tolerant HTML parser -- **AND** if parsing fails completely, SHALL fall back to returning basic `TableData` with row/col counts -- **AND** SHALL log a warning for debugging purposes - -#### Scenario: PDF Generator renders OCR Track tables correctly -- **GIVEN** a `UnifiedDocument` from OCR Track containing table elements -- **WHEN** the PDF Generator processes the document -- **THEN** tables SHALL be rendered as formatted tables (not as raw HTML text) -- **AND** the rendering SHALL be identical to Direct Track table rendering - -#### Scenario: Direct Track table processing remains unchanged -- **GIVEN** a native PDF with embedded tables -- **WHEN** the document is processed via Direct Track -- **THEN** the `DirectExtractionEngine` SHALL continue to produce `TableData` objects as before -- **AND** the `ocr_to_unified_converter.py` changes SHALL NOT affect Direct Track processing -- **AND** table rendering in PDF output SHALL be identical to pre-fix behavior - -#### Scenario: Hybrid Mode table source isolation -- **GIVEN** a document processed via Hybrid Mode (Direct Track primary + OCR Track for images) -- **WHEN** the system merges OCR Track results into Direct Track results -- **THEN** only image elements (FIGURE, IMAGE, LOGO) SHALL be merged from OCR Track -- **AND** table elements SHALL exclusively come from Direct Track -- **AND** no OCR Track table data SHALL contaminate the final output diff --git a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/tasks.md b/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/tasks.md deleted file mode 100644 index df303f8..0000000 --- a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/tasks.md +++ /dev/null @@ -1,43 +0,0 @@ -# Tasks: Fix OCR Track Table Data Format - -## 1.
Implementation - -- [x] 1.1 Add BeautifulSoup import and dependency check in `ocr_to_unified_converter.py` -- [x] 1.2 Rewrite `_extract_table_data` method to parse HTML using BeautifulSoup -- [x] 1.3 Extract cell content, row index, column index for each `<td>` and `<th>` element -- [x] 1.4 Handle `rowspan` and `colspan` attributes for merged cells -- [x] 1.5 Create `TableCell` objects with proper content and positioning -- [x] 1.6 Populate `TableData.cells` array with extracted `TableCell` objects -- [x] 1.7 Preserve header detection (`<th>` elements) and store in `TableData.headers` - -## 2. Edge Case Handling - -- [x] 2.1 Handle malformed HTML tables gracefully (missing closing tags, nested tables) -- [x] 2.2 Handle empty cells (create TableCell with empty string content) -- [x] 2.3 Handle tables without `<tr>` structure (fallback to current behavior) -- [x] 2.4 Log warnings for unparseable tables instead of failing silently - -## 3. Testing - -- [x] 3.1 Create unit tests for `_extract_table_data` with various HTML table formats -- [x] 3.2 Test simple tables (basic rows/columns) -- [x] 3.3 Test tables with merged cells (rowspan/colspan) -- [x] 3.4 Test tables with header rows (`<th>` elements) -- [x] 3.5 Test malformed HTML tables (handled via BeautifulSoup's tolerance) -- [ ] 3.6 Integration test: OCR Track PDF generation with tables - -## 4. Verification (Track Isolation) - -- [x] 4.1 Compare OCR Track table output format with Direct Track output format -- [ ] 4.2 Verify PDF Generator renders OCR Track tables correctly -- [x] 4.3 **Direct Track regression test**: `direct_extraction_engine.py` NOT modified (confirmed via git status) -- [x] 4.4 **Hybrid Mode regression test**: `ocr_service.py` NOT modified, image merge logic unchanged -- [x] 4.5 **OCR Track fix verification**: Unit tests confirm: - - `TableData.cells` array is populated (6 cells in 3x2 table) - - `TableCell` objects have correct row/col/content values - - Headers extracted correctly -- [x] 4.6 Verify `DirectExtractionEngine` code is NOT modified (isolation check - confirmed) - -## 5. Dependencies - -- [x] 5.1 Add `beautifulsoup4>=4.12.0` to `requirements.txt` diff --git a/openspec/changes/archive/2025-11-27-add-layout-preprocessing/design.md b/openspec/changes/archive/2025-11-27-add-layout-preprocessing/design.md deleted file mode 100644 index 979b62c..0000000 --- a/openspec/changes/archive/2025-11-27-add-layout-preprocessing/design.md +++ /dev/null @@ -1,192 +0,0 @@ -# Design: Layout Detection Image Preprocessing - -## Context - -PP-StructureV3's layout detection model (PP-DocLayout_plus-L) sometimes fails to detect tables with faint lines or low contrast. This is a preprocessing problem - the model can detect tables when lines are clearly visible, but struggles with poor quality scans or documents with light-colored borders.
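The pipeline sketched later in this design calls `apply_clahe` and `apply_unsharp_mask` without defining them; a plausible OpenCV implementation of those two helpers might look like the following (clip limit, tile size, and sharpening strength are illustrative assumptions, not values taken from the codebase):

```python
import cv2
import numpy as np

def apply_clahe(image_bgr: np.ndarray, clip_limit: float = 2.0) -> np.ndarray:
    """Adaptive contrast enhancement applied to the luminance channel only."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8))
    enhanced = clahe.apply(l)  # clip limit prevents over-amplifying bright regions
    return cv2.cvtColor(cv2.merge((enhanced, a, b)), cv2.COLOR_LAB2BGR)

def apply_unsharp_mask(image_bgr: np.ndarray, strength: float = 1.5) -> np.ndarray:
    """Sharpen edges by subtracting a Gaussian-blurred copy (unsharp masking)."""
    blurred = cv2.GaussianBlur(image_bgr, (0, 0), sigmaX=3)
    return cv2.addWeighted(image_bgr, 1 + strength, blurred, -strength, 0)
```

Operating on the LAB luminance channel keeps colors stable, which makes the before/after preview comparison easier to read.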
- -### Current Flow -``` -Original Image → PP-Structure (layout detection) → Element Recognition - ↓ - Returns element bboxes - ↓ - Image extraction crops from original -``` - -### Proposed Flow -``` -Original Image → Preprocess → PP-Structure (layout detection) → Element Recognition - ↓ - Returns element bboxes - ↓ -Original Image ← ← ← ← Image extraction crops from original (NOT preprocessed) -``` - -## Goals / Non-Goals - -### Goals -- Improve table detection for documents with faint lines -- Preserve original image quality for element extraction -- **Hybrid control**: Auto mode by default, manual override available -- **Preview capability**: Users can verify preprocessing before processing -- Minimal performance impact - -### Non-Goals -- Preprocessing for text recognition (Raw OCR handles this separately) -- Modifying how PP-Structure internally processes images -- General image quality improvement (out of scope) -- Real-time preview during processing (preview is pre-processing only) - -## Decisions - -### Decision 1: Preprocess only for layout detection input -**Rationale**: -- Layout detection needs enhanced edges/contrast to identify regions -- Image element extraction needs original quality for output -- Raw OCR text recognition works independently and doesn't need preprocessing - -### Decision 2: Use CLAHE (Contrast Limited Adaptive Histogram Equalization) as default -**Rationale**: -- CLAHE prevents over-amplification in already bright areas -- Adaptive nature handles varying background regions -- Well-supported by OpenCV - -**Alternatives considered**: -- Global histogram equalization: Too aggressive, causes artifacts -- Manual brightness/contrast: Not adaptive to document variations - -### Decision 3: Preprocessing is applied in-memory, not saved to disk -**Rationale**: -- Preprocessed image is only needed during PP-Structure call -- Saving would increase storage and I/O overhead -- Original image is already saved and used for extraction - -### Decision 4: Sharpening via Unsharp Mask -**Rationale**: -- Enhances edges without introducing noise -- Helps make faint table borders more detectable -- Configurable strength - -### Decision 5: Hybrid Control Mode (Auto + Manual) -**Rationale**: -- Auto mode provides seamless experience for most users -- Manual mode gives power users fine control -- Preview allows verification before committing to processing - -**Auto-detection algorithm**: -```python -def analyze_image_quality(image: np.ndarray) -> dict: - gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) - - # Contrast: standard deviation of pixel values - contrast = np.std(gray) - - # Edge strength: mean of Sobel gradient magnitude - sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3) - sobel_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3) - edge_strength = np.mean(np.sqrt(sobel_x**2 + sobel_y**2)) - - return { - "contrast": contrast, - "edge_strength": edge_strength, - "recommended": { - "contrast": "clahe" if contrast < 40 else "none", - "sharpen": edge_strength < 15, - "binarize": contrast < 20 - } - } -``` - -### Decision 6: Preview API Design -**Rationale**: -- Users should see preprocessing effect before full processing -- Reduces trial-and-error cycles -- Builds user confidence in the system - -**API Design**: -``` -POST /api/v2/tasks/{task_id}/preview/preprocessing -Request: -{ - "page": 1, - "mode": "auto", // or "manual" - "config": { // only for manual mode - "contrast": "clahe", - "sharpen": true, - "binarize": false - } -} - -Response: -{ - "original_url": 
"/api/v2/tasks/{id}/pages/1/image", - "preprocessed_url": "/api/v2/tasks/{id}/pages/1/image?preprocessed=true", - "quality_metrics": { - "contrast": 35.2, - "edge_strength": 12.8 - }, - "auto_config": { - "contrast": "clahe", - "sharpen": true, - "binarize": false - } -} -``` - -## Implementation Details - -### Preprocessing Pipeline -```python -def enhance_for_layout_detection(image: Image.Image, config: Settings) -> Image.Image: - """Enhance image for better layout detection.""" - - # Step 1: Contrast enhancement - if config.layout_preprocessing_contrast == "clahe": - image = apply_clahe(image) - elif config.layout_preprocessing_contrast == "histogram": - image = apply_histogram_equalization(image) - - # Step 2: Sharpening (optional) - if config.layout_preprocessing_sharpen: - image = apply_unsharp_mask(image) - - # Step 3: Binarization (optional, aggressive) - if config.layout_preprocessing_binarize: - image = apply_adaptive_threshold(image) - - return image -``` - -### Integration Point -```python -# In ocr_service.py, before calling PP-Structure -if settings.layout_preprocessing_enabled: - preprocessed_image = enhance_for_layout_detection(page_image, settings) - pp_input = preprocessed_image -else: - pp_input = page_image - -# PP-Structure gets preprocessed (or original if disabled) -layout_results = self.structure_engine(pp_input) - -# Image extraction still uses original -for element in layout_results: - if element.type == "image": - crop_image_from_original(page_image, element.bbox) # Use original! -``` - -## Risks / Trade-offs - -| Risk | Mitigation | -|------|------------| -| Performance overhead | Preprocessing is fast (~50ms/page), enable/disable option | -| Over-enhancement artifacts | CLAHE clip limit prevents over-saturation, configurable | -| Memory spike for large images | Process one page at a time, discard preprocessed after use | - -## Open Questions - -1. Should binarization be applied before or after CLAHE? - - Current: After (enhances contrast first, then binarize if needed) - -2. Should preprocessing parameters be tunable per-request or only server-wide? - - Current: Server-wide config only (simpler) diff --git a/openspec/changes/archive/2025-11-27-add-layout-preprocessing/proposal.md b/openspec/changes/archive/2025-11-27-add-layout-preprocessing/proposal.md deleted file mode 100644 index b0e459a..0000000 --- a/openspec/changes/archive/2025-11-27-add-layout-preprocessing/proposal.md +++ /dev/null @@ -1,74 +0,0 @@ -# Change: Add Image Preprocessing for Layout Detection - -## Why - -PP-StructureV3's layout detection (PP-DocLayout_plus-L) sometimes fails to detect tables with faint lines, low contrast borders, or poor scan quality. This results in missing table elements in the output, even when the table structure recognition models (SLANeXt) are correctly configured. - -The root cause is that layout detection happens **before** table structure recognition - if a region isn't identified as a "table" in the layout detection stage, the table recognition models never get invoked. 
- -## What Changes - -- **Add image preprocessing module** for layout detection input - - Contrast enhancement (histogram equalization, CLAHE) - - Optional binarization (adaptive thresholding) - - Sharpening for faint lines - -- **Preserve original images for extraction** - - Preprocessing ONLY affects layout detection input - - Image element extraction continues to use original (preserves quality) - - Raw OCR continues to use original image - -- **Hybrid control mode** (Auto + Manual) - - **Auto mode (default)**: Analyze image quality and auto-select parameters - - Calculate contrast level (standard deviation) - - Detect edge clarity for faint lines - - Apply appropriate preprocessing based on analysis - - **Manual mode**: User can override with specific settings - - Contrast: none / histogram / clahe - - Sharpen: on/off - - Binarize: on/off - -- **Frontend preview API** - - Preview endpoint to show original vs preprocessed comparison - - Users can verify settings before processing - -## Impact - -### Affected Specs -- `ocr-processing` - New preprocessing configuration requirements - -### Affected Code -- `backend/app/services/ocr_service.py` - Add preprocessing before PP-Structure -- `backend/app/core/config.py` - New preprocessing configuration options -- `backend/app/services/preprocessing_service.py` - New service (to be created) -- `backend/app/api/v2/endpoints/preview.py` - New preview API endpoint -- `frontend/src/components/PreprocessingSettings.tsx` - New UI component - -### Track Impact Analysis - -| Track | Impact | Reason | -|-------|--------|--------| -| OCR | Improved layout detection | Preprocessing enhances PP-Structure input | -| Hybrid | Potentially improved | Uses PP-Structure for layout | -| Direct | No impact | Does not use PP-Structure | -| Raw OCR | No impact | Continues using original image | - -### Quality Impact - -| Component | Impact | Reason | -|-----------|--------|--------| -| Table detection | Improved | Enhanced contrast reveals faint borders | -| Image extraction | No change | Uses original image for quality | -| Text recognition | No change | Raw OCR uses original image | -| Reading order | Improved | Better element detection → better ordering | - -## Risks - -1. **Performance overhead**: Preprocessing adds compute time per page - - Mitigation: Make preprocessing optional, cache preprocessed images - -2. **Over-processing**: Strong enhancement may introduce artifacts - - Mitigation: Configurable intensity levels, default to moderate enhancement - -3. **Memory usage**: Keeping both original and preprocessed images - - Mitigation: Preprocessed image is temporary, discarded after layout detection diff --git a/openspec/changes/archive/2025-11-27-add-layout-preprocessing/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-27-add-layout-preprocessing/specs/ocr-processing/spec.md deleted file mode 100644 index 143bf26..0000000 --- a/openspec/changes/archive/2025-11-27-add-layout-preprocessing/specs/ocr-processing/spec.md +++ /dev/null @@ -1,128 +0,0 @@ -## ADDED Requirements - -### Requirement: Layout Detection Image Preprocessing - -The system SHALL provide optional image preprocessing to enhance layout detection accuracy for documents with faint lines, low contrast, or poor scan quality. 
- -#### Scenario: Preprocessing improves table detection -- **GIVEN** a document with faint table borders that PP-Structure fails to detect -- **WHEN** layout preprocessing is enabled -- **THEN** the system SHALL preprocess the image before layout detection -- **AND** contrast enhancement SHALL make faint lines more visible -- **AND** PP-Structure SHALL receive the preprocessed image for layout detection - -#### Scenario: Image element extraction uses original quality -- **GIVEN** an image element detected by PP-Structure from preprocessed input -- **WHEN** the system extracts the image element -- **THEN** the system SHALL crop from the ORIGINAL image, not the preprocessed version -- **AND** the extracted image SHALL maintain original quality and colors - -#### Scenario: CLAHE contrast enhancement -- **WHEN** `layout_preprocessing_contrast` is set to "clahe" -- **THEN** the system SHALL apply Contrast Limited Adaptive Histogram Equalization -- **AND** the enhancement SHALL not over-saturate already bright regions - -#### Scenario: Sharpening enhances faint lines -- **WHEN** `layout_preprocessing_sharpen` is enabled -- **THEN** the system SHALL apply unsharp masking to enhance edges -- **AND** faint table borders SHALL become more detectable - -#### Scenario: Optional binarization for extreme cases -- **WHEN** `layout_preprocessing_binarize` is enabled -- **THEN** the system SHALL apply adaptive thresholding -- **AND** this SHALL be used only for documents with very poor contrast - -### Requirement: Preprocessing Hybrid Control Mode - -The system SHALL support three preprocessing modes: automatic, manual, and disabled, with automatic as the default. - -#### Scenario: Auto mode analyzes image quality -- **GIVEN** preprocessing mode is set to "auto" -- **WHEN** processing begins for a page -- **THEN** the system SHALL analyze image quality metrics (contrast, edge strength) -- **AND** automatically determine optimal preprocessing parameters -- **AND** apply recommended settings without user intervention - -#### Scenario: Auto mode detects low contrast -- **GIVEN** preprocessing mode is "auto" -- **WHEN** image contrast (standard deviation) is below 40 -- **THEN** the system SHALL automatically enable CLAHE contrast enhancement - -#### Scenario: Auto mode detects faint edges -- **GIVEN** preprocessing mode is "auto" -- **WHEN** image edge strength (Sobel gradient mean) is below 15 -- **THEN** the system SHALL automatically enable sharpening - -#### Scenario: Manual mode uses user-specified settings -- **GIVEN** preprocessing mode is set to "manual" -- **WHEN** processing begins -- **THEN** the system SHALL use the user-provided preprocessing configuration -- **AND** ignore automatic quality analysis - -#### Scenario: Disabled mode skips preprocessing -- **GIVEN** preprocessing mode is set to "disabled" -- **WHEN** processing begins -- **THEN** the system SHALL skip all preprocessing -- **AND** PP-Structure SHALL receive the original image directly - -### Requirement: Preprocessing Preview API - -The system SHALL provide a preview endpoint that allows users to compare original and preprocessed images before processing. 
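A minimal FastAPI sketch of such an endpoint, following the request/response shape given in the design (`load_page_image` and `render_preprocessed_preview` are hypothetical helpers; only `analyze_image_quality` is defined earlier in this document):

```python
from typing import Optional
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

class PreviewRequest(BaseModel):
    page: int = 1
    mode: str = "auto"             # "auto" or "manual"
    config: Optional[dict] = None  # manual-mode overrides

@router.post("/api/v2/tasks/{task_id}/preview/preprocessing")
async def preview_preprocessing(task_id: str, req: PreviewRequest):
    image = load_page_image(task_id, req.page)           # hypothetical helper
    metrics = analyze_image_quality(image)               # from the design doc
    cfg = req.config if (req.mode == "manual" and req.config) else metrics["recommended"]
    render_preprocessed_preview(task_id, req.page, cfg)  # hypothetical helper
    return {
        "original_url": f"/api/v2/tasks/{task_id}/pages/{req.page}/image",
        "preprocessed_url": f"/api/v2/tasks/{task_id}/pages/{req.page}/image?preprocessed=true",
        "quality_metrics": {
            "contrast": float(metrics["contrast"]),
            "edge_strength": float(metrics["edge_strength"]),
        },
        "auto_config": metrics["recommended"],
    }
```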
- -#### Scenario: Preview returns comparison images -- **GIVEN** a task with uploaded document -- **WHEN** user requests preprocessing preview for a specific page -- **THEN** the system SHALL return URLs or data for both original and preprocessed images -- **AND** user can visually compare the difference - -#### Scenario: Preview shows auto-detected settings -- **GIVEN** preview is requested with mode "auto" -- **WHEN** the system analyzes the page -- **THEN** the response SHALL include the auto-detected preprocessing configuration -- **AND** include quality metrics (contrast, edge_strength) - -#### Scenario: Preview accepts manual configuration -- **GIVEN** preview is requested with mode "manual" -- **WHEN** user provides specific preprocessing settings -- **THEN** the system SHALL apply those settings to generate preview -- **AND** return the preprocessed result for user verification - -### Requirement: Preprocessing Track Isolation - -The layout preprocessing feature SHALL only affect layout detection input without impacting other processing components. - -#### Scenario: Raw OCR is unaffected -- **GIVEN** layout preprocessing is enabled -- **WHEN** Raw OCR processing runs -- **THEN** Raw OCR SHALL use the original image -- **AND** text detection quality SHALL not be affected by preprocessing - -#### Scenario: Preprocessed image is temporary -- **GIVEN** an image is preprocessed for layout detection -- **WHEN** layout detection completes -- **THEN** the preprocessed image SHALL NOT be persisted to storage -- **AND** only the original image and element crops SHALL be saved - -### Requirement: Preprocessing Frontend UI - -The frontend SHALL provide a user interface for configuring and previewing preprocessing settings. - -#### Scenario: Mode selection is available -- **GIVEN** the user is configuring OCR track processing -- **WHEN** the preprocessing settings panel is displayed -- **THEN** the user SHALL be able to select mode: Auto (default), Manual, or Disabled -- **AND** Auto mode SHALL be pre-selected - -#### Scenario: Manual mode shows configuration options -- **GIVEN** the user selects Manual mode -- **WHEN** the settings panel updates -- **THEN** the user SHALL see options for: - - Contrast enhancement (None / Histogram / CLAHE) - - Sharpen toggle - - Binarize toggle - -#### Scenario: Preview button triggers comparison view -- **GIVEN** preprocessing settings are configured -- **WHEN** the user clicks Preview button -- **THEN** the system SHALL display side-by-side comparison of original and preprocessed images -- **AND** show detected quality metrics diff --git a/openspec/changes/archive/2025-11-27-add-layout-preprocessing/tasks.md b/openspec/changes/archive/2025-11-27-add-layout-preprocessing/tasks.md deleted file mode 100644 index c4bfbf4..0000000 --- a/openspec/changes/archive/2025-11-27-add-layout-preprocessing/tasks.md +++ /dev/null @@ -1,141 +0,0 @@ -# Tasks: Add Image Preprocessing for Layout Detection - -## 1. 
Configuration

- [x] 1.1 Add preprocessing configuration to `backend/app/core/config.py`
  - `layout_preprocessing_mode: str = "auto"` - Options: auto, manual, disabled
  - `layout_preprocessing_contrast: str = "clahe"` - Options: none, histogram, clahe
  - `layout_preprocessing_sharpen: bool = True` - Enable sharpening for faint lines
  - `layout_preprocessing_binarize: bool = False` - Optional binarization (aggressive)

- [x] 1.2 Add preprocessing schema to `backend/app/schemas/task.py`
  - `PreprocessingMode` enum: auto, manual, disabled
  - `PreprocessingConfig` schema for API request/response

## 2. Preprocessing Service

- [x] 2.1 Create `backend/app/services/layout_preprocessing_service.py`
  - Image loading utility (supports PIL, OpenCV)
  - Contrast enhancement methods (histogram equalization, CLAHE)
  - Sharpening filter for line enhancement
  - Optional adaptive binarization
  - Return preprocessed image as numpy array or PIL Image

- [x] 2.2 Implement `preprocess()` and `preprocess_to_pil()` functions
  - Input: Original image path or PIL Image + config
  - Output: Preprocessed image (same format as input) + PreprocessingResult
  - Steps: contrast → sharpen → (optional) binarize

- [x] 2.3 Implement `analyze_image_quality()` function (Auto mode)
  - Calculate contrast level (standard deviation of grayscale)
  - Detect edge clarity (Sobel gradient mean)
  - Return ImageQualityMetrics based on analysis
  - `get_auto_config()` returns PreprocessingConfig based on thresholds (see the sketch after this section):
    - Low contrast < 40: Apply CLAHE
    - Faint edges < 15: Apply sharpen
    - Very low contrast < 20: Consider binarize
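The thresholds in 2.3 translate into a small amount of analysis code. A minimal sketch of what `analyze_image_quality()` / `get_auto_config()` could look like, assuming OpenCV; the dataclass mirrors the task descriptions, and the dict returned here stands in for the real `PreprocessingConfig`:

```python
import cv2
import numpy as np
from dataclasses import dataclass

@dataclass
class ImageQualityMetrics:
    contrast: float       # std dev of grayscale intensities
    edge_strength: float  # mean Sobel gradient magnitude

def analyze_image_quality(image_bgr: np.ndarray) -> ImageQualityMetrics:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    contrast = float(gray.std())
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    edge_strength = float(np.sqrt(gx ** 2 + gy ** 2).mean())
    return ImageQualityMetrics(contrast, edge_strength)

def get_auto_config(m: ImageQualityMetrics) -> dict:
    # Thresholds from task 2.3: 40 (low contrast), 15 (faint edges), 20 (very low)
    return {
        "contrast": "clahe" if m.contrast < 40 else "none",
        "sharpen": m.edge_strength < 15,
        "binarize": m.contrast < 20,
    }
```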
## 3. Integration with OCR Service

- [x] 3.1 Update `backend/app/services/ocr_service.py`
  - Import the preprocessing service
  - Check preprocessing mode (auto/manual/disabled)
  - If auto: call `analyze_image_quality()` first
  - Before PP-Structure prediction, preprocess the image based on config
  - Pass the preprocessed PIL Image to PP-Structure for layout detection
  - Keep the original image reference for image extraction

- [x] 3.2 Update `backend/app/services/pp_structure_enhanced.py`
  - Add `preprocessed_image` parameter to `analyze_with_full_structure()`
  - When `preprocessed_image` is provided, convert it to a BGR numpy array and pass it to PP-Structure
  - Bbox coordinates from preprocessed detection are applied to the original image crop

- [x] 3.3 Update task start API to accept preprocessing options
  - Add `preprocessing_mode` parameter to ProcessingOptions
  - Add `preprocessing_config` for manual mode overrides

## 4. Preview API

- [x] 4.1 Create preview endpoints in `backend/app/routers/tasks.py`
  - `POST /api/v2/tasks/{task_id}/preview/preprocessing`
    - Input: page number, preprocessing mode/config
    - Output: PreprocessingPreviewResponse with:
      - Original image URL
      - Preprocessed image URL
      - Auto-detected config
      - Image quality metrics (contrast, edge_strength)
  - `GET /api/v2/tasks/{task_id}/preview/image` - Serve preview images

- [x] 4.2 Add preview router functionality
  - Integrated into the tasks router
  - Uses task authentication/authorization

## 5. Frontend UI

- [x] 5.1 Create `frontend/src/components/PreprocessingSettings.tsx`
  - Radio buttons with icons: Auto / Manual / Disabled
  - Manual mode shows:
    - Contrast dropdown: None / Histogram / CLAHE
    - Sharpen checkbox
    - Binarize checkbox (with warning)
  - Preview button integration (onPreview prop)

- [ ] 5.2 Create `frontend/src/components/PreprocessingPreview.tsx` (optional)
  - Side-by-side image comparison (original vs preprocessed)
  - Display detected quality metrics
  - Note: Preview functionality is available via the API; the UI modal is an optional enhancement

- [x] 5.3 Integrate with task start flow
  - Added PreprocessingSettings to ProcessingPage.tsx
  - Pass the selected config to the task start API
  - Note: localStorage preference storage is an optional enhancement

- [x] 5.4 Add i18n translations
  - `frontend/src/i18n/locales/zh-TW.json` - Traditional Chinese

## 6. Testing

- [x] 6.1 Unit tests for preprocessing_service
  - Validated imports and service creation
  - Tested `analyze_image_quality()` with test images
  - Tested `get_auto_config()` returns a sensible config
  - Tested `preprocess()` produces the correct output shape

- [ ] 6.2 Integration tests for preview API (optional)
  - Manual testing recommended with actual documents

- [ ] 6.3 End-to-end testing
  - Test OCR track with preprocessing modes (auto/manual/disabled)
  - Test with known problematic documents (faint table borders)

## 7. Documentation

- [x] 7.1 Update API documentation
  - Schemas documented in task.py with Field descriptions
  - Preview endpoint accessible via /docs

- [ ] 7.2 Add user guide section (optional)
  - When to use auto vs manual
  - How to interpret quality metrics

---

## Implementation Summary

**Backend commits:**
1. `feat: implement layout preprocessing backend` - Core service, OCR integration, preview API

**Frontend commits:**
1.
`feat: add preprocessing UI components and integration` - PreprocessingSettings, i18n, ProcessingPage integration

**Key files created/modified:**
- `backend/app/services/layout_preprocessing_service.py` (new)
- `backend/app/core/config.py` (updated)
- `backend/app/schemas/task.py` (updated)
- `backend/app/services/ocr_service.py` (updated)
- `backend/app/services/pp_structure_enhanced.py` (updated)
- `backend/app/routers/tasks.py` (updated)
- `frontend/src/components/PreprocessingSettings.tsx` (new)
- `frontend/src/types/apiV2.ts` (updated)
- `frontend/src/pages/ProcessingPage.tsx` (updated)
- `frontend/src/i18n/locales/zh-TW.json` (updated)

diff --git a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/design.md b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/design.md
deleted file mode 100644
index 0a2dcc9..0000000
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/design.md
+++ /dev/null
@@ -1,183 +0,0 @@
# Design: OCR Track Gap Filling

## Context

The PP-StructureV3 layout analysis model severely under-detects content on some scanned documents. In testing, Raw PaddleOCR detected 56 text regions while PP-StructureV3 output only 9 elements (an 84% loss).

The problem lies in the Layout Detection Model inside PP-StructureV3. This is a limitation of the PaddleOCR library and cannot be fixed from the outside. However, the `text_regions` data from Raw OCR remains complete and usable.

### Stakeholders
- **End users**: need complete OCR output, with no large-scale text loss
- **OCR track**: needs to merge Raw OCR and PP-StructureV3 results
- **Direct/Hybrid track**: must not be affected by this change

## Goals / Non-Goals

### Goals
- Detect regions missed by PP-StructureV3 and fill them back in with Raw OCR results
- Ensure filled-in text does not duplicate existing elements
- Maintain correct reading order
- Affect only the OCR track; other tracks keep their current behavior

### Non-Goals
- No changes to PP-StructureV3 or PaddleOCR internals
- No gap filling for non-text elements (images, tables, charts)
- No complex layout analysis (gap filling only)

## Decisions

### Decision 1: Coverage criterion
**Choice**: Prefer the center-point-inside test, supplemented by an IoU threshold

**Rationale**:
- The center-point test is cheap to compute and performs well
- The IoU threshold covers boundary cases
- A suggested IoU threshold of 0.1-0.2 avoids misclassifying low-IoU overlaps as uncovered

**Alternatives**:
- Pure IoU test: more expensive to compute, and partial overlaps are harder to handle
- Area-ratio test: unfair across regions of different sizes

### Decision 2: Activation condition
**Choice**: Activate when PP-Structure coverage < 70%, or when its element count is significantly below Raw OCR's

**Rationale**:
- Avoids duplicated text on normal documents
- The 70% threshold is empirical and adjustable via configuration
- The element-count comparison serves as a fast pre-check

### Decision 3: Element types to fill
**Choice**: Fill TEXT only; skip TABLE/IMAGE/FIGURE/FLOWCHART/HEADER/FOOTER

**Rationale**:
- PP-StructureV3 is usually accurate on structured elements (tables, images)
- Filling in raw OCR text could break table structure
- These elements must keep their structural integrity

### Decision 4: Duplicate detection and deduplication
**Choice**: Treat Raw OCR regions with IoU > 0.5 against a PP-Structure TEXT element as duplicates and skip them

**Rationale**:
- 0.5 is a common overlap threshold
- Prevents the same text from appearing twice
- Lightweight merging can be considered for fragmented Raw OCR boxes

### Decision 5: Coordinate alignment
**Choice**: Use `ocr_dimensions` for bbox conversion

**Rationale**:
- OCR may involve a resize step
- Ensures Raw OCR and PP-Structure coordinates share the same space
- Prevents coverage misdetection caused by mismatched dimensions

## Data Flow

```
┌─────────────────┐      ┌──────────────────────┐
│ Raw OCR Result  │      │ PP-StructureV3 Result│
│ (56 regions)    │      │ (9 elements)         │
└────────┬────────┘      └──────────┬───────────┘
         │                          │
         └────────────┬─────────────┘
                      │
         ┌────────────▼─────────────┐
         │     GapFillingService    │
         │ 1. Calculate coverage    │
         │ 2. Find uncovered regions│
         │ 3. Filter by confidence  │
         │ 4. Deduplicate           │
         │ 5. Merge if needed       │
         └────────────┬─────────────┘
                      │
         ┌────────────▼─────────────┐
         │   OCRToUnifiedConverter  │
         │ - Combine elements       │
         │ - Recalc reading order   │
         └────────────┬─────────────┘
                      │
         ┌────────────▼─────────────┐
         │     UnifiedDocument      │
         │   (complete content)     │
         └──────────────────────────┘
```

## Algorithm: Gap Detection

```python
def find_uncovered_regions(
    raw_ocr_regions: List[TextRegion],
    pp_structure_elements: List[Element],
    iou_threshold: float = 0.15
) -> List[TextRegion]:
    """
    Find Raw OCR regions not covered by PP-Structure elements.

    Coverage criteria (either one):
    1. Center point of raw region falls inside any PP-Structure bbox
    2. IoU with any PP-Structure bbox > iou_threshold
    """
    uncovered = []

    # Filter PP-Structure elements: only consider TEXT, skip TABLE/IMAGE/etc.
    text_elements = [e for e in pp_structure_elements
                     if e.type not in SKIP_TYPES]

    for region in raw_ocr_regions:
        center = get_center(region.bbox)
        is_covered = False

        for element in text_elements:
            # Check center point
            if point_in_bbox(center, element.bbox):
                is_covered = True
                break

            # Check IoU
            if calculate_iou(region.bbox, element.bbox) > iou_threshold:
                is_covered = True
                break

        if not is_covered:
            uncovered.append(region)

    return uncovered
```
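The pseudocode above references `get_center`, `point_in_bbox`, and `calculate_iou` without defining them, and the reading-order rule (sort by y0, then x0) recurs throughout this design. A minimal sketch of these geometry helpers, assuming bboxes are `(x0, y0, x1, y1)` tuples; the helper names come from the pseudocode, everything else is illustrative:

```python
from typing import List, Tuple

Bbox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def get_center(bbox: Bbox) -> Tuple[float, float]:
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def point_in_bbox(point: Tuple[float, float], bbox: Bbox) -> bool:
    x, y = point
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def calculate_iou(a: Bbox, b: Bbox) -> float:
    # Intersection rectangle
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def recalculate_reading_order(elements: List) -> List:
    # Top-to-bottom, then left-to-right, as the risk notes below describe
    return sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
```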
## Configuration Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `gap_filling_enabled` | bool | True | Whether gap filling is enabled |
| `gap_filling_coverage_threshold` | float | 0.7 | Activate when coverage falls below this value |
| `gap_filling_iou_threshold` | float | 0.15 | IoU threshold for the coverage test |
| `gap_filling_confidence_threshold` | float | 0.3 | Minimum Raw OCR confidence |
| `gap_filling_dedup_iou_threshold` | float | 0.5 | IoU threshold for deduplication |

## Risks / Trade-offs

### Risk 1: Gap filling duplicates text
**Mitigation**: Apply dedup_iou_threshold to deduplicate highly overlapping regions

### Risk 2: Broken reading order
**Mitigation**: After filling, recalculate reading_order for the whole page (sort by y0, then x0)

### Risk 3: Performance impact
**Mitigation**:
- Run a fast coverage check first; skip gap filling when coverage > 70%
- Use an R-tree or interval tree to speed up bbox queries (if performance becomes a bottleneck)

### Risk 4: Misaligned coordinates
**Mitigation**: Use `ocr_dimensions` to keep coordinate spaces consistent

## Migration Plan

1. The new feature is optional (enabled by default)
2. Gap filling can be turned off via configuration
3. No changes to existing API interfaces
4. Backward compatible: default behavior applies when no parameters are passed

## Open Questions

1. Do we need a UI toggle so users can enable/disable gap filling?
2. For fragmented Raw OCR boxes, do we need merge logic (same line, adjacent, very small gaps)?
3. Should output elements be tagged with their gap-filling origin (for debugging)?

diff --git a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/proposal.md b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/proposal.md
deleted file mode 100644
index 360c592..0000000
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/proposal.md
+++ /dev/null
@@ -1,30 +0,0 @@
# Change: Add OCR Track Gap Filling with Raw OCR Text Regions

## Why

The PP-StructureV3 layout analysis model severely under-detects content on some scanned documents, losing large amounts of text. Testing with scan.pdf showed:
- Raw PaddleOCR text recognition: detected **56 text regions**
- PP-StructureV3 layout analysis: output only **9 elements**
- Loss rate: roughly **84%** of the content was not recognized by PP-StructureV3

The root cause is that the Layout Detection Model inside PP-StructureV3 supports this class of scanned documents poorly; it is not a problem in our code. Raw OCR detects all text regions correctly, but that information is lost during PP-StructureV3's structuring step.

## What Changes

Implement a hybrid approach: use Raw OCR text regions to supplement content that PP-StructureV3 misses.

- **Add** a `GapFillingService` class that detects and fills text regions missed by PP-StructureV3
- **Add** coverage calculation logic (center-point containment or IoU threshold)
- **Add** automatic activation: when PP-Structure coverage < 70%, or when its element count is significantly below the Raw OCR box count
- **Modify** `OCRToUnifiedConverter` to integrate the gap filling logic
- **Add** reading_order recalculation (sort by y0, then x0)
- **Add** test cases: a severe PP-Structure under-detection case, and a normal document with no misses for validation

## Impact

- **Affected specs**: `ocr-processing`
- **Affected code**:
  - `backend/app/services/ocr_to_unified_converter.py` - integrate gap filling
  - `backend/app/services/gap_filling_service.py` - new (core logic)
  - `backend/tests/test_gap_filling.py` - new (tests)
- **Track isolation**: applies only to the OCR track; Direct/Hybrid tracks are unaffected

diff --git a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/specs/ocr-processing/spec.md
deleted file mode 100644
index 1eee6b7..0000000
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/specs/ocr-processing/spec.md
+++ /dev/null
@@ -1,111 +0,0 @@
## ADDED Requirements

### Requirement: OCR Track
Gap Filling with Raw OCR Regions - -The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected. - -#### Scenario: Gap filling activates when coverage is low -- **GIVEN** an OCR track processing task -- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions -- **THEN** the system SHALL activate gap filling -- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element -- **AND** supplement these regions as TEXT elements in the output - -#### Scenario: Coverage is determined by center-point and IoU -- **GIVEN** a Raw OCR text region with bounding box -- **WHEN** checking if the region is covered by PP-StructureV3 -- **THEN** the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox -- **OR** if IoU with any PP-StructureV3 element exceeds 0.15 threshold -- **AND** regions not meeting either criterion SHALL be marked as uncovered - -#### Scenario: Only TEXT elements are supplemented -- **GIVEN** uncovered Raw OCR regions identified for supplementation -- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements -- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements -- **AND** only supplement regions as TEXT type to preserve structural integrity - -#### Scenario: Supplemented regions meet confidence threshold -- **GIVEN** Raw OCR regions to be supplemented -- **WHEN** a region has confidence score below 0.3 -- **THEN** the system SHALL skip that region -- **AND** only supplement regions with confidence >= 0.3 - -#### Scenario: Deduplication prevents repeated text -- **GIVEN** a Raw OCR region being considered for supplementation -- **WHEN** the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element -- **THEN** the system SHALL skip that region to prevent duplicate text -- **AND** the original PP-StructureV3 element SHALL be preserved - -#### Scenario: Reading order is recalculated after gap filling -- **GIVEN** supplemented elements have been added to the page -- **WHEN** assembling the final element list -- **THEN** the system SHALL recalculate reading order for the entire page -- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right) -- **AND** ensure logical document flow is maintained - -#### Scenario: Coordinate alignment with ocr_dimensions -- **GIVEN** Raw OCR processing may involve image resizing -- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox -- **THEN** the system SHALL use ocr_dimensions to normalize coordinates -- **AND** ensure both sources reference the same coordinate space -- **AND** prevent coverage misdetection due to scale differences - -#### Scenario: Supplemented elements have complete metadata -- **GIVEN** a Raw OCR region being added as supplemented element -- **WHEN** creating the DocumentElement -- **THEN** the element SHALL include page_number -- **AND** include confidence score from Raw OCR -- **AND** include original bbox coordinates -- **AND** optionally include source indicator for debugging - -### Requirement: Gap Filling Track Isolation - -The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs. 
#### Scenario: Gap filling only activates for OCR track
- **GIVEN** a document processing task
- **WHEN** the processing track is OCR
- **THEN** the system SHALL evaluate and apply gap filling as needed
- **AND** produce enhanced output with supplemented content

#### Scenario: Direct track is unaffected
- **GIVEN** a document processing task with the Direct track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke any gap filling logic
- **AND** produce output identical to current Direct track behavior

#### Scenario: Hybrid track is unaffected
- **GIVEN** a document processing task with the Hybrid track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke gap filling logic
- **AND** use the existing Hybrid track processing pipeline

### Requirement: Gap Filling Configuration

The system SHALL provide configurable parameters for gap filling behavior.

#### Scenario: Gap filling can be disabled via configuration
- **GIVEN** gap_filling_enabled is set to false in configuration
- **WHEN** OCR track processing runs
- **THEN** the system SHALL skip all gap filling logic
- **AND** output only PP-StructureV3 results as before

#### Scenario: Coverage threshold is configurable
- **GIVEN** gap_filling_coverage_threshold is set to 0.8
- **WHEN** PP-StructureV3 coverage is 75%
- **THEN** the system SHALL activate gap filling
- **AND** supplement uncovered regions

#### Scenario: IoU thresholds are configurable
- **GIVEN** custom IoU thresholds configured:
  - gap_filling_iou_threshold: 0.2
  - gap_filling_dedup_iou_threshold: 0.6
- **WHEN** evaluating coverage and deduplication
- **THEN** the system SHALL use the configured values
- **AND** apply them consistently throughout the gap filling process

#### Scenario: Confidence threshold is configurable
- **GIVEN** gap_filling_confidence_threshold is set to 0.5
- **WHEN** supplementing Raw OCR regions
- **THEN** the system SHALL only include regions with confidence >= 0.5
- **AND** filter out lower confidence regions

diff --git a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/tasks.md b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/tasks.md
deleted file mode 100644
index 2e31126..0000000
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/tasks.md
+++ /dev/null
@@ -1,44 +0,0 @@
# Tasks: Add OCR Track Gap Filling

## 1. Core Implementation

- [x] 1.1 Create `gap_filling_service.py` with `GapFillingService` class
- [x] 1.2 Implement bbox coverage calculation (center-point and IoU methods)
- [x] 1.3 Implement gap detection logic (find uncovered raw OCR regions)
- [x] 1.4 Implement confidence threshold filtering for supplemented regions
- [x] 1.5 Implement element type filtering (only supplement TEXT, skip TABLE/IMAGE/FIGURE/etc.)
- [x] 1.6 Implement reading order recalculation (sort by y0, x0)
- [x] 1.7 Implement deduplication logic (skip high-IoU overlaps with PP-Structure TEXT)
- [x] 1.8 Implement optional text merging for fragmented adjacent regions

## 2. Integration

- [x] 2.1 Modify `OCRToUnifiedConverter` to accept raw OCR text_regions
- [x] 2.2 Add gap filling activation condition check (coverage < 70% or element count disparity; see the sketch below)
- [x] 2.3 Ensure coordinate alignment between raw OCR and PP-Structure (ocr_dimensions handling)
- [x] 2.4 Add page metadata (page_number, confidence, bbox) to supplemented elements
- [x] 2.5 Ensure track isolation (only OCR track, not Direct/Hybrid)
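Task 2.2's activation condition is compact enough to sketch. A minimal version, assuming coverage means the fraction of Raw OCR regions covered and using the defaults from section 3 below; the disparity ratio is illustrative, not a spec'd value:

```python
def should_fill_gaps(
    covered_regions: int,
    total_raw_regions: int,
    pp_element_count: int,
    coverage_threshold: float = 0.7,
    count_disparity_ratio: float = 0.5,  # illustrative, not a configured default
) -> bool:
    """Activate gap filling when PP-Structure coverage is low, or when it
    emits far fewer elements than Raw OCR found (e.g. 9 elements vs 56 regions)."""
    if total_raw_regions == 0:
        return False
    coverage = covered_regions / total_raw_regions
    too_few_elements = pp_element_count < total_raw_regions * count_disparity_ratio
    return coverage < coverage_threshold or too_few_elements
```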
## 3. Configuration

- [x] 3.1 Add configurable parameters to settings:
  - `gap_filling_enabled`: bool (default: True)
  - `gap_filling_coverage_threshold`: float (default: 0.7)
  - `gap_filling_iou_threshold`: float (default: 0.15)
  - `gap_filling_confidence_threshold`: float (default: 0.3)
  - `gap_filling_dedup_iou_threshold`: float (default: 0.5)

## 4. Testing (with env)

- [x] 4.1 Create test fixtures with a PP-Structure severe miss-detection case (with scan.pdf / scan2.pdf)
- [x] 4.2 Test gap detection correctly identifies uncovered regions
- [x] 4.3 Test supplemented elements have correct metadata
- [x] 4.4 Test reading order is correctly recalculated
- [x] 4.5 Test deduplication prevents duplicate text
- [x] 4.6 Test normal document without miss-detection has no duplication/inflation
- [x] 4.7 Test track isolation (Direct track unaffected)

## 5. Documentation

- [x] 5.1 Add inline documentation to GapFillingService
- [x] 5.2 Update configuration documentation with new settings

diff --git a/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/proposal.md b/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/proposal.md
deleted file mode 100644
index 2cd6137..0000000
--- a/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/proposal.md
+++ /dev/null
@@ -1,108 +0,0 @@
# Fix OCR Track Table Rendering

## Summary

OCR track PDF generation produces tables with incorrect format and layout. Tables appear without proper structure - cell content is misaligned and the visual format differs significantly from the original document. Image placement is correct, but table rendering is broken.

## Problem Statement

When generating a PDF from OCR track results (via `scan.pdf` processed by PP-StructureV3), the output tables have:
1. **Wrong cell alignment** - content not positioned in proper cells
2. **Missing table structure** - rows/columns don't match the original document layout
3. **Incorrect content distribution** - all content flows linearly instead of maintaining the grid structure

Reference: `backend/storage/results/af7c9ee8-60a0-4291-9f22-ef98d27eed52/`
- Original: `af7c9ee8-60a0-4291-9f22-ef98d27eed52_scan_page_1.png`
- Generated: `scan_layout.pdf`
- Result JSON: `scan_result.json` - Tables have correct `{rows, cols, cells}` structure

## Root Cause Analysis

### Issue 1: Table Content Not Converted to TableData Object

In `_json_to_document_element` (pdf_generator_service.py:1952):
```python
element = DocumentElement(
    ...
    content=elem_dict.get('content', ''),  # Raw dict, not TableData
    ...
)
```

Table elements have `content` as a dict `{rows: 5, cols: 4, cells: [...]}`, but it is not converted to a `TableData` object.

### Issue 2: OCR Track HTML Conversion Fails

In `convert_unified_document_to_ocr_data` (pdf_generator_service.py:464-467):
```python
elif isinstance(element.content, dict):
    html_content = element.content.get('html', str(element.content))
```

Since there is no 'html' key in the cells-based dict, it falls back to `str(element.content)` = `"{'rows': 5, 'cols': 4, ...}"` - invalid HTML.
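Both fix options below hinge on turning this cells-based dict into valid HTML. A minimal sketch of that conversion, assuming each cell dict carries `row`, `col`, `rowspan`, `colspan`, and `text` keys; the field names are illustrative, the real schema lives in `unified_document.py`:

```python
from html import escape

def cells_dict_to_html(table: dict) -> str:
    """Render a {rows, cols, cells} dict as an HTML table."""
    grid = [[None] * table['cols'] for _ in range(table['rows'])]
    for cell in table['cells']:
        grid[cell['row']][cell['col']] = cell

    parts = ['<table>']
    for row in grid:
        parts.append('<tr>')
        for cell in row:
            if cell is None:
                # Position covered by a span, or no cell record (simplified)
                continue
            span = ''
            if cell.get('rowspan', 1) > 1:
                span += f' rowspan="{cell["rowspan"]}"'
            if cell.get('colspan', 1) > 1:
                span += f' colspan="{cell["colspan"]}"'
            parts.append(f'<td{span}>{escape(cell.get("text", ""))}</td>')
        parts.append('</tr>')
    parts.append('</table>')
    return ''.join(parts)
```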
- -### Issue 3: Different Table Rendering Paths - -- **Direct track** uses `_draw_table_element_direct` which properly handles dict with cells via `_build_rows_from_cells_dict` -- **OCR track** uses `draw_table_region` which expects HTML strings and fails with dict content - -## Proposed Solution - -### Option A: Convert dict to TableData during JSON loading (Recommended) - -In `_json_to_document_element`, when element type is TABLE and content is a dict with cells, convert it to a `TableData` object: - -```python -# For TABLE elements, convert dict to TableData -if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content: - content = self._dict_to_table_data(content) -``` - -This ensures `element.content.to_html()` works correctly in `convert_unified_document_to_ocr_data`. - -### Option B: Fix conversion in convert_unified_document_to_ocr_data - -Handle dict with cells properly by converting to HTML: - -```python -elif isinstance(element.content, dict): - if 'cells' in element.content: - # Convert cells-based dict to HTML - html_content = self._cells_dict_to_html(element.content) - elif 'html' in element.content: - html_content = element.content['html'] - else: - html_content = str(element.content) -``` - -## Impact on Hybrid Mode - -Hybrid mode uses Direct track rendering (`_generate_direct_track_pdf`) which already handles dict content properly via `_build_rows_from_cells_dict`. The proposed fixes should not affect hybrid mode negatively. - -However, testing should verify: -1. Hybrid mode continues to work with combined Direct + OCR elements -2. Table rendering quality is consistent across all tracks - -## Success Criteria - -1. OCR track tables render with correct structure matching original document -2. Cell content positioned in proper grid locations -3. Table borders/grid lines visible -4. No regression in Direct track or Hybrid mode table rendering -5. All test files (scan.pdf, img1.png, img2.png, img3.png) produce correct output - -## Files to Modify - -1. `backend/app/services/pdf_generator_service.py` - - `_json_to_document_element`: Convert table dict to TableData - - `convert_unified_document_to_ocr_data`: Improve dict handling (if Option B) - -2. `backend/app/models/unified_document.py` (optional) - - Add `TableData.from_dict()` class method for cleaner conversion - -## Testing Plan - -1. Test scan.pdf with OCR track - verify table structure matches original -2. Test img1.png, img2.png, img3.png with OCR track -3. Test PDF files with Direct track - verify no regression -4. Test Hybrid mode with files that trigger OCR fallback diff --git a/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/specs/pdf-generation/spec.md b/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/specs/pdf-generation/spec.md deleted file mode 100644 index c1d7831..0000000 --- a/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/specs/pdf-generation/spec.md +++ /dev/null @@ -1,52 +0,0 @@ -# PDF Generation - OCR Track Table Rendering Fix - -## MODIFIED Requirements - -### Requirement: OCR Track Table Content Conversion - -The PDF generator MUST properly convert table content from JSON dict format to renderable structure when processing OCR track results. 
- -#### Scenario: Table dict with cells array converts to proper HTML - -Given an OCR track JSON with table element containing rows, cols, and cells array -When the PDF generator processes this element -Then the table content MUST be converted to a TableData object -And TableData.to_html() MUST produce valid HTML with proper tr/td structure -And the generated PDF table MUST have cells positioned in correct grid locations - -#### Scenario: Table with rowspan/colspan renders correctly - -Given a table element with cells having rowspan > 1 or colspan > 1 -When the PDF generator renders the table -Then merged cells MUST span the correct number of rows/columns -And content MUST appear in the merged cell position - -### Requirement: Table Visual Fidelity - -The PDF generator MUST render OCR track tables with visual structure matching the original document. - -#### Scenario: Table renders with grid lines - -Given an OCR track table element -When rendered to PDF -Then the table MUST have visible grid lines/borders -And cell boundaries MUST be clearly defined - -#### Scenario: Table text alignment preserved - -Given an OCR track table with cell content -When rendered to PDF -Then text MUST be positioned within the correct cell boundaries -And text MUST NOT overflow into adjacent cells - -### Requirement: Backward Compatibility with Hybrid Mode - -The table rendering fix MUST NOT break hybrid mode processing. - -#### Scenario: Hybrid mode tables render correctly - -Given a document processed with hybrid mode combining Direct and OCR tracks -When PDF is generated -Then Direct track tables MUST render with existing quality -And OCR track tables MUST render with improved quality -And no regression in table positioning or content diff --git a/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/tasks.md b/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/tasks.md deleted file mode 100644 index 15361ba..0000000 --- a/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/tasks.md +++ /dev/null @@ -1,55 +0,0 @@ -# Implementation Tasks - -## Phase 1: Core Fix - Table Content Conversion - -### 1.1 Add TableData.from_dict() class method -- [ ] In `unified_document.py`, add `from_dict()` method to `TableData` class -- [ ] Handle conversion of cells list (list of dicts) to `TableCell` objects -- [ ] Preserve rows, cols, headers, caption fields - -### 1.2 Fix _json_to_document_element for TABLE elements -- [ ] In `pdf_generator_service.py`, modify `_json_to_document_element` -- [ ] When `elem_type == ElementType.TABLE` and content is dict with 'cells', convert to `TableData` -- [ ] Use `TableData.from_dict()` for clean conversion - -### 1.3 Verify TableData.to_html() generates correct HTML -- [ ] Test that `to_html()` produces parseable HTML with proper row/cell structure -- [ ] Verify colspan/rowspan attributes are correctly generated -- [ ] Ensure empty cells are properly handled - -## Phase 2: OCR Track Rendering Consistency - -### 2.1 Review convert_unified_document_to_ocr_data -- [ ] Verify TableData objects are properly converted to HTML -- [ ] Add fallback handling for dict content with 'cells' key -- [ ] Log warning if content cannot be converted to HTML - -### 2.2 Review draw_table_region -- [ ] Verify HTMLTableParser correctly parses generated HTML -- [ ] Check that ReportLab Table is positioned at correct bbox -- [ ] Verify font and style application - -## Phase 3: Testing and Verification - -### 3.1 Test OCR Track -- [ ] Test scan.pdf - verify tables have 
correct structure -- [ ] Test img1.png, img2.png, img3.png -- [ ] Compare generated PDF with original documents - -### 3.2 Test Direct Track (Regression) -- [ ] Test PDF files with Direct track -- [ ] Verify table rendering unchanged - -### 3.3 Test Hybrid Mode -- [ ] Test files that trigger hybrid processing -- [ ] Verify mixed Direct + OCR elements render correctly - -## Phase 4: Code Quality - -### 4.1 Add logging -- [ ] Add debug logging for table content type detection -- [ ] Log conversion steps for troubleshooting - -### 4.2 Error handling -- [ ] Handle malformed cell data gracefully -- [ ] Log warnings for unexpected content formats diff --git a/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/proposal.md b/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/proposal.md deleted file mode 100644 index 1b06fc2..0000000 --- a/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/proposal.md +++ /dev/null @@ -1,40 +0,0 @@ -# Change: Simplify PP-StructureV3 Configuration with Layout Model Selection - -## Why - -Current PP-StructureV3 parameter adjustment UI exposes 7 technical ML parameters (thresholds, ratios, merge modes) that are difficult for end users to understand. Meanwhile, switching to a different layout detection model (e.g., CDLA-trained models for Chinese documents) would have a much greater impact on OCR quality than fine-tuning these parameters. - -**Problems with current approach:** -- Users don't understand what `layout_detection_threshold` or `text_det_unclip_ratio` mean -- Wrong parameter values can make OCR results worse -- The default model (PubLayNet-based) is optimized for English academic papers, not Chinese business documents -- Model selection is far more impactful than parameter tuning - -## What Changes - -### Backend Changes -- **REMOVED**: API parameter `pp_structure_params` from task start endpoint -- **ADDED**: New API parameter `layout_model` with predefined options: - - `"default"` - Standard model (PubLayNet-based, for English documents) - - `"chinese"` - PP-DocLayout-S model (for Chinese documents, forms, contracts) - - `"cdla"` - CDLA model (alternative Chinese document layout model) -- **MODIFIED**: PP-StructureV3 initialization uses `layout_detection_model_name` based on selection -- Keep fine-tuning parameters in backend `config.py` with optimized defaults - -### Frontend Changes -- **REMOVED**: `PPStructureParams.tsx` component (slider/dropdown UI for 7 parameters) -- **ADDED**: Simple radio button/dropdown for layout model selection with clear descriptions -- **MODIFIED**: Task start request body to send `layout_model` instead of `pp_structure_params` - -### API Changes -- **BREAKING**: Remove `pp_structure_params` from `POST /api/v2/tasks/{task_id}/start` -- **ADDED**: New optional parameter `layout_model: "default" | "chinese" | "cdla"` - -## Impact - -- Affected specs: `ocr-processing` -- Affected code: - - Backend: `app/routers/tasks.py`, `app/services/ocr_service.py`, `app/core/config.py` - - Frontend: `src/components/PPStructureParams.tsx` (remove), `src/types/apiV2.ts`, task start form -- Breaking change: Clients using `pp_structure_params` will need to migrate to `layout_model` -- User impact: Simpler UI, better default OCR quality for Chinese documents diff --git a/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/specs/ocr-processing/spec.md deleted 
file mode 100644 index 86b4aba..0000000 --- a/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/specs/ocr-processing/spec.md +++ /dev/null @@ -1,86 +0,0 @@ -# ocr-processing Specification Delta - -## REMOVED Requirements - -### Requirement: Frontend-Adjustable PP-StructureV3 Parameters -**Reason**: Complex ML parameters are difficult for end users to understand and tune. Model selection provides better UX and more significant quality improvements. -**Migration**: Replace `pp_structure_params` API parameter with `layout_model` parameter. - -### Requirement: PP-StructureV3 Parameter UI Controls -**Reason**: Slider/dropdown UI for 7 technical parameters adds complexity without proportional benefit. Simple model selection is more user-friendly. -**Migration**: Remove `PPStructureParams.tsx` component, add `LayoutModelSelector.tsx` component. - -## ADDED Requirements - -### Requirement: Layout Model Selection -The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning. - -#### Scenario: User selects Chinese document model -- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices) -- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S) -- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model -- **AND** the model SHALL be optimized for 23 Chinese document element types -- **AND** table and form detection accuracy SHALL be improved over the default model - -#### Scenario: User selects standard model for English documents -- **GIVEN** a user is processing English academic papers or reports -- **WHEN** the user selects "Standard Model" (PubLayNet-based) -- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model -- **AND** the model SHALL be optimized for English document layouts - -#### Scenario: User selects CDLA model for specialized Chinese layout -- **GIVEN** a user is processing Chinese documents with complex layouts -- **WHEN** the user selects "CDLA Model" -- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model -- **AND** the model SHALL provide specialized Chinese document layout analysis - -#### Scenario: Layout model is sent via API request -- **GIVEN** a frontend application with model selection UI -- **WHEN** the user starts task processing with a selected model -- **THEN** the frontend SHALL send the model choice in the request body: - ```json - POST /api/v2/tasks/{task_id}/start - { - "use_dual_track": true, - "force_track": "ocr", - "language": "ch", - "layout_model": "chinese" - } - ``` -- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model - -#### Scenario: Default model when not specified -- **GIVEN** an API request without `layout_model` parameter -- **WHEN** the task is started -- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model -- **AND** processing SHALL work correctly without requiring model selection - -#### Scenario: Invalid model name is rejected -- **GIVEN** a request with an invalid `layout_model` value -- **WHEN** the user sends `layout_model: "invalid_model"` -- **THEN** the API SHALL return 422 Validation Error -- **AND** provide a clear error message listing valid model options - -### Requirement: Layout Model Selection UI -The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear 
descriptions of each option. - -#### Scenario: Model options are displayed with descriptions -- **GIVEN** the model selection UI is displayed -- **WHEN** the user views the available options -- **THEN** the UI SHALL show the following options: - - "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices - - "Standard Model" - for English academic papers, reports - - "CDLA Model" - for specialized Chinese layout analysis -- **AND** each option SHALL have a brief description of its use case - -#### Scenario: Chinese model is selected by default -- **GIVEN** the user opens the task processing interface -- **WHEN** the model selection is displayed -- **THEN** "Chinese Document Model" SHALL be pre-selected as the default -- **AND** the user MAY change the selection before starting processing - -#### Scenario: Model selection is visible only for OCR track -- **GIVEN** a document processing interface -- **WHEN** the user selects processing track -- **THEN** layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected -- **AND** SHALL be hidden for Direct track (which does not use PP-StructureV3) diff --git a/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/tasks.md b/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/tasks.md deleted file mode 100644 index 9ab0989..0000000 --- a/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/tasks.md +++ /dev/null @@ -1,56 +0,0 @@ -# Implementation Tasks - -## 1. Backend API Changes - -- [x] 1.1 Update `app/schemas/task.py` to add `layout_model` enum type -- [x] 1.2 Update `app/routers/tasks.py` to replace `pp_structure_params` with `layout_model` parameter -- [x] 1.3 Update `app/services/ocr_service.py` to map `layout_model` to `layout_detection_model_name` -- [x] 1.4 Remove custom PP-Structure engine creation logic (use model selection instead) -- [x] 1.5 Add backward compatibility: default to "chinese" if no model specified - -## 2. Backend Configuration - -- [x] 2.1 Keep `layout_detection_model_name` in `config.py` as fallback default -- [x] 2.2 Keep fine-tuning parameters in `config.py` (not exposed to API) -- [x] 2.3 Document available layout models in config comments - -## 3. Frontend Changes - -- [x] 3.1 Remove `PPStructureParams.tsx` component -- [x] 3.2 Update `src/types/apiV2.ts`: - - Remove `PPStructureV3Params` interface - - Add `LayoutModel` type: `"default" | "chinese" | "cdla"` - - Update `ProcessingOptions` to use `layout_model` instead of `pp_structure_params` -- [x] 3.3 Create `LayoutModelSelector.tsx` component with: - - Radio buttons or dropdown for model selection - - Clear descriptions for each model option - - Default selection: "chinese" -- [x] 3.4 Update task start form to use new `LayoutModelSelector` -- [x] 3.5 Update API calls to send `layout_model` instead of `pp_structure_params` - -## 4. Internationalization - -- [x] 4.1 Add i18n strings for layout model options: - - `layoutModel.default`: "Standard Model (English documents)" - - `layoutModel.chinese`: "Chinese Document Model (Recommended)" - - `layoutModel.cdla`: "CDLA Model (Chinese layout analysis)" -- [x] 4.2 Add i18n strings for model descriptions - -## 5. 
Testing - -- [x] 5.1 Create new tests for `layout_model` parameter (`test_layout_model_api.py`, `test_layout_model.py`) -- [x] 5.2 Archive tests for `pp_structure_params` validation (moved to `tests/archived/`) -- [x] 5.3 Add tests for layout model selection (19 tests passing) -- [x] 5.4 Test backward compatibility (no model specified → use chinese default) - -## 6. Documentation - -- [ ] 6.1 Update API documentation for task start endpoint -- [ ] 6.2 Remove PP-Structure parameter documentation -- [ ] 6.3 Add layout model selection documentation - -## 7. Cleanup - -- [x] 7.1 Remove localStorage keys for PP-Structure params (`pp_structure_params_presets`, `pp_structure_params_last_used`) -- [x] 7.2 Remove any unused imports/types related to PP-Structure params -- [x] 7.3 Archive old PP-Structure params test files diff --git a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/MODEL_CLEANUP.md b/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/MODEL_CLEANUP.md deleted file mode 100644 index a67d72e..0000000 --- a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/MODEL_CLEANUP.md +++ /dev/null @@ -1,141 +0,0 @@ -# PP-StructureV3 Model Cache Cleanup Guide - -## Overview - -After upgrading PP-StructureV3 models, older unused models may remain in the cache directory. This guide explains how to safely remove them to free disk space. - -## Model Cache Location - -PaddleX/PaddleOCR 3.x stores downloaded models in: - -``` -~/.paddlex/official_models/ -``` - -## Models After Upgrade - -### Current Active Models (DO NOT DELETE) - -| Model | Purpose | Approx. Size | -|-------|---------|--------------| -| `PP-DocLayout_plus-L` | Layout detection for Chinese documents | ~350MB | -| `SLANeXt_wired` | Table structure recognition (bordered tables) | ~351MB | -| `SLANeXt_wireless` | Table structure recognition (borderless tables) | ~351MB | -| `PP-FormulaNet_plus-L` | Formula recognition (Chinese + English) | ~800MB | -| `PP-OCRv5_*` | Text detection and recognition | ~150MB | -| `picodet_lcnet_x1_0_fgd_layout_cdla` | CDLA layout model option | ~10MB | - -### Deprecated Models (Safe to Delete) - -| Model | Reason | Approx. Size | -|-------|--------|--------------| -| `PP-DocLayout-S` | Replaced by PP-DocLayout_plus-L | ~50MB | -| `SLANet` | Replaced by SLANeXt_wired/wireless | ~7MB | -| `SLANet_plus` | Replaced by SLANeXt_wired/wireless | ~7MB | -| `PP-FormulaNet-S` | Replaced by PP-FormulaNet_plus-L | ~200MB | -| `PP-FormulaNet-L` | Replaced by PP-FormulaNet_plus-L | ~400MB | - -## Cleanup Commands - -### List Current Cache - -```bash -# List all cached models -ls -la ~/.paddlex/official_models/ - -# Show disk usage per model -du -sh ~/.paddlex/official_models/* -``` - -### Delete Deprecated Models - -```bash -# Remove deprecated layout model -rm -rf ~/.paddlex/official_models/PP-DocLayout-S - -# Remove deprecated table models -rm -rf ~/.paddlex/official_models/SLANet -rm -rf ~/.paddlex/official_models/SLANet_plus - -# Remove deprecated formula models (if present) -rm -rf ~/.paddlex/official_models/PP-FormulaNet-S -rm -rf ~/.paddlex/official_models/PP-FormulaNet-L -``` - -### Cleanup Script - -```bash -#!/bin/bash -# cleanup_old_models.sh - Remove deprecated PP-StructureV3 models - -CACHE_DIR="$HOME/.paddlex/official_models" - -echo "PP-StructureV3 Model Cleanup" -echo "============================" -echo "" - -# Check if cache directory exists -if [ ! 
-d "$CACHE_DIR" ]; then
    echo "Cache directory not found: $CACHE_DIR"
    exit 0
fi

# List deprecated models
DEPRECATED_MODELS=(
    "PP-DocLayout-S"
    "SLANet"
    "SLANet_plus"
    "PP-FormulaNet-S"
    "PP-FormulaNet-L"
)

echo "Checking for deprecated models..."
echo ""

FOUND_COUNT=0
for model in "${DEPRECATED_MODELS[@]}"; do
    MODEL_PATH="$CACHE_DIR/$model"
    if [ -d "$MODEL_PATH" ]; then
        SIZE=$(du -sh "$MODEL_PATH" 2>/dev/null | cut -f1)
        echo "Found: $model ($SIZE)"
        FOUND_COUNT=$((FOUND_COUNT + 1))
    fi
done

if [ $FOUND_COUNT -eq 0 ]; then
    echo "No deprecated models found. Cache is clean."
    exit 0
fi

echo ""
read -p "Delete these models? [y/N]: " confirm

if [ "$confirm" = "y" ] || [ "$confirm" = "Y" ]; then
    for model in "${DEPRECATED_MODELS[@]}"; do
        MODEL_PATH="$CACHE_DIR/$model"
        if [ -d "$MODEL_PATH" ]; then
            rm -rf "$MODEL_PATH"
            echo "Deleted: $model"
        fi
    done
    echo ""
    echo "Cleanup complete."
else
    echo "Cleanup cancelled."
fi
```

## Space Savings Estimate

After cleanup, you can expect to free approximately:
- **~50MB** from the deprecated layout model
- **~14MB** from the deprecated table models
- **~600MB** from the deprecated formula models (if present)

Total potential savings: **~665MB**

## Notes

1. Models are downloaded on first use. Deleting active models will trigger a re-download.
2. The cache directory may differ if the `PADDLEX_HOME` environment variable is set.
3. Always verify which models your configuration uses before deleting.
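If you prefer scripting the size check, a short Python sketch that also covers note 2's `PADDLEX_HOME` caveat (assuming that variable replaces `~/.paddlex` as the cache root, which is an assumption here):

```python
import os
from pathlib import Path

# Fall back to the default cache root documented above
cache_root = Path(os.environ.get("PADDLEX_HOME", Path.home() / ".paddlex"))
for model_dir in sorted((cache_root / "official_models").glob("*")):
    size_mb = sum(f.stat().st_size for f in model_dir.rglob("*") if f.is_file()) / 1e6
    print(f"{model_dir.name}: {size_mb:.0f} MB")
```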
diff --git a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/proposal.md b/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/proposal.md
deleted file mode 100644
index 51f270d..0000000
--- a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/proposal.md
+++ /dev/null
@@ -1,134 +0,0 @@
# Upgrade PP-StructureV3 Models

## Why

The PP-StructureV3 model configuration currently used by the project has the following problems:

1. **Insufficient layout detection accuracy**: PP-DocLayout-S (70.9% mAP) cannot correctly handle complex tables and layouts
2. **Low table recognition accuracy**: SLANet (59.52%) produces incorrect HTML structures
3. **Preprocessing modules disabled**: document orientation correction and unwarping are turned off
4. **Oversized model footprint**: unused models are downloaded, wasting storage space

## What Changes

### Stage 1: Preprocessing modules - enable all

| Feature | Current | After |
|---------|---------|-------|
| `use_doc_orientation_classify` | False | **True** |
| `use_doc_unwarping` | False | **True** |
| `use_textline_orientation` | False | **True** |

### Stage 2: OCR modules - keep as-is

- Continue using PP-OCRv5 (default configuration)
- No changes required

### Stage 3: Layout analysis - upgrade model options

| Option | Current model | New model | mAP |
|--------|---------------|-----------|-----|
| `chinese` | PP-DocLayout-S (removed) | **PP-DocLayout_plus-L** | 83.2% |
| `default` | PubLayNet | PubLayNet (unchanged) | ~94% |
| `cdla` | CDLA | CDLA (unchanged) | ~86% |

**Key changes**:
- Remove PP-DocLayout-S (70.9% mAP)
- Add PP-DocLayout_plus-L (83.2% mAP, 20 classes)
- The frontend "Chinese document" option switches to PP-DocLayout_plus-L

### Stage 4: Element recognition - upgrade table recognition

| Module | Current model | New model | Accuracy change |
|--------|---------------|-----------|-----------------|
| Table recognition | SLANet (default) | **SLANeXt_wired + SLANeXt_wireless** | 59.52% → 69.65% |
| Formula recognition | PP-FormulaNet (default) | **PP-FormulaNet_plus-L** | 45.78% → 90.64% (Chinese) |
| Chart parsing | PP-Chart2Table | PP-Chart2Table (unchanged) | - |
| Seal recognition | PP-OCRv4_seal | PP-OCRv4_seal (unchanged) | - |

**Table recognition strategy**:
- Use SLANeXt_wired and SLANeXt_wireless together
- First classify each table as wired (bordered) or wireless (borderless)
- Select the matching SLANeXt model based on the classification
- Combined test accuracy reaches 69.65%

### Storage optimization - delete unused models

PaddleOCR 3.x model cache location: `~/.paddlex/official_models/`

Model directories that can be deleted:
- PP-DocLayout-S (replaced by PP-DocLayout_plus-L)
- SLANet (replaced by SLANeXt)
- Other unused legacy models

**Note**: after deletion, the first use of a new model triggers a download

## Requirements

### REQ-1: Preprocessing modules enabled
The system **SHALL** enable all preprocessing features when initializing PP-StructureV3:
- Document orientation classification (use_doc_orientation_classify=True)
- Document unwarping (use_doc_unwarping=True)
- Text line orientation detection (use_textline_orientation=True)

**Scenario: Processing a rotated scanned document**
- Given a PDF document rotated 90 degrees
- When it is processed with the OCR track
- Then the system should automatically correct the orientation before running OCR

### REQ-2: Layout model upgrade
The system **SHALL** change the model behind the "chinese" option from PP-DocLayout-S to PP-DocLayout_plus-L

**Scenario: Processing a complex Chinese document**
- Given a Chinese document containing tables, images, and formulas
- When it is processed with the "chinese" layout model
- Then PP-DocLayout_plus-L (83.2% mAP) should be used for layout analysis

### REQ-3: Table recognition upgrade
The system **SHALL** use SLANeXt_wired and SLANeXt_wireless together for table recognition

**Scenario: Processing wired tables**
- Given a document containing wired (bordered) tables
- When table structure recognition runs
- Then the SLANeXt_wired model should be used
- And correct HTML table structure should be produced

**Scenario: Processing wireless tables**
- Given a document containing wireless (borderless) tables
- When table structure recognition runs
- Then the SLANeXt_wireless model should be used

### REQ-4: Formula recognition upgrade
The system **SHALL** use PP-FormulaNet_plus-L for formula recognition to support Chinese formulas

### REQ-5: Model cache cleanup
The system **SHOULD** provide tooling or documentation describing how to clean up unused model caches to save storage space

## Model Comparison Data

### Table recognition models

| Model | Accuracy | Inference time | Model size | Suitability |
|-------|----------|----------------|------------|-------------|
| SLANet | 59.52% | 24ms | 6.9 MB | ❌ insufficient accuracy |
| SLANet_plus | 63.69% | 23ms | 6.9 MB | ❌ still insufficient |
| **SLANeXt_wired** | 69.65% | 86ms | 351 MB | ✅ wired tables |
| **SLANeXt_wireless** | 69.65% | - | 351 MB | ✅ wireless tables |

**Conclusion**: the SLANeXt series is roughly 10 points more accurate than SLANet/SLANet_plus, at about 50x the model size. Given that table recognition is a core feature, the upgrade is recommended.

### Layout detection models

| Model | Classes | mAP | Inference time | Suitability |
|-------|---------|-----|----------------|-------------|
| PP-DocLayout-S | 23 | 70.9% | 12ms | ❌ insufficient accuracy |
| PP-DocLayout-L | 23 | 90.4% | 34ms | ✅ general high accuracy |
| **PP-DocLayout_plus-L** | 20 | 83.2% | 53ms | ✅ recommended for complex documents |

## References

- [PaddleOCR Table Structure Recognition](http://www.paddleocr.ai/main/en/version3.x/module_usage/table_structure_recognition.html)
- [SLANeXt_wired on HuggingFace](https://huggingface.co/PaddlePaddle/SLANeXt_wired)
- [SLANeXt_wireless on HuggingFace](https://huggingface.co/PaddlePaddle/SLANeXt_wireless)
- [PP-StructureV3 Technical Report](https://arxiv.org/html/2507.05595v1)
- [PaddleOCR Model Cache Issue](https://github.com/PaddlePaddle/PaddleOCR/issues/10234)
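For reference, the spec and tasks below name the initialization kwargs this upgrade touches. A minimal sketch of the resulting PPStructureV3 construction, using only the kwarg names those documents mention; treat the exact table-model kwarg as an assumption pending the dual wired/wireless configuration:

```python
from paddleocr import PPStructureV3

# Kwarg names as listed in tasks 1.2/1.3 below; verify them against the
# installed paddleocr 3.x signature before relying on this.
structure_engine = PPStructureV3(
    use_doc_orientation_classify=True,   # Stage 1: orientation correction
    use_doc_unwarping=True,              # Stage 1: unwarping
    use_textline_orientation=True,       # Stage 1: text line orientation
    layout_detection_model_name="PP-DocLayout_plus-L",       # Stage 3: "chinese"
    table_structure_model_name="SLANeXt_wired",              # Stage 4 (assumed kwarg)
    formula_recognition_model_name="PP-FormulaNet_plus-L",   # Stage 4
)
```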
diff --git a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/specs/ocr-processing/spec.md
deleted file mode 100644
index 08a4d4e..0000000
--- a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/specs/ocr-processing/spec.md
+++ /dev/null
@@ -1,56 +0,0 @@
## ADDED Requirements

### Requirement: PP-StructureV3 Configuration

The system SHALL configure PP-StructureV3 with the following settings:

**Preprocessing (Stage 1):**
- Document orientation classification MUST be enabled (`use_doc_orientation_classify=True`)
- Document unwarping MUST be enabled (`use_doc_unwarping=True`)
- Textline orientation detection MUST be enabled (`use_textline_orientation=True`)

**Layout Detection (Stage 3):**
- The `chinese` layout model option SHALL use PP-DocLayout_plus-L (83.2% mAP)
- The `default` layout model option SHALL use PubLayNet for English documents
- The `cdla` layout model option SHALL use picodet_lcnet_x1_0_fgd_layout_cdla

**Element Recognition (Stage 4):**
- Table structure recognition SHALL use the SLANeXt_wired and SLANeXt_wireless models (69.65% combined accuracy)
- Formula recognition SHALL use PP-FormulaNet_plus-L (92.22% English, 90.64% Chinese BLEU)
- Chart parsing SHALL use PP-Chart2Table
- Seal recognition SHALL use PP-OCRv4_seal

#### Scenario: Processing rotated scanned document
- **WHEN** a PDF document with rotated pages is processed using the OCR track
- **THEN** the system SHALL automatically detect and correct the orientation before OCR processing

#### Scenario: Processing complex Chinese document with tables
- **WHEN** a Chinese document containing tables, images, and formulas is processed
- **AND** the user selects the "chinese" layout model
- **THEN** the system SHALL use PP-DocLayout_plus-L for layout detection (83.2% mAP)
- **AND** the system SHALL correctly identify table regions

#### Scenario: Table structure recognition with wired tables
- **WHEN** a document contains wired (bordered) tables
- **THEN** the system SHALL use the SLANeXt_wired model for structure recognition
- **AND** output correct HTML table structure with proper row/column spanning

#### Scenario: Table structure recognition with wireless tables
- **WHEN** a document contains wireless (borderless) tables
- **THEN** the system SHALL use the SLANeXt_wireless model for structure recognition

#### Scenario: Chinese formula recognition
- **WHEN** a document contains mathematical formulas with Chinese characters
- **THEN** the system SHALL use PP-FormulaNet_plus-L for recognition
- **AND** output LaTeX code with correct Chinese character representation

### Requirement: Model Cache Cleanup

The system SHALL provide documentation for cleaning up unused model caches to optimize storage space.
- -#### Scenario: User wants to free disk space after model upgrade -- **WHEN** the user has upgraded from older models (PP-DocLayout-S, SLANet) to newer models -- **THEN** the documentation SHALL explain how to delete unused cached models from `~/.paddlex/official_models/` -- **AND** list which model directories can be safely removed diff --git a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/tasks.md b/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/tasks.md deleted file mode 100644 index 792839d..0000000 --- a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/tasks.md +++ /dev/null @@ -1,77 +0,0 @@ -# Tasks: Upgrade PP-StructureV3 Models - -## 1. Backend Configuration Changes - -- [x] 1.1 Update `backend/app/core/config.py` - Enable preprocessing flags - - Set `use_doc_orientation_classify` default to True - - Set `use_doc_unwarping` default to True - - Set `use_textline_orientation` default to True - - Add `table_structure_model_name` configuration - - Add `formula_recognition_model_name` configuration - -- [x] 1.2 Update `backend/app/services/ocr_service.py` - Model mapping changes - - Update `LAYOUT_MODEL_MAPPING`: - - Change `"chinese"` from `"PP-DocLayout-S"` to `"PP-DocLayout_plus-L"` - - Keep `"default"` as PubLayNet - - Keep `"cdla"` as is - - Update `_ensure_structure_engine()`: - - Pass preprocessing flags to PPStructureV3 - - Configure SLANeXt models for table recognition - - Configure PP-FormulaNet_plus-L for formula recognition - -- [x] 1.3 Update PPStructureV3 initialization kwargs - - Add `table_structure_model_name="SLANeXt_wired"` (or configure dual model) - - Add `formula_recognition_model_name="PP-FormulaNet_plus-L"` - - Verify preprocessing flags are passed correctly - -## 2. Schema Updates - -- [x] 2.1 Update `backend/app/schemas/task.py` - LayoutModelEnum - - Rename or update `CHINESE` description to reflect PP-DocLayout_plus-L - - Update docstrings to reflect new model capabilities - -## 3. Frontend Updates - -- [x] 3.1 Update `frontend/src/components/LayoutModelSelector.tsx` - - Update Chinese option description to mention PP-DocLayout_plus-L - - Update accuracy information displayed to users - -- [x] 3.2 Update `frontend/src/i18n/locales/zh-TW.json` - - Update `layoutModel.chinese.description` to reflect new model - - Update any accuracy percentages in descriptions - -## 4. Testing - -- [x] 4.1 Create unit tests for new model configuration - - Test preprocessing flags are correctly passed - - Test model mapping resolves correctly - - Test engine initialization with new models - -- [ ] 4.2 Integration testing with real documents - - Test rotated document handling (preprocessing) - - Test complex Chinese document layout detection - - Test table structure recognition accuracy - - Test formula recognition with Chinese formulas - -- [x] 4.3 Update existing tests - - Update `backend/tests/services/test_layout_model.py` for new mapping - - Update `backend/tests/api/test_layout_model_api.py` if needed - -## 5. Documentation - -- [x] 5.1 Create model cleanup documentation - - Document `~/.paddlex/official_models/` cache location - - List models that can be safely deleted after upgrade - - Provide cleanup script/commands - - See: [MODEL_CLEANUP.md](./MODEL_CLEANUP.md) - -- [x] 5.2 Update API documentation - - Document preprocessing feature behavior - - Update layout model descriptions - -## 6. 
Verification & Deployment - -- [ ] 6.1 Verify new models download correctly on first use -- [ ] 6.2 Measure memory/GPU usage with new models -- [ ] 6.3 Compare processing speed before/after upgrade -- [ ] 6.4 Verify existing functionality not broken diff --git a/openspec/changes/archive/2025-11-28-unify-image-scaling/proposal.md b/openspec/changes/archive/2025-11-28-unify-image-scaling/proposal.md deleted file mode 100644 index 57fb005..0000000 --- a/openspec/changes/archive/2025-11-28-unify-image-scaling/proposal.md +++ /dev/null @@ -1,72 +0,0 @@ -# Change: Unify Image Scaling Strategy for Optimal Layout Detection - -## Why - -Currently, the system has inconsistent image resolution handling: - -1. **PDF conversion**: Always uses 300 DPI, producing ~2480×3508 images for A4 -2. **Image downscaling**: Only applied when image > 2000px (no upscaling) -3. **Small images**: Never scaled up, even if they're below optimal detection size - -This inconsistency causes: -- Wasted processing: PDF→300DPI→scale down to 1600px (double conversion) -- Suboptimal detection: Small images stay small, missing table structures -- Inconsistent behavior: Different source formats get different treatment - -PP-Structure's layout detection model (RT-DETR based) works best with images around 1600px on the longest side. Both too-large and too-small images reduce detection accuracy. - -## What Changes - -- **Bidirectional scaling for PP-Structure** - - Scale DOWN images larger than max threshold (2000px) → target (1600px) - - Scale UP images smaller than min threshold (1200px) → target (1600px) - - No change for images in optimal range (1200-2000px) - -- **PDF conversion DPI optimization** - - Calculate optimal DPI based on target resolution - - Avoid double-scaling (convert at high DPI then scale down) - - Option to use adaptive DPI or fixed DPI with post-scaling - -- **Unified scaling logic** - - Same rules apply to all image sources (IMG, PDF pages) - - Scaling happens once at preprocessing stage - - Bbox coordinates scaled back to original for accurate cropping - -- **Configuration** - - `layout_image_scaling_min_dimension`: Minimum size before upscaling (default: 1200) - - Keep existing `layout_image_scaling_max_dimension` (2000) and `target_dimension` (1600) - -## Impact - -### Affected Specs -- `ocr-processing` - Modified scaling requirements - -### Affected Code -- `backend/app/core/config.py` - Add min_dimension setting -- `backend/app/services/layout_preprocessing_service.py` - Add upscaling logic -- `backend/app/services/ocr_service.py` - Optional: Adjust PDF DPI handling - -### Quality Impact - -| Scenario | Before | After | -|----------|--------|-------| -| Large image (3000px) | Scaled to 1600px | Same | -| Optimal image (1500px) | No scaling | Same | -| Small image (800px) | No scaling | Scaled to 1600px | -| PDF at 300 DPI | 2480px → 1600px | Same (or optimized DPI) | - -### Raw OCR Impact -- No change: Raw OCR continues to use original/converted images -- Upscaling only affects PP-Structure layout detection input - -## Risks - -1. **Upscaling quality**: Enlarging small images may introduce interpolation artifacts - - Mitigation: Use INTER_CUBIC or INTER_LANCZOS4 for upscaling - - Note: Layout detection cares about structure, not fine text detail - -2. **Memory for large upscaled images**: Small image scaled up uses more memory - - Mitigation: 800px → 1600px is 4x pixels, but 1600px is still reasonable - -3. 
**Breaking existing behavior**: Users may rely on current behavior
-   - Mitigation: Document the change, add config toggle if needed
diff --git a/openspec/changes/archive/2025-11-28-unify-image-scaling/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-28-unify-image-scaling/specs/ocr-processing/spec.md
deleted file mode 100644
index 370d145..0000000
--- a/openspec/changes/archive/2025-11-28-unify-image-scaling/specs/ocr-processing/spec.md
+++ /dev/null
@@ -1,42 +0,0 @@
-## MODIFIED Requirements
-
-### Requirement: Image Scaling for Layout Detection
-
-The system SHALL apply bidirectional image scaling to optimize PP-Structure layout detection accuracy:
-
-1. Images with longest side > `layout_image_scaling_max_dimension` (default: 2000px) SHALL be scaled DOWN to `layout_image_scaling_target_dimension` (default: 1600px)
-
-2. Images with longest side < `layout_image_scaling_min_dimension` (default: 1200px) SHALL be scaled UP to `layout_image_scaling_target_dimension` (default: 1600px)
-
-3. Images within the optimal range (min_dimension to max_dimension) SHALL NOT be scaled
-
-4. For downscaling, the system SHALL use `cv2.INTER_AREA` interpolation (best for shrinking)
-
-5. For upscaling, the system SHALL use `cv2.INTER_CUBIC` interpolation (smooth enlargement)
-
-6. The system SHALL track the scale factor and restore bounding box coordinates to original image space after layout detection
-
-7. Raw OCR and element extraction SHALL continue to use original/unscaled images
-
-#### Scenario: Large image is scaled down
-- **WHEN** an image has max dimension 3508px (> 2000px threshold)
-- **THEN** the image is scaled down to ~1600px on longest side
-- **AND** scale_factor is recorded as ~2.19 for bbox restoration
-- **AND** INTER_AREA interpolation is used
-
-#### Scenario: Small image is scaled up
-- **WHEN** an image has max dimension 800px (< 1200px threshold)
-- **THEN** the image is scaled up to ~1600px on longest side
-- **AND** scale_factor is recorded as ~0.5 for bbox restoration
-- **AND** INTER_CUBIC interpolation is used
-
-#### Scenario: Optimal size image is not scaled
-- **WHEN** an image has max dimension 1500px (within 1200-2000px range)
-- **THEN** the image is NOT scaled
-- **AND** scale_factor is 1.0
-- **AND** was_scaled is False
-
-#### Scenario: Bbox coordinates are restored after scaling
-- **WHEN** layout detection returns bbox [100, 200, 500, 600] on scaled image
-- **AND** scale_factor is 2.0 (image was scaled down by 0.5)
-- **THEN** final bbox is [200, 400, 1000, 1200] in original image coordinates
diff --git a/openspec/changes/archive/2025-11-28-unify-image-scaling/tasks.md b/openspec/changes/archive/2025-11-28-unify-image-scaling/tasks.md
deleted file mode 100644
index 3e953f0..0000000
--- a/openspec/changes/archive/2025-11-28-unify-image-scaling/tasks.md
+++ /dev/null
@@ -1,113 +0,0 @@
-# Tasks: Unify Image Scaling Strategy
-
-## 1. Configuration
-
-- [x] 1.1 Add min_dimension setting to `backend/app/core/config.py`
-  - `layout_image_scaling_min_dimension: int = 1200`
-  - Description: "Min dimension (pixels) before upscaling. Images smaller than this will be scaled up."
-
-## 2. 
Bidirectional Scaling Logic - -- [x] 2.1 Update `scale_for_layout_detection()` in `layout_preprocessing_service.py` - - Add upscaling condition: `max_dim < min_dimension` - - Use `cv2.INTER_CUBIC` for upscaling (better quality than INTER_LINEAR) - - Update docstring to reflect bidirectional behavior - -- [x] 2.2 Update scaling decision logic - ```python - # Current: only downscale - should_scale = max_dim > max_dimension - - # New: bidirectional - should_downscale = max_dim > max_dimension - should_upscale = max_dim < min_dimension - should_scale = should_downscale or should_upscale - ``` - -- [x] 2.3 Update logging to indicate scale direction - - "Scaled DOWN for layout detection: 2480x3508 -> 1131x1600" - - "Scaled UP for layout detection: 800x600 -> 1600x1200" - -## 3. PDF DPI Handling (Optional Optimization) - -- [x] 3.1 Evaluate current PDF conversion impact - - Decision: Keep 300 DPI, let bidirectional scaling handle it - - Reason: Raw OCR benefits from high resolution, scaling handles PP-Structure needs - -- [x] 3.2 Option A: Keep 300 DPI, let scaling handle it ✓ - - Simplest approach, no change needed - - Raw OCR benefits from high resolution - -- [ ] ~~3.3 Option B: Add configurable PDF DPI~~ (Not needed) - -## 4. Testing - -- [x] 4.1 Test upscaling with small images - - Small image (800x600): Scaled UP → 1600x1200, scale_factor=0.500 - - Very small (400x300): Scaled UP → 1600x1200, scale_factor=0.250 - -- [x] 4.2 Test no scaling for optimal range - - Optimal image (1500x1000): was_scaled=False, scale_factor=1.000 - -- [x] 4.3 Test downscaling (existing behavior) - - Large image (2480x3508): Scaled DOWN → 1131x1600, scale_factor=2.192 - -- [ ] 4.4 Test PDF workflow (manual test recommended) - - PDF page should be detected correctly - - Scaling should apply after PDF conversion - -## 5. Documentation - -- [x] 5.1 Update config.py Field descriptions - - Explained bidirectional scaling in enabled field description - - Updated max/min/target descriptions - -- [x] 5.2 Add logging for scaling decisions - - Logs direction (UP/DOWN), original size, target size, scale_factor - ---- - -## Implementation Summary - -**Files Modified:** -- `backend/app/core/config.py` - Added `layout_image_scaling_min_dimension` setting -- `backend/app/services/layout_preprocessing_service.py` - Updated bidirectional scaling logic - -**Test Results (2025-11-27):** -| Test Case | Original | Result | scale_factor | -|-----------|----------|--------|--------------| -| Small (800×600) | max=800 < 1200 | UP → 1600×1200 | 0.500 | -| Optimal (1500×1000) | 1200 ≤ 1500 ≤ 2000 | No scaling | 1.000 | -| Large (2480×3508) | max=3508 > 2000 | DOWN → 1131×1600 | 2.192 | -| Very small (400×300) | max=400 < 1200 | UP → 1600×1200 | 0.250 | - ---- - -## Implementation Notes - -### Scaling Decision Matrix - -| Image Size | Action | Scale Factor | Interpolation | -|------------|--------|--------------|---------------| -| < 1200px | Scale UP | target/max_dim | INTER_CUBIC | -| 1200-2000px | No scaling | 1.0 | N/A | -| > 2000px | Scale DOWN | target/max_dim | INTER_AREA | - -### Example Scenarios - -1. **Small scan (800×600)** - - max_dim = 800 < 1200 → Scale UP - - target = 1600, scale = 1600/800 = 2.0 - - Result: 1600×1200 - - scale_factor (for bbox restore) = 0.5 - -2. **Optimal image (1400×1000)** - - max_dim = 1400, 1200 <= 1400 <= 2000 → No scaling - - Result: unchanged - - scale_factor = 1.0 - -3. 
**High-res scan (2480×3508)**
-   - max_dim = 3508 > 2000 → Scale DOWN
-   - target = 1600, scale = 1600/3508 = 0.456
-   - Result: 1131×1600
-   - scale_factor (for bbox restore) = 2.19
diff --git a/openspec/changes/archive/2025-11-30-extract-table-cell-boxes/proposal.md b/openspec/changes/archive/2025-11-30-extract-table-cell-boxes/proposal.md
deleted file mode 100644
index 149667e..0000000
--- a/openspec/changes/archive/2025-11-30-extract-table-cell-boxes/proposal.md
+++ /dev/null
@@ -1,134 +0,0 @@
-# Change: Extract Table Cell Boxes via Direct Model Invocation
-
-## Why
-
-When processing tables, the high-level PPStructureV3 API (PaddleX 3.x) only outputs table content as HTML and **does not return per-cell coordinates (bbox)**.
-
-### Problem Analysis
-
-Confirmed through testing:
-
-```python
-# PPStructureV3 output (parsing_res_list)
-{
-    'block_label': 'table',
-    'block_content': '...',  # HTML only
-    'block_bbox': [84, 269, 1174, 1508],  # bbox of the whole table only
-    # ❌ no cell boxes
-}
-```
-
-However, the underlying model (SLANeXt) **does output cell boxes**:
-
-```python
-# Invoke the SLANeXt model directly
-from paddlex import create_model
-table_model = create_model('SLANeXt_wired')
-result = table_model.predict(table_img)
-# result.json['res']['bbox'] → 29 cell coordinates (8-point polygons)
-```
-
-### Impact
-
-Missing cell boxes cause:
-- Inaccurate table rendering in the OCR track's layout-preserving PDF
-- No way to precisely locate each cell
-- Table content that may overlap or be misplaced
-
-## What Changes
-
-### Approach: Supplementary Invocation of the Underlying SLANeXt Model
-
-When `pp_structure_enhanced.py` processes a table, additionally invoke the underlying PaddleX model to obtain cell boxes:
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│ Modified Flow                                                │
-├─────────────────────────────────────────────────────────────┤
-│                                                              │
-│ PPStructureV3.predict()                                      │
-│   │                                                          │
-│   ▼                                                          │
-│ parsing_res_list (HTML only)                                 │
-│   │                                                          │
-│   ▼ (for TABLE elements)                                     │
-│ ┌─────────────────────────────────────┐                      │
-│ │ Supplementary model invocation      │                      │
-│ │ 1. Crop the table region            │                      │
-│ │ 2. Call SLANeXt for cell boxes      │                      │
-│ │ 3. Convert coords to global space   │                      │
-│ │ 4. Store in element['cell_boxes']   │                      │
-│ └─────────────────────────────────────┘                      │
-│   │                                                          │
-│   ▼                                                          │
-│ Complete table element (HTML + cell_boxes)                   │
-│                                                              │
-└─────────────────────────────────────────────────────────────┘
-```
-
-### Model Selection Logic
-
-Select the SLANeXt model matching the table type:
-
-| Table type | Detection method | Model |
-|-----------|------------------|-------|
-| Wired (bordered) | PP-LCNet classification | SLANeXt_wired |
-| Wireless (borderless) | PP-LCNet classification | SLANeXt_wireless |
-
-### Cell Boxes Format
-
-SLANeXt outputs bboxes as 8-point polygons:
-```python
-[x1, y1, x2, y2, x3, y3, x4, y4]  # four corner points
-# e.g. [11, 4, 692, 5, 675, 57, 10, 56]
-```
-
-These must be converted to global coordinates (by adding the table offset).
-
-## Impact
-
-### Affected Specs
-- `ocr-processing` - Table processing enhancement
-
-### Affected Code
-- `backend/app/services/pp_structure_enhanced.py`
-  - Add an underlying-model cache
-  - Modify the TABLE handling in `_process_parsing_res_list`
-  - Add cell box extraction and coordinate conversion
-
-- `backend/app/services/pdf_generator_service.py`
-  - Use cell_boxes to improve table rendering
-
-### Quality Impact
-
-| Item | Before | After |
-|------|--------|-------|
-| Cell coordinates | ❌ None | ✅ Available (8-point polygons) |
-| Table rendering | Rows/columns evenly distributed | Precisely positioned |
-| Layout restoration | Content may overlap | Accurately aligned |
-
-### Performance Impact
-
-- Extra model invocation: each table requires one additional SLANeXt call
-- Cache optimization: model instances can be cached to avoid repeated loading
-- Estimated overhead: ~0.5-1 second per table
-
-## Risks
-
-1. **Performance overhead**
-   - Risk: extra model calls increase processing time
-   - Mitigation: cache model instances; invoke only when needed
-
-2. **Model inconsistency**
-   - Risk: PPStructureV3 may internally use models with different parameters
-   - Mitigation: use the same model configuration
-
-3. **Coordinate conversion errors**
-   - Risk: the bbox coordinate systems may differ
-   - Mitigation: test thoroughly to ensure correct conversion
-
-## Not Included
-
-- Bypassing PPStructureV3 entirely (kept for layout analysis)
-- RT-DETR cell detection (possible follow-up enhancement)
-- Enhanced handling of other element types
diff --git a/openspec/changes/archive/2025-11-30-extract-table-cell-boxes/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-11-30-extract-table-cell-boxes/specs/ocr-processing/spec.md
deleted file mode 100644
index 46812c8..0000000
--- a/openspec/changes/archive/2025-11-30-extract-table-cell-boxes/specs/ocr-processing/spec.md
+++ /dev/null
@@ -1,132 +0,0 @@
-# Spec: OCR Processing - Table Cell Boxes Extraction
-
-## Overview
-
-When the OCR track processes tables, additionally invoke the underlying PaddleX SLANeXt model to obtain per-cell coordinate information.
-
-## Requirements
-
-### 1. Model Management
-
-#### 1.1 Model Cache
-```python
-class PPStructureEnhanced:
-    def __init__(self, structure_engine):
-        self.structure_engine = structure_engine
-        # Cache for underlying models
-        self._table_cls_model = None
-        self._wired_table_model = None
-        self._wireless_table_model = None
-```
-
-#### 1.2 Lazy Loading
-- Models are loaded only on first use
-- Use the `paddlex.create_model()` API
-- Model configuration is read from settings
-
-### 2. Cell Boxes Extraction Flow
-
-#### 2.1 Trigger Condition
-Triggered when `mapped_type == ElementType.TABLE` and a valid `block_bbox` is present.
-
-#### 2.2 Processing Steps
-
-```
-1. Crop the table image
-   - Crop from the original image using block_bbox
-   - Clamp the bounds to the image dimensions
-
-2. Classify the table type (optional)
-   - Call PP-LCNet_x1_0_table_cls
-   - Obtain the wired/wireless classification
-   - Or reuse PPStructureV3's internal classification result
-
-3. Call the matching SLANeXt model
-   - wired → SLANeXt_wired
-   - wireless → SLANeXt_wireless
-
-4. Extract cell boxes
-   - Read from result.json['res']['bbox']
-   - Format: [[x1,y1,x2,y2,x3,y3,x4,y4], ...]
-
-5. Convert coordinates
-   - Convert relative coordinates to global coordinates
-   - global_box = [box[i] + offset for each point]
-   - offset = (table_x, table_y) from block_bbox
-
-6. Store in element
-   - element['cell_boxes'] = processed_boxes
-   - element['cell_boxes_format'] = 'polygon_8'
-```
-
-### 3. Data Format
-
-#### 3.1 Cell Boxes Structure
-```python
-element = {
-    'element_id': 'pp3_0_3',
-    'type': ElementType.TABLE,
-    'bbox': [84, 269, 1174, 1508],  # overall table bbox
-    'content': '...',  # HTML content
-    'cell_boxes': [  # new: cell coordinates
-        [95, 273, 776, 274, 759, 326, 94, 325],  # cell 0 (global coordinates)
-        [119, 296, 575, 295, 560, 399, 117, 401],  # cell 1
-        # ...
-    ],
-    'cell_boxes_format': 'polygon_8',  # coordinate format descriptor
-    'table_type': 'wired',  # optional: table type
-}
-```
-
-#### 3.2 Coordinate Format
-- `polygon_8`: 8-point polygon `[x1,y1,x2,y2,x3,y3,x4,y4]`
-- Order: top-left → top-right → bottom-right → bottom-left
-
-### 4. Error Handling
-
-#### 4.1 Failure Cases
-- Model loading fails
-- Image cropping fails
-- Prediction returns an empty result
-
-#### 4.2 Handling Strategy
-- Log a warning
-- Continue processing; the element simply lacks cell_boxes
-- The original HTML extraction flow is unaffected
-
-### 5. Configuration
-
-```python
-# config.py
-class Settings:
-    # Whether to enable cell boxes extraction
-    enable_table_cell_boxes_extraction: bool = True
-
-    # Table structure recognition models (already exist)
-    wired_table_model_name: str = "SLANeXt_wired"
-    wireless_table_model_name: str = "SLANeXt_wireless"
-```
-
-## Implementation Notes
-
-### Model Sharing
-PPStructureV3 already loads these models internally, but the high-level API does not expose them.
-Calling `paddlex.create_model()` directly reloads the models.
-Accessing PPStructureV3's internal model instances was considered (tested: not feasible).
-
-### Performance Optimization
-- Model instances are cached in PPStructureEnhanced
-- Avoids reloading models for every table
-- Consider releasing the cache under memory pressure
-
-### Coordinate Scaling
-If the image was scaled before layout analysis (ScalingInfo),
-cell box coordinates must be scaled back to the original coordinate system as well.
-
-## Test Cases
-
-1. **Wired tables**: confirm cell boxes are extracted correctly
-2. **Wireless tables**: confirm model selection and extraction are correct
-3. **Complex tables**: tables with row/column spans
-4. **Small tables**: simple tables with few cells
-5. **Error handling**: invalid bbox, model failures, and similar cases
diff --git a/openspec/changes/archive/2025-11-30-extract-table-cell-boxes/tasks.md b/openspec/changes/archive/2025-11-30-extract-table-cell-boxes/tasks.md
deleted file mode 100644
index 7208f17..0000000
--- a/openspec/changes/archive/2025-11-30-extract-table-cell-boxes/tasks.md
+++ /dev/null
@@ -1,273 +0,0 @@
-# Tasks: Extract Table Cell Boxes
-
-## Key Finding (2025-11-28)
-
-**PPStructureV3 (PaddleX 3.3.9) does provide `table_res_list`!**
-
-The earlier implementation assumed an extra SLANeXt call was needed, but deeper testing showed:
-- `result.json['res']['table_res_list']` contains `cell_box_list` for every table
-- No extra model invocation is required
-- The redundant SLANeXt code has been removed
-
-## Phase 1: Infrastructure (Completed)
-
-### Task 1.1: Configuration
-- [x] ~~Add `enable_table_cell_boxes_extraction` setting~~ (removed, no longer needed)
-- [x] Confirm PPStructureV3 provides `table_res_list`
-
-### Task 1.2: Model Cache Mechanism
-- [x] ~~Implement SLANeXt model cache~~ (removed, no longer needed)
-- [x] Use PPStructureV3's built-in `table_res_list` directly
-
-## Phase 2: Cell Boxes Extraction (Completed)
-
-### Task 2.1: Extract from table_res_list
-- [x] Read `cell_box_list` from `result.json['res']['table_res_list']`
-- [x] Match tables by HTML content
-- [x] Verify coordinate format (already absolute coordinates)
-
-### Task 2.2: Image-in-Table Handling
-- [x] Get image boxes from `layout_det_res`
-- [x] Detect images inside tables
-- [x] Crop and save the images
-- [x] Embed them into the table HTML
-
-## Phase 3: PDF Generation Optimization (Completed)
-
-### Task 3.1: ~~Infer Grid from Cell Boxes~~ (Abandoned)
-- [x] ~~Modify `draw_table_region` to use cell_boxes~~
-- [x] ~~Compute row heights and column widths from actual cell positions~~
-- [x] Test rendering → **problem found: HTML structure does not match cell_boxes**
-
-### Task 3.2: Plan B - Layered Rendering ✓ Completed
-
-**Problem analysis (2025-11-30)**:
-- The HTML table structure does not match cell_boxes, so the grid cannot be inferred correctly
-- Drawing text inside cells failed (text overflowed borders, wrong matches)
-
-**Solution**: layered rendering - separate table borders from text drawing
-- Layer 1: draw table borders from cell_boxes
-- Layer 2: draw text at raw OCR positions (independent of table structure)
-- Layer 3: draw embedded_images
-
-**Implementation steps (2025-11-30)**:
-- [x] Modify `GapFillingService._is_region_covered()` - skip coverage checks for TABLE elements
-- [x] Simplify `_draw_table_with_cell_boxes()` - draw borders + images only
-- [x] Modify `regions_to_avoid` - exclude tables so text can pass through table regions
-- [x] Integration test: test_layered_rendering.py
-
-### Task 3.3: Fallback
-- [x] Use ReportLab Table when cell_boxes are unavailable
-- [x] Ensure backward compatibility
-
-## Phase 4: Testing & Validation (Completed)
-
-### Task 4.1: Unit Tests
-- [x] Test cell_box_list extraction (29 cells, success)
-- [x] Test Image-in-Table handling (1 image embedded)
-- [x] Test error handling
-
-### Task 4.2: Integration Tests
-- [x] Test the OCR track with a real PDF (test_layered_rendering.py)
-- [x] Verify PDF layout restoration
-- [x] Layered rendering test results:
-  - 50 text elements (supplemented from raw OCR; originally only 5)
-  - 31 cell_boxes (8 + 23)
-  - 1 embedded_image
-  - PDF generated successfully (57,290 bytes)
-
-## Phase 5: Cleanup (Completed)
-
-### Task 5.1: Remove Old Code
-- [x] Remove the SLANeXt model cache code
-- [x] Remove `_get_slanet_model()`, `_get_table_classifier()`, `_extract_cell_boxes_with_slanet()`, `release_slanet_models()`
-- [x] Remove the `enable_table_cell_boxes_extraction` setting
-- [x] Clean up debug logging
-
----
-
-## Technical Details
-
-### Key Code Locations
-
-| File | Change |
-|------|--------|
-| `backend/app/core/config.py` | Removed `enable_table_cell_boxes_extraction` |
-| `backend/app/services/pp_structure_enhanced.py` | Use `table_res_list`; added `_embed_images_in_table()` |
-| `backend/app/services/pdf_generator_service.py` | Layered rendering: draw borders only; exclude table regions from text filtering |
-| `backend/app/services/gap_filling_service.py` | `_is_region_covered()` skips TABLE elements |
-| `backend/tests/test_layered_rendering.py` | Layered rendering integration test |
-
-### PPStructureV3 Data Structure
-
-```python
-result.json = {
-    'res': {
-        'parsing_res_list': [...],  # parsing results
-        'layout_det_res': {...},  # layout detection results
-        'table_res_list': [  # table recognition results
-            {
-                'cell_box_list': [[x1,y1,x2,y2], ...],  # ← the key field!
-                'pred_html': '...',
-                'table_ocr_pred': {...}
-            }
-        ],
-        'overall_ocr_res': {...}
-    }
-}
-```
-
-### Test Results
-
-- Task ID: `442f9345-09ba-4a7d-949f-3bc88c2fa895`
-- cell_boxes: 29 cells (source: table_res_list)
-- embedded_images: 1 (img_in_table_935_838_1118_1031)
-
-### Local vs Cloud Differences
-
-| Feature | Local PaddleX 3.3.9 | Cloud pp_demo |
-|---------|---------------------|---------------|
-| `table_res_list` | ✓ Provided | ✓ Provided |
-| `cell_box_list` | ✓ 29 cells | ✓ 27+8 cells |
-| Layout detection | 1 merged table | 2 separate tables |
-| Image-in-Table | Manual handling required | Auto-embedded in HTML |
-
-### Known Issues
-
-1. **Layout detection merges tables**: the local layout model merges multiple tables into one large table
-   - As a result, `table_res_list` contains only 1 table
-   - The cloud service detects 2 separate tables
-   - May require tuning layout model parameters or post-processing logic
-
----
-
-## Layered Rendering Design (2025-11-30)
-
-### Root Cause
-
-ReportLab Table requires a regular rectangular grid, but PPStructureV3's cell_boxes reflect actual visual positions and do not match the HTML's logical structure. Drawing text inside cells leads to:
-- Text overflowing borders
-- Wrong matches
-- Some text being lost
-
-### Solution: Layered Rendering
-
-Decouple table rendering into three independent layers:
-
-```
-┌─────────────────────────────────────────────────┐
-│ Layer 3: Embedded Images                        │
-│ (taken from metadata['embedded_images'])        │
-├─────────────────────────────────────────────────┤
-│ Layer 2: Text at Raw OCR Positions              │
-│ (raw OCR supplemented by GapFillingService)     │
-├─────────────────────────────────────────────────┤
-│ Layer 1: Table Cell Borders                     │
-│ (drawn from metadata['cell_boxes'])             │
-└─────────────────────────────────────────────────┘
-```
-
-### Implementation Details
-
-**1. GapFillingService change** (`_is_region_covered`):
-```python
-# Skip coverage checks for TABLE elements so text inside tables passes through
-if skip_table_coverage and element.type == ElementType.TABLE:
-    continue
-```
-
-**2. PDF generator change** (`regions_to_avoid`):
-```python
-# Exclude tables; only avoid overlapping with images
-regions_to_avoid = [img for img in images_metadata if img.get('type') != 'table']
-```
-
-**3. Simplified `_draw_table_with_cell_boxes`**:
-```python
-def _draw_table_with_cell_boxes(...):
-    """Draw only borders and images; text is not handled here"""
-    # 1. Draw the border of each cell
-    for box in cell_boxes:
-        pdf_canvas.rect(x, y, width, height, stroke=1, fill=0)
-
-    # 2. Draw embedded_images
-    for img in embedded_images:
-        self._draw_embedded_image(...)
-```
-
-### Advantages
-
-1. **Decoupled**: border rendering and text rendering are fully independent
-2. **Precise**: text positions come straight from OCR results; no inference needed
-3. **Stable**: unaffected by mismatches between cell_boxes and HTML
-4. **Compatible**: the overall_ocr_res.png rendering from the visualization output can be reproduced directly
-
-### Test Results
-
-- Task ID: `84899366-f361-44f1-b989-5aba72419ca5`
-- cell_boxes: 31 (8 + 23)
-- Original text elements: 5
-- After supplementation: 50 text elements (from raw OCR)
-- PDF size: 57,290 bytes
-
----
-
-## Hybrid Rendering Optimization (2025-11-30)
-
-### Problems Found
-
-Issues remained after layered rendering:
-1. Skewed tables: cell_boxes carry coordinate deviations of 2-11 pixels
-2. Styles for Title and similar elements not applied: the OCR track applied no styles
-
-### Solution: Hybrid Rendering + Grid Alignment
-
-**1. Cell boxes grid alignment** (`_normalize_cell_boxes_to_grid`):
-```python
-def _normalize_cell_boxes_to_grid(self, cell_boxes, threshold=10.0):
-    """
-    Cluster nearby coordinates into unified values, removing the 2-11 pixel deviations.
-    - Collect all X/Y coordinates
-    - Cluster coordinates that fall within the threshold
-    - Use the cluster average as the aligned coordinate
-    """
-```
-
-**2. Element type styles** (OCR track):
-```python
-# Add element type checks in draw_text_region
-element_type = region.get('element_type', 'text')
-
-if element_type == 'title':
-    font_size = min(font_size * 1.3, 36)  # 30% larger
-elif element_type == 'header':
-    font_size = min(font_size * 1.15, 24)  # 15% larger
-elif element_type == 'caption':
-    font_size = max(font_size * 0.9, 6)  # 10% smaller
-```
-
-**3. Element type propagation**:
-```python
-# Added in convert_unified_document_to_ocr_data
-text_region = {
-    'text': text_content,
-    'bbox': bbox_polygon,
-    'element_type': element.type.value  # new
-}
-```
-
-### Results After Improvement
-
-| Item | Before | After |
-|------|--------|-------|
-| Table borders | Skewed (2-11px deviation) | Grid-aligned |
-| Title style | None (same as plain text) | 36pt enlarged font |
-| Hybrid rendering | Raw OCR only | PP-Structure + raw OCR |
-
-### Test Results (2025-11-30)
-
-- Task ID: `3a3f350f-2d81-4af4-8a18-021ea09ac433`
-- Table 1: 8 cell_boxes → grid-aligned
-- Table 2: 23 cell_boxes → grid-aligned + 1 embedded image
-- Title: Applied title style: size=36.0
-- PDF size: 104,082 bytes
diff --git a/openspec/changes/archive/2025-12-02-add-document-translation/design.md b/openspec/changes/archive/2025-12-02-add-document-translation/design.md
deleted file mode 100644
index 5b66bfb..0000000
--- a/openspec/changes/archive/2025-12-02-add-document-translation/design.md
+++ /dev/null
@@ -1,265 +0,0 @@
-# Design: Document Translation Feature
-
-## Context
-
-Tool_OCR processes documents through three tracks (Direct/OCR/Hybrid) and outputs UnifiedDocument JSON. Users need translation capability to convert extracted text into different languages while preserving document structure.
-
-### Constraints
-- Must use DIFY AI service for translation
-- API-based solution (no local model management)
-- Translation quality depends on DIFY's underlying model
-
-### Stakeholders
-- End users: Need translated documents
-- System: Simple HTTP-based integration
-
-## Goals / Non-Goals
-
-### Goals
-- Translate documents using DIFY AI API
-- Preserve document structure (element positions, formatting)
-- Support all three processing tracks with unified logic
-- Real-time progress feedback to users
-- Simple, maintainable API integration
-
-### Non-Goals
-- Local model inference (replaced by DIFY API)
-- GPU memory management (not needed)
-- Translation memory or glossary support
-- Concurrent translation processing
-
-## Decisions
-
-### Decision 1: Translation Provider
-
-**Choice**: DIFY AI Service (theaken.com)
-
-**Configuration**:
-- Base URL: `https://dify.theaken.com/v1`
-- Endpoint: `POST /chat-messages`
-- API Key: `app-YOPrF2ro5fshzMkCZviIuUJd`
-- Mode: Chat (Blocking response)
-
-**Rationale**:
-- High-quality cloud AI translation
-- No local model management required
-- No GPU memory concerns
-- Easy to maintain and update
-
-### Decision 2: Response Mode
-
-**Choice**: Blocking Mode
-
-**API Request Format**:
-```json
-{
-  "inputs": {},
-  "query": "Translate the following text to Chinese:\n\nHello world",
-  "response_mode": "blocking",
-  "conversation_id": "",
-  "user": "tool-ocr-{task_id}"
-}
-```
-
-**API Response Format**:
-```json
-{
-  "event": "message",
-  "answer": "你好世界",
-  "conversation_id": "xxx",
-  "metadata": {
-    "usage": {
-      "total_tokens": 54,
-      "latency": 1.26
-    }
-  }
-}
-```
-
-**Rationale**:
-- Simpler implementation than streaming
-- Adequate for batch text translation
-- Complete response in single call
-
-### Decision 3: Translation Batch Format
-
-**Choice**: Single text per request with translation prompt
-
-**Request Format**:
-```
-Translate the following text to {target_language}.
-Return ONLY the translated text, no explanations.
- -{text_content} -``` - -**Rationale**: -- Clear instruction for AI -- Predictable response format -- Easy to parse result - -### Decision 4: Translation Result Storage - -**Choice**: Independent JSON file per language (unchanged from previous design) - -``` -backend/storage/results/{task_id}/ -├── xxx_result.json # Original -├── xxx_translated_en.json # English translation -├── xxx_translated_ja.json # Japanese translation -└── ... -``` - -**Rationale**: -- Non-destructive (original preserved) -- Multiple languages supported -- Easy to manage and delete -- Clear file naming convention - -### Decision 5: Element Type Handling - -**Translatable types** (content is string): -- `text`, `title`, `header`, `footer`, `paragraph`, `footnote` - -**Special handling** (content is dict): -- `table` -> Translate `cells[].content` - -**Skip** (non-text content): -- `page_number`, `image`, `chart`, `logo`, `reference` - -## Architecture - -### Component Diagram - -``` -┌─────────────────────────────────────────────────────────────┐ -│ Frontend │ -│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │ -│ │ TaskDetail │ │ TranslateBtn │ │ ProgressDisplay │ │ -│ └─────────────┘ └──────────────┘ └─────────────────────┘ │ -└────────────────────────────┬────────────────────────────────┘ - │ HTTP -┌────────────────────────────▼────────────────────────────────┐ -│ Backend API │ -│ ┌─────────────────────────────────────────────────────────┐│ -│ │ TranslateRouter ││ -│ │ POST /api/v2/translate/{task_id} ││ -│ │ GET /api/v2/translate/{task_id}/status ││ -│ │ GET /api/v2/translate/{task_id}/result ││ -│ └─────────────────────────────────────────────────────────┘│ -└────────────────────────────┬────────────────────────────────┘ - │ -┌────────────────────────────▼────────────────────────────────┐ -│ TranslationService │ -│ ┌───────────────┐ ┌───────────────┐ ┌─────────────────┐ │ -│ │ DifyClient │ │ BatchBuilder │ │ ResultParser │ │ -│ │ - translate() │ │ - extract() │ │ - parse() │ │ -│ │ - chat() │ │ - format() │ │ - map_ids() │ │ -│ └───────────────┘ └───────────────┘ └─────────────────┘ │ -└────────────────────────────┬────────────────────────────────┘ - │ HTTPS -┌────────────────────────────▼────────────────────────────────┐ -│ DIFY AI Service │ -│ https://dify.theaken.com/v1 │ -│ (Chat - Blocking) │ -└─────────────────────────────────────────────────────────────┘ -``` - -### Translation JSON Schema - -```json -{ - "schema_version": "1.0.0", - "source_document": "xxx_result.json", - "source_lang": "auto", - "target_lang": "en", - "provider": "dify", - "translated_at": "2025-12-02T12:00:00Z", - "statistics": { - "total_elements": 50, - "translated_elements": 45, - "skipped_elements": 5, - "total_characters": 5000, - "processing_time_seconds": 30.5, - "total_tokens": 2500 - }, - "translations": { - "pp3_0_0": "Company Profile", - "pp3_0_1": "Founded in 2020...", - "table_1_0": { - "cells": [ - {"row": 0, "col": 0, "content": "Technology"}, - {"row": 0, "col": 1, "content": "Epoxy"} - ] - } - } -} -``` - -### Language Code Mapping - -```python -LANGUAGE_NAMES = { - "en": "English", - "zh-TW": "Traditional Chinese", - "zh-CN": "Simplified Chinese", - "ja": "Japanese", - "ko": "Korean", - "de": "German", - "fr": "French", - "es": "Spanish", - "pt": "Portuguese", - "it": "Italian", - "ru": "Russian", - "vi": "Vietnamese", - "th": "Thai", - # Additional languages as needed -} -``` - -## Risks / Trade-offs - -### Risk 1: API Availability -- **Risk**: DIFY service downtime affects translation -- **Mitigation**: 
Add timeout handling, retry logic, graceful error messages - -### Risk 2: API Cost -- **Risk**: High volume translation increases cost -- **Mitigation**: Monitor usage via metadata, consider rate limiting - -### Risk 3: Network Latency -- **Risk**: Each translation request adds network latency -- **Mitigation**: Batch text when possible, show progress to user - -### Risk 4: Translation Quality Variance -- **Risk**: AI translation quality varies by language pair -- **Mitigation**: Document known limitations, allow user feedback - -## Migration Plan - -### Phase 1: Core Translation (This Proposal) -1. DIFY client implementation -2. Backend translation service (rewrite) -3. API endpoints (modify) -4. Frontend activation - -### Phase 2: Enhanced Features (Future) -1. Translated PDF generation -2. Translation caching -3. Custom terminology support - -### Rollback -- Translation is additive feature -- No schema changes to existing data -- Can disable by removing router registration - -## Open Questions - -1. **Rate Limiting**: Should we limit requests per minute to DIFY API? - - Tentative: 10 requests per minute per user - -2. **Retry Logic**: How to handle API failures? - - Tentative: Retry up to 3 times with exponential backoff - -3. **Batch Size**: How many elements per API call? - - Tentative: 1 element per call for simplicity, optimize later if needed diff --git a/openspec/changes/archive/2025-12-02-add-document-translation/proposal.md b/openspec/changes/archive/2025-12-02-add-document-translation/proposal.md deleted file mode 100644 index 1fb7f91..0000000 --- a/openspec/changes/archive/2025-12-02-add-document-translation/proposal.md +++ /dev/null @@ -1,54 +0,0 @@ -# Change: Add Document Translation Feature - -## Why - -Users need to translate OCR-processed documents into different languages while preserving the original layout. Currently, the system only extracts text but cannot translate it. This feature enables multilingual document processing using DIFY AI service, providing high-quality translations with simple API integration. 
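-As a sense check of the integration's weight, the entire translation call is one blocking HTTP request; a minimal sketch using httpx (the Bearer-token Authorization header is an assumption here, and error handling is elided):
-
-```python
-# Hedged sketch of one blocking DIFY chat-messages call.
-import httpx
-
-DIFY_BASE_URL = "https://dify.theaken.com/v1"
-
-def dify_translate(text: str, target_language: str, api_key: str, task_id: str) -> str:
-    payload = {
-        "inputs": {},
-        "query": (
-            f"Translate the following text to {target_language}.\n"
-            "Return ONLY the translated text, no explanations.\n\n"
-            f"{text}"
-        ),
-        "response_mode": "blocking",
-        "conversation_id": "",
-        "user": f"tool-ocr-{task_id}",
-    }
-    resp = httpx.post(
-        f"{DIFY_BASE_URL}/chat-messages",
-        json=payload,
-        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
-        timeout=120,  # DIFY_TIMEOUT noted in the task list
-    )
-    resp.raise_for_status()
-    return resp.json()["answer"]  # translated text per the documented response
-```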
- -## What Changes - -- **NEW**: Translation service using DIFY AI API (Chat mode, Blocking) -- **NEW**: Translation REST API endpoints (`/api/v2/translate/*`) -- **NEW**: Translation result JSON format (independent file per target language) -- **UPDATE**: Frontend translation UI activation with progress display -- **REMOVED**: Local MADLAD-400-3B model (replaced with DIFY API) -- **REMOVED**: GPU memory management for translation (no longer needed) - -## Impact - -- Affected specs: - - NEW `specs/translation/spec.md` - Core translation capability - - MODIFY `specs/result-export/spec.md` - Add translation JSON export format - -- Affected code: - - `backend/app/services/translation_service.py` (REWRITE - use DIFY API) - - `backend/app/routers/translate.py` (MODIFY) - - `backend/app/schemas/translation.py` (MODIFY) - - `frontend/src/pages/TaskDetailPage.tsx` (MODIFY) - - `frontend/src/services/api.ts` (MODIFY) - -## Technical Summary - -### Translation Service -- Provider: DIFY AI (theaken.com) -- Mode: Chat (Blocking response) -- Base URL: `https://dify.theaken.com/v1` -- Endpoint: `POST /chat-messages` -- API Key: `app-YOPrF2ro5fshzMkCZviIuUJd` - -### Benefits over Local Model -| Aspect | DIFY API | Local MADLAD-400 | -|--------|----------|------------------| -| Quality | High (cloud AI) | Variable | -| Setup | No model download | 12GB download | -| GPU Usage | None | 2-3GB VRAM | -| Latency | ~1-2s per request | Fast after load | -| Maintenance | API provider managed | Self-managed | - -### Data Flow -1. Read `xxx_result.json` (UnifiedDocument format) -2. Extract translatable elements (text, title, header, footer, paragraph, footnote, table cells) -3. Send to DIFY API with translation prompt -4. Parse response and save to `xxx_translated_{lang}.json` - -### Unified Processing -All three tracks (Direct/OCR/Hybrid) use the same UnifiedDocument format, enabling unified translation logic without track-specific handling. diff --git a/openspec/changes/archive/2025-12-02-add-document-translation/specs/result-export/spec.md b/openspec/changes/archive/2025-12-02-add-document-translation/specs/result-export/spec.md deleted file mode 100644 index 3b8ac65..0000000 --- a/openspec/changes/archive/2025-12-02-add-document-translation/specs/result-export/spec.md +++ /dev/null @@ -1,55 +0,0 @@ -## ADDED Requirements - -### Requirement: Translation Result JSON Export - -The system SHALL support exporting translation results as independent JSON files following a defined schema. 
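-A sketch of the JSON writer this requirement implies; field names follow the schema in the design document, and the `stats` dict is assumed to be assembled by the caller:
-
-```python
-# Hedged sketch of the translation JSON writer; statistics keys follow the
-# schema shown in the design document.
-import json
-from datetime import datetime, timezone
-from pathlib import Path
-
-def save_translation_result(result_dir: Path, stem: str, target_lang: str,
-                            translations: dict, stats: dict) -> Path:
-    payload = {
-        "schema_version": "1.0.0",
-        "source_document": f"{stem}_result.json",
-        "source_lang": "auto",
-        "target_lang": target_lang,
-        "provider": "dify",
-        "translated_at": datetime.now(timezone.utc).isoformat(),
-        "statistics": stats,
-        "translations": translations,  # element_id -> str or {"cells": [...]}
-    }
-    out_path = result_dir / f"{stem}_translated_{target_lang}.json"
-    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2))
-    return out_path
-```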
- -#### Scenario: Export translation result JSON -- **WHEN** translation completes for a document -- **THEN** system SHALL save translation to `{filename}_translated_{lang}.json` -- **AND** file SHALL be stored alongside original `{filename}_result.json` -- **AND** original result file SHALL remain unchanged - -#### Scenario: Translation JSON schema compliance -- **WHEN** translation result is saved -- **THEN** JSON SHALL include schema_version field ("1.0.0") -- **AND** SHALL include source_document reference -- **AND** SHALL include source_lang and target_lang -- **AND** SHALL include provider identifier (e.g., "dify") -- **AND** SHALL include translated_at timestamp -- **AND** SHALL include translations dict mapping element_id to translated content - -#### Scenario: Translation statistics in export -- **WHEN** translation result is saved -- **THEN** JSON SHALL include statistics object with: - - total_elements: count of all elements in document - - translated_elements: count of successfully translated elements - - skipped_elements: count of non-translatable elements (images, charts, etc.) - - total_characters: character count of translated text - - processing_time_seconds: translation duration - -#### Scenario: Table cell translation in export -- **WHEN** document contains tables -- **THEN** translation JSON SHALL represent table translations as: - ```json - { - "table_1_0": { - "cells": [ - {"row": 0, "col": 0, "content": "Translated cell text"}, - {"row": 0, "col": 1, "content": "Another cell"} - ] - } - } - ``` -- **AND** row/col positions SHALL match original table structure - -#### Scenario: Download translation result via API -- **WHEN** GET request to `/api/v2/translate/{task_id}/result?lang={lang}` -- **THEN** system SHALL return translation JSON content -- **AND** Content-Type SHALL be application/json -- **AND** response SHALL include appropriate cache headers - -#### Scenario: List available translations -- **WHEN** GET request to `/api/v2/tasks/{task_id}/translations` -- **THEN** system SHALL return list of available translation languages -- **AND** include translation metadata (translated_at, provider, statistics) diff --git a/openspec/changes/archive/2025-12-02-add-document-translation/specs/translation/spec.md b/openspec/changes/archive/2025-12-02-add-document-translation/specs/translation/spec.md deleted file mode 100644 index 3f2b8aa..0000000 --- a/openspec/changes/archive/2025-12-02-add-document-translation/specs/translation/spec.md +++ /dev/null @@ -1,184 +0,0 @@ -## ADDED Requirements - -### Requirement: Document Translation Service - -The system SHALL provide a document translation service that translates extracted text from OCR-processed documents into target languages using DIFY AI API. 
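-A sketch of the extraction step this requirement describes, assuming the dict-shaped UnifiedDocument fields used in this spec's examples (`pages`, `elements`, `element_id`, `type`, `content`) rather than the actual model classes:
-
-```python
-# Hedged sketch of translatable-element extraction; field names are assumed.
-TRANSLATABLE_TYPES = {"text", "title", "header", "footer", "paragraph", "footnote"}
-
-def extract_translatable(unified_doc: dict) -> dict:
-    """Map element_id to source text (or table cells) for translation."""
-    items: dict = {}
-    for page in unified_doc["pages"]:
-        for el in page["elements"]:
-            if el["type"] in TRANSLATABLE_TYPES and isinstance(el.get("content"), str):
-                items[el["element_id"]] = el["content"]
-            elif el["type"] == "table" and isinstance(el.get("content"), dict):
-                items[el["element_id"]] = {
-                    "cells": [
-                        {"row": c["row"], "col": c["col"], "content": c["content"]}
-                        for c in el["content"].get("cells", [])
-                    ]
-                }
-            # page_number, image, chart, logo, reference are skipped
-    return items
-```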
- -#### Scenario: Successful translation of Direct track document -- **GIVEN** a completed OCR task with Direct track processing -- **WHEN** user requests translation to English -- **THEN** the system extracts all translatable elements (text, title, header, footer, paragraph, footnote, table cells) -- **AND** translates them using DIFY AI API -- **AND** saves the result to `{task_id}_translated_en.json` - -#### Scenario: Successful translation of OCR track document -- **GIVEN** a completed OCR task with OCR track processing -- **WHEN** user requests translation to Japanese -- **THEN** the system extracts all translatable elements from UnifiedDocument format -- **AND** translates them preserving element_id mapping -- **AND** saves the result to `{task_id}_translated_ja.json` - -#### Scenario: Successful translation of Hybrid track document -- **GIVEN** a completed OCR task with Hybrid track processing -- **WHEN** translation is requested -- **THEN** the system processes the document using the same unified logic -- **AND** handles any combination of element types present - -#### Scenario: Table cell translation -- **GIVEN** a document containing table elements -- **WHEN** translation is requested -- **THEN** the system extracts text from each table cell -- **AND** translates each cell content individually -- **AND** preserves row/col position in the translation result - ---- - -### Requirement: Translation API Endpoints - -The system SHALL expose REST API endpoints for translation operations. - -#### Scenario: Start translation request -- **GIVEN** a completed OCR task with task_id -- **WHEN** POST request to `/api/v2/translate/{task_id}` with target_lang parameter -- **THEN** the system starts background translation process -- **AND** returns translation job status with 202 Accepted - -#### Scenario: Query translation status -- **GIVEN** an active translation job -- **WHEN** GET request to `/api/v2/translate/{task_id}/status` -- **THEN** the system returns current status (pending, translating, completed, failed) -- **AND** includes progress information (current_element, total_elements) - -#### Scenario: Retrieve translation result -- **GIVEN** a completed translation job -- **WHEN** GET request to `/api/v2/translate/{task_id}/result?lang={target_lang}` -- **THEN** the system returns the translation JSON content - -#### Scenario: Translation for non-existent task -- **GIVEN** an invalid or non-existent task_id -- **WHEN** translation is requested -- **THEN** the system returns 404 Not Found error - ---- - -### Requirement: DIFY API Integration - -The system SHALL integrate with DIFY AI service for translation. 
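-The error-handling scenario below requires up to 3 retries with exponential backoff; a minimal sketch of that policy wrapped around any single DIFY call:
-
-```python
-# Minimal sketch of the retry policy: up to 3 retries with exponential backoff.
-import time
-import httpx
-
-def call_with_retries(request_fn, max_retries: int = 3):
-    """request_fn performs one DIFY call; retried on timeouts and HTTP errors."""
-    for attempt in range(max_retries + 1):
-        try:
-            return request_fn()
-        except (httpx.TimeoutException, httpx.HTTPStatusError):
-            if attempt == max_retries:
-                raise
-            time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
-```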
- -#### Scenario: API request format -- **GIVEN** text to be translated -- **WHEN** calling DIFY API -- **THEN** the system sends POST request to `/chat-messages` endpoint -- **AND** includes query with translation prompt -- **AND** uses blocking response mode -- **AND** includes user identifier for tracking - -#### Scenario: API response handling -- **GIVEN** DIFY API returns translation response -- **WHEN** parsing the response -- **THEN** the system extracts translated text from `answer` field -- **AND** records usage statistics (tokens, latency) - -#### Scenario: API error handling -- **GIVEN** DIFY API returns error or times out -- **WHEN** handling the error -- **THEN** the system retries up to 3 times with exponential backoff -- **AND** returns appropriate error message if all retries fail - -#### Scenario: API rate limiting -- **GIVEN** high volume of translation requests -- **WHEN** requests approach rate limits -- **THEN** the system queues requests appropriately -- **AND** provides feedback about wait times - ---- - -### Requirement: Translation Prompt Format - -The system SHALL use structured prompts for translation requests. - -#### Scenario: Generate translation prompt -- **GIVEN** source text to translate -- **WHEN** preparing DIFY API request -- **THEN** the system formats prompt as: - ``` - Translate the following text to {language}. - Return ONLY the translated text, no explanations. - - {text} - ``` - -#### Scenario: Language name mapping -- **GIVEN** language code like "zh-TW" or "ja" -- **WHEN** constructing translation prompt -- **THEN** the system maps to full language name (Traditional Chinese, Japanese) - ---- - -### Requirement: Translation Progress Reporting - -The system SHALL provide real-time progress feedback during translation. - -#### Scenario: Progress during multi-element translation -- **GIVEN** a document with 50 translatable elements -- **WHEN** user queries status -- **THEN** the system returns progress like `{"status": "translating", "current_element": 25, "total_elements": 50}` - -#### Scenario: Translation starting status -- **GIVEN** translation job just started -- **WHEN** user queries status -- **THEN** the system returns `{"status": "pending"}` - ---- - -### Requirement: Translation Result Storage - -The system SHALL store translation results as independent JSON files. - -#### Scenario: Save translation result -- **GIVEN** translation completes successfully -- **WHEN** saving results -- **THEN** the system creates `{original_filename}_translated_{lang}.json` -- **AND** includes schema_version, metadata, and translations dict - -#### Scenario: Multiple language translations -- **GIVEN** a document translated to English and Japanese -- **WHEN** checking result files -- **THEN** both `xxx_translated_en.json` and `xxx_translated_ja.json` exist -- **AND** original `xxx_result.json` is unchanged - ---- - -### Requirement: Language Support - -The system SHALL support common languages through DIFY AI service. 
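-A sketch of the prompt construction these scenarios describe, reusing the design's LANGUAGE_NAMES mapping (abbreviated here) with a fallback to the raw language code:
-
-```python
-# Sketch of prompt construction; LANGUAGE_NAMES is abbreviated from the design.
-LANGUAGE_NAMES = {
-    "en": "English",
-    "zh-TW": "Traditional Chinese",
-    "zh-CN": "Simplified Chinese",
-    "ja": "Japanese",
-    "ko": "Korean",
-}
-
-def build_translation_prompt(text: str, target_lang: str) -> str:
-    language = LANGUAGE_NAMES.get(target_lang, target_lang)  # fall back to raw code
-    return (
-        f"Translate the following text to {language}.\n"
-        "Return ONLY the translated text, no explanations.\n\n"
-        f"{text}"
-    )
-```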
- -#### Scenario: Common language translation -- **GIVEN** target language is English, Chinese, Japanese, or Korean -- **WHEN** translation is requested -- **THEN** the system includes appropriate language name in prompt -- **AND** executes translation successfully - -#### Scenario: Automatic source language detection -- **GIVEN** source_lang is set to "auto" -- **WHEN** translation is executed -- **THEN** the AI model automatically detects source language -- **AND** translates to target language - -#### Scenario: Supported languages list -- **GIVEN** user queries supported languages -- **WHEN** checking language support -- **THEN** the system provides list including: - - English (en) - - Traditional Chinese (zh-TW) - - Simplified Chinese (zh-CN) - - Japanese (ja) - - Korean (ko) - - German (de) - - French (fr) - - Spanish (es) - - Portuguese (pt) - - Italian (it) - - Russian (ru) - - Vietnamese (vi) - - Thai (th) diff --git a/openspec/changes/archive/2025-12-02-add-document-translation/tasks.md b/openspec/changes/archive/2025-12-02-add-document-translation/tasks.md deleted file mode 100644 index db4fd3f..0000000 --- a/openspec/changes/archive/2025-12-02-add-document-translation/tasks.md +++ /dev/null @@ -1,121 +0,0 @@ -# Implementation Tasks - -## 1. Backend - DIFY Client - -- [x] 1.1 Create DIFY client (`backend/app/services/dify_client.py`) - - HTTP client with httpx - - Base URL: `https://dify.theaken.com/v1` - - API Key configuration - - `translate(text, target_lang)` and `translate_batch(texts, target_lang)` methods - - Error handling and retry logic (3 retries, exponential backoff) - -- [x] 1.2 Add translation prompt template - - Format: "Translate the following text to {language}. Return ONLY the translated text, no explanations.\n\n{text}" - - Batch format with numbered markers [1], [2], [3]... - - Language name mapping (en → English, zh-TW → Traditional Chinese, etc.) - -## 2. Backend - Translation Service - -- [x] 2.1 Rewrite translation service (`backend/app/services/translation_service.py`) - - Use DIFY client instead of local model - - Element extraction from UnifiedDocument (all track types) - - Batch translation (MAX_BATCH_CHARS=5000, MAX_BATCH_ITEMS=20) - - Result parsing and element_id mapping - -- [x] 2.2 Create translation result JSON writer - - Schema version, metadata, translations dict - - Table cell handling with row/col positions - - Save to `{task_id}_translated_{lang}.json` - - Include usage statistics (tokens, latency, batch_count) - -- [x] 2.3 Add translatable element type handling - - Text types: `text`, `title`, `header`, `footer`, `paragraph`, `footnote` - - Table: Extract and translate `cells[].content` - - Skip: `page_number`, `image`, `chart`, `logo`, `reference` - -## 3. 
Backend - API Endpoints - -- [x] 3.1 Create/Update translation router (`backend/app/routers/translate.py`) - - POST `/api/v2/translate/{task_id}` - Start translation - - GET `/api/v2/translate/{task_id}/status` - Get progress - - GET `/api/v2/translate/{task_id}/result` - Get translation result - - GET `/api/v2/translate/{task_id}/translations` - List available translations - - DELETE `/api/v2/translate/{task_id}/translations/{lang}` - Delete translation - -- [x] 3.2 Implement background task processing - - Use FastAPI BackgroundTasks for async translation - - Status tracking (pending, translating, completed, failed) - - Progress reporting (current element / total elements) - -- [x] 3.3 Add translation schemas (`backend/app/schemas/translation.py`) - - TranslationRequest (task_id, target_lang) - - TranslationStatusResponse (status, progress, error) - - TranslationListResponse (translations, statistics) - -- [x] 3.4 Register router in main app - -## 4. Frontend - UI Updates - -- [x] 4.1 Enable translation UI in TaskDetailPage - - Translation state management - - Language selector connected to state - -- [x] 4.2 Add translation progress display - - Progress tracking - - Status polling (translating element X/Y) - - Error handling and display - -- [x] 4.3 Update API service - - Implement startTranslation method - - Add polling for translation status - - Handle translation result - -- [x] 4.4 Add translation complete state - - Show success message - - Display available translated versions - -## 5. Testing - -Use existing JSON files in `backend/storage/results/` for testing. - -Available test samples: -- Direct track: `1c94bfbf-*/edit_result.json`, `8eedd9ed-*/ppt_result.json` -- OCR track: `c85fff69-*/scan_result.json`, `ca2b59a3-*/img3_result.json` -- Hybrid track: `1484ba43-*/edit2_result.json` - -- [x] 5.1 Unit tests for DIFY client - - Test with real API calls (no mocks) - - Test retry logic on timeout - -- [x] 5.2 Unit tests for translation service - - Element extraction from existing result.json files (10 tests pass) - - Result parsing and element_id mapping - - Table cell extraction and translation - -- [x] 5.3 Integration tests for API endpoints - - Start translation with existing task_id - - Status polling during translation - - Result retrieval after completion - -- [x] 5.4 Manual E2E verification - - Translate Direct track document (edit_result.json → zh-TW) ✓ - - Verified translation quality and JSON structure - -## 6. Configuration - -- [x] 6.1 Add DIFY configuration (hardcoded in dify_client.py) - - `DIFY_BASE_URL`: https://dify.theaken.com/v1 - - `DIFY_API_KEY`: app-YOPrF2ro5fshzMkCZviIuUJd - - `DIFY_TIMEOUT`: 120 seconds - - `DIFY_MAX_RETRIES`: 3 - - `MAX_BATCH_CHARS`: 5000 - - `MAX_BATCH_ITEMS`: 20 - -## 7. Documentation - -- [ ] 7.1 Update API documentation - - Add translation endpoints to OpenAPI spec - -- [ ] 7.2 Add DIFY setup instructions - - API key configuration - - Rate limiting considerations diff --git a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/design.md b/openspec/changes/archive/2025-12-02-add-translated-pdf-export/design.md deleted file mode 100644 index 2690cb0..0000000 --- a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/design.md +++ /dev/null @@ -1,91 +0,0 @@ -# Design: Add Translated PDF Export - -## Context - -The Tool_OCR project has implemented document translation using DIFY AI API, producing JSON files with translated content mapped by element_id. 
The existing PDF generator (`PDFGeneratorService`) can generate layout-preserving PDFs from UnifiedDocument but has no translation support. - -**Key Constraint**: The PDF generator uses element_id to position content. Translation JSON uses the same element_id mapping, making merging straightforward. - -## Goals / Non-Goals - -**Goals:** -- Generate PDF with translated text preserving original layout -- Support all processing tracks (DIRECT, OCR, HYBRID) -- Maintain backward compatibility with existing PDF export -- Support table cell translation rendering - -**Non-Goals:** -- Font optimization for target language scripts -- Interactive editing of translations -- Bilingual PDF output (original + translated side-by-side) - -## Decisions - -### Decision 1: Translation Merge Strategy - -**What**: Merge translation data into UnifiedDocument in-memory before PDF generation. - -**Why**: This approach: -- Reuses existing PDF rendering logic unchanged -- Keeps translation and PDF generation decoupled -- Allows easy testing of merged document - -**Implementation**: -```python -def apply_translations( - unified_doc: UnifiedDocument, - translations: Dict[str, Any] -) -> UnifiedDocument: - """Apply translations to UnifiedDocument, returning modified copy""" - doc_copy = unified_doc.copy(deep=True) - for page in doc_copy.pages: - for element in page.elements: - if element.element_id in translations: - translation = translations[element.element_id] - if isinstance(translation, str): - element.content = translation - elif isinstance(translation, dict) and 'cells' in translation: - # Handle table cells - apply_table_translation(element, translation) - return doc_copy -``` - -**Alternatives considered**: -- Modify PDF generator to accept translations directly - Would require significant refactoring -- Generate overlay PDF with translations - Complex positioning logic - -### Decision 2: API Endpoint Design - -**What**: Add `POST /api/v2/translate/{task_id}/pdf?lang={target_lang}` endpoint. - -**Why**: -- Consistent with existing `/translate/{task_id}` pattern -- POST allows future expansion for PDF options -- Clear separation from existing `/download/pdf` endpoint - -**Response**: Binary PDF file with `application/pdf` content-type. - -### Decision 3: Frontend Integration - -**What**: Add conditional "Download Translated PDF" button in TaskDetailPage. - -**Why**: -- Only show when translation is complete -- Use existing download pattern from PDF export - -## Risks / Trade-offs - -| Risk | Mitigation | -|------|------------| -| Large documents may timeout | Use existing async pattern, add progress tracking | -| Font rendering for CJK scripts | Rely on existing NotoSansSC font registration | -| Translation missing for some elements | Use original content as fallback | - -## Migration Plan - -No migration needed - additive feature only. - -## Open Questions - -1. Should we support downloading multiple translated PDFs in batch? -2. Should translated PDF filename include source language as well as target? diff --git a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/proposal.md b/openspec/changes/archive/2025-12-02-add-translated-pdf-export/proposal.md deleted file mode 100644 index bb0c37f..0000000 --- a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/proposal.md +++ /dev/null @@ -1,29 +0,0 @@ -# Change: Add Translated PDF Export - -## Why - -The current translation feature produces JSON output files (`{filename}_translated_{lang}.json`) but does not support generating translated PDFs. 
Users need to download translated documents in PDF format with the original layout preserved but with translated text content. This is essential for document localization workflows where the final deliverable must be a properly formatted PDF. - -## What Changes - -- **PDF Generator**: Add translation parameter support to `PDFGeneratorService` -- **Translation Merger**: Create logic to merge translation JSON with UnifiedDocument -- **API Endpoint**: Add `POST /api/v2/translate/{task_id}/pdf` endpoint -- **Frontend UI**: Add "Download Translated PDF" button in TaskDetailPage -- **Batch Translation Enhancement**: Improve batch response parsing for edge cases - -## Impact - -- **Affected specs**: `translation`, `result-export` -- **Affected code**: - - `backend/app/services/pdf_generator_service.py` - Add translation rendering - - `backend/app/services/translation_service.py` - Add PDF generation integration - - `backend/app/routers/translate.py` - Add PDF download endpoint - - `frontend/src/pages/TaskDetailPage.tsx` - Add PDF download button - - `frontend/src/services/apiV2.ts` - Add PDF download API method - -## Non-Goals - -- Editing translated text before PDF export (future feature) -- Supporting formats other than PDF (Excel, Word) -- Font substitution for different target languages diff --git a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/specs/result-export/spec.md b/openspec/changes/archive/2025-12-02-add-translated-pdf-export/specs/result-export/spec.md deleted file mode 100644 index c9b4b4c..0000000 --- a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/specs/result-export/spec.md +++ /dev/null @@ -1,55 +0,0 @@ -## ADDED Requirements - -### Requirement: Translated PDF Export API - -The system SHALL expose an API endpoint for downloading translated documents as PDF files. - -#### Scenario: Download translated PDF via API -- **GIVEN** a task with completed translation to English -- **WHEN** POST request to `/api/v2/translate/{task_id}/pdf?lang=en` -- **THEN** system returns PDF file with translated content -- **AND** Content-Type is `application/pdf` -- **AND** Content-Disposition suggests filename like `{task_id}_translated_en.pdf` - -#### Scenario: Download translated PDF with layout preservation -- **WHEN** user downloads translated PDF -- **THEN** the PDF maintains original document layout -- **AND** text positions match original document coordinates -- **AND** images and tables appear at original positions - -#### Scenario: Invalid language parameter -- **GIVEN** a task with translation only to English -- **WHEN** user requests PDF with `lang=ja` (Japanese) -- **THEN** system returns 404 Not Found -- **AND** response includes available languages in error message - -#### Scenario: Task not found -- **GIVEN** non-existent task_id -- **WHEN** user requests translated PDF -- **THEN** system returns 404 Not Found - ---- - -### Requirement: Frontend Translated PDF Download - -The frontend SHALL provide UI controls for downloading translated PDFs. 
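-Before the frontend scenarios, a sketch of the server side of the export API specified above; `generate_translated_pdf` stands in for the real service wiring and is hypothetical:
-
-```python
-# Hedged sketch of the translated-PDF download endpoint.
-from fastapi import APIRouter, HTTPException
-from fastapi.responses import FileResponse
-
-router = APIRouter(prefix="/api/v2/translate")
-
-def generate_translated_pdf(task_id: str, lang: str):
-    """Hypothetical service hook; returns a PDF path, or None if no translation."""
-    ...
-
-@router.post("/{task_id}/pdf")
-def download_translated_pdf(task_id: str, lang: str):
-    pdf_path = generate_translated_pdf(task_id, lang)
-    if pdf_path is None:
-        raise HTTPException(status_code=404, detail=f"No translation for lang={lang}")
-    return FileResponse(
-        pdf_path,
-        media_type="application/pdf",
-        filename=f"{task_id}_translated_{lang}.pdf",
-    )
-```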
- -#### Scenario: Show download button when translation complete -- **GIVEN** a task with translation status "completed" -- **WHEN** user views TaskDetailPage -- **THEN** page displays "Download Translated PDF" button -- **AND** button shows target language (e.g., "Download Translated PDF (English)") - -#### Scenario: Hide download button when no translation -- **GIVEN** a task without any completed translations -- **WHEN** user views TaskDetailPage -- **THEN** "Download Translated PDF" button is not shown - -#### Scenario: Download progress indication -- **GIVEN** user clicks "Download Translated PDF" button -- **WHEN** PDF generation is in progress -- **THEN** button shows loading state -- **AND** prevents double-click -- **WHEN** download completes -- **THEN** browser downloads PDF file -- **AND** button returns to normal state diff --git a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/specs/translation/spec.md b/openspec/changes/archive/2025-12-02-add-translated-pdf-export/specs/translation/spec.md deleted file mode 100644 index 0765b05..0000000 --- a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/specs/translation/spec.md +++ /dev/null @@ -1,72 +0,0 @@ -## ADDED Requirements - -### Requirement: Translated PDF Generation - -The system SHALL support generating PDF files with translated content while preserving the original document layout. - -#### Scenario: Generate translated PDF from Direct track document -- **GIVEN** a completed translation for a Direct track processed document -- **WHEN** user requests translated PDF via `POST /api/v2/translate/{task_id}/pdf?lang={target_lang}` -- **THEN** the system loads the translation JSON file -- **AND** merges translations with UnifiedDocument by element_id -- **AND** generates PDF with translated text at original positions -- **AND** returns PDF file with Content-Type `application/pdf` - -#### Scenario: Generate translated PDF from OCR track document -- **GIVEN** a completed translation for an OCR track processed document -- **WHEN** user requests translated PDF -- **THEN** the system generates PDF preserving all OCR layout information -- **AND** replaces original text with translated content -- **AND** maintains table structure with translated cell content - -#### Scenario: Handle missing translations gracefully -- **GIVEN** a translation JSON missing some element_id entries -- **WHEN** generating translated PDF -- **THEN** the system uses original content for missing translations -- **AND** logs warning for each fallback -- **AND** completes PDF generation successfully - -#### Scenario: Translated PDF for incomplete translation -- **GIVEN** a task with translation status "pending" or "translating" -- **WHEN** user requests translated PDF -- **THEN** the system returns 400 Bad Request -- **AND** includes error message indicating translation not complete - -#### Scenario: Translated PDF for non-existent translation -- **GIVEN** a task that has not been translated to requested language -- **WHEN** user requests translated PDF with `lang=fr` -- **THEN** the system returns 404 Not Found -- **AND** includes error message indicating no translation for language - ---- - -### Requirement: Translation Merge Service - -The system SHALL provide a service to merge translation data with UnifiedDocument. 
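-The design's `apply_translations()` sketch defers table handling to an `apply_table_translation` helper it does not define; one possible shape, assuming table element content carries a `cells` list with row/col/content fields:
-
-```python
-# Possible shape of the apply_table_translation helper named in the design
-# (hypothetical); assumes table content is a dict with a "cells" list.
-def apply_table_translation(element, translation: dict) -> None:
-    """Overwrite matching cell content in place; unmatched cells keep their text."""
-    translated = {
-        (c["row"], c["col"]): c["content"]
-        for c in translation.get("cells", [])
-    }
-    for cell in element.content.get("cells", []):
-        key = (cell["row"], cell["col"])
-        if key in translated:
-            cell["content"] = translated[key]
-```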
#### Scenario: Merge text element translations
- **GIVEN** a UnifiedDocument with text elements
- **AND** a translation JSON with matching element_ids
- **WHEN** applying translations
- **THEN** the system replaces the content field for each matched element
- **AND** preserves all other element properties (bounding_box, style_info, etc.)

#### Scenario: Merge table cell translations
- **GIVEN** a UnifiedDocument containing table elements
- **AND** a translation JSON with table_cell translations like:
  ```json
  {
    "table_1_0": {
      "cells": [{"row": 0, "col": 0, "content": "Translated"}]
    }
  }
  ```
- **WHEN** applying translations
- **THEN** the system updates cell content at matching row/col positions
- **AND** preserves cell structure and styling

#### Scenario: Non-destructive merge operation
- **GIVEN** a UnifiedDocument
- **WHEN** applying translations
- **THEN** the system creates a modified copy
- **AND** the original UnifiedDocument remains unchanged
diff --git a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/tasks.md b/openspec/changes/archive/2025-12-02-add-translated-pdf-export/tasks.md deleted file mode 100644 index 5284944..0000000 --- a/openspec/changes/archive/2025-12-02-add-translated-pdf-export/tasks.md +++ /dev/null @@ -1,40 +0,0 @@
# Tasks: Add Translated PDF Export

## 1. Backend - Translation Merger Service

- [x] 1.1 Create `apply_translations()` function in `translation_service.py`
- [x] 1.2 Implement table cell translation merging logic
- [x] 1.3 Add unit tests for translation merging

## 2. Backend - PDF Generator Enhancement

- [x] 2.1 Add `generate_translated_pdf()` method to `PDFGeneratorService`
- [x] 2.2 Load translation JSON and merge with UnifiedDocument
- [x] 2.3 Handle missing translations gracefully (fallback to original)
- [x] 2.4 Add unit tests for translated PDF generation

## 3. Backend - API Endpoint

- [x] 3.1 Add `POST /api/v2/translate/{task_id}/pdf` endpoint in `translate.py`
- [x] 3.2 Validate task exists and has completed translation
- [x] 3.3 Return appropriate errors (404 if no translation, 400 if task not complete)
- [x] 3.4 Add endpoint tests

## 4. Frontend - UI Integration

- [x] 4.1 Add `downloadTranslatedPdf()` method to `apiV2.ts`
- [x] 4.2 Add "Download Translated PDF" button in `TaskDetailPage.tsx`
- [x] 4.3 Show button only when translation status is "completed"
- [x] 4.4 Add loading state during PDF generation

## 5. Testing & Validation

- [x] 5.1 End-to-end test: translate document then download PDF
- [x] 5.2 Test with Direct track document
- [x] 5.3 Test with OCR track document
- [x] 5.4 Test with document containing tables

## 6. Documentation

- [ ] 6.1 Update API documentation with new endpoint
- [ ] 6.2 Add usage example in README if applicable
diff --git a/openspec/changes/archive/2025-12-02-unify-environment-scripts/design.md b/openspec/changes/archive/2025-12-02-unify-environment-scripts/design.md deleted file mode 100644 index 7f124e7..0000000 --- a/openspec/changes/archive/2025-12-02-unify-environment-scripts/design.md +++ /dev/null @@ -1,96 +0,0 @@
# Design: Unify Environment Scripts

## Context

Tool_OCR is an OCR processing system with a FastAPI backend and a React frontend. Environment setup and service startup are currently spread across multiple scripts:

- `setup_dev_env.sh` (341 lines) - complete development environment setup
- `start_backend.sh` (60 lines) - backend service startup
- `start_frontend.sh` (41 lines) - frontend service startup
- `download_fonts.sh` - font download (optional)

**Current Pain Points:**
1. Two terminals are needed to start the backend and frontend separately
2. Environment variable configuration is scattered across multiple `.env` files
3. No service status check or graceful shutdown mechanism
4. GPU configuration must be verified manually
## Goals / Non-Goals

**Goals:**
- Simplify the development environment setup flow
- Unify backend and frontend startup into a single command
- Improve error messages and the troubleshooting experience
- Ensure cross-platform compatibility (Ubuntu, WSL2)

**Non-Goals:**
- Containerization (Docker) - reserved for a future proposal
- Production deployment scripts
- Automated test environment setup

## Decisions

### 1. Unified Startup Script Design

**Decision:** Use a single `start.sh` script that supports multiple modes

```bash
# Usage examples:
./start.sh            # Start both backend and frontend
./start.sh backend    # Start only backend
./start.sh frontend   # Start only frontend
./start.sh --stop     # Stop all services
./start.sh --status   # Show service status
```

**Rationale:** Keep it simple: manage child processes with bash built-ins rather than introducing extra dependencies such as PM2 or supervisord.

### 2. Process Management

**Decision:** Track processes with PID files and support graceful shutdown

- PID files stored in `.pid/` directory (gitignored)
- SIGTERM for graceful shutdown
- Fallback to SIGKILL after timeout

### 3. Environment Variable Strategy

**Decision:** Keep separate `.env` files, but improve the documentation

- Root `.env.local` - shared configuration (database, API keys)
- Frontend `.env.local` - frontend-specific configuration (VITE_* variables)
- `.env.example` files updated with all required variables

**Rationale:** The Vite frontend requires variables with the `VITE_` prefix; mixing frontend and backend variables causes confusion.

### 4. Package Dependencies

**Current State:**
- Python: 77 packages in requirements.txt
- Node.js: 17 dependencies + 15 devDependencies

**Key Dependencies to Audit:**
- PaddlePaddle GPU vs CPU selection
- Unused development tools
- Duplicate functionality packages

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Breaking existing workflows | Keep old scripts as aliases |
| Complex process management | Start simple, add features incrementally |
| WSL2 compatibility issues | Test on clean WSL2 Ubuntu 22.04 |

## Migration Plan

1. Create new `start.sh` alongside existing scripts
2. Update documentation to recommend new script
3. After validation, deprecate old scripts (keep for one release)
4. Remove old scripts in future version

## Open Questions

1. Should we add systemd service files for production?
2. Should we include database migration in startup?
3. Should we add health check retries before opening browser?
diff --git a/openspec/changes/archive/2025-12-02-unify-environment-scripts/proposal.md b/openspec/changes/archive/2025-12-02-unify-environment-scripts/proposal.md deleted file mode 100644 index a1a0e79..0000000 --- a/openspec/changes/archive/2025-12-02-unify-environment-scripts/proposal.md +++ /dev/null @@ -1,25 +0,0 @@
# Change: Unify Environment Setup and Startup Scripts

## Why

Currently the project has multiple independent scripts for environment setup and service startup, requiring developers to run multiple commands to fully start the project. This increases the onboarding barrier and may lead to configuration inconsistencies.
- -## What Changes - -- **Audit and update package lists**: Review and clean up `requirements.txt` and `frontend/package.json` -- **Consolidate environment files**: Merge `.env` configuration across root and frontend directories -- **Update setup script**: Improve `setup_dev_env.sh` with better error handling and dependency validation -- **Create unified startup script**: Combine `start_backend.sh` and `start_frontend.sh` into a single `start.sh` with options - -## Impact - -- Affected files: - - `requirements.txt` - Python dependencies audit - - `frontend/package.json` - Node.js dependencies audit - - `setup_dev_env.sh` - Updated environment setup - - `start_backend.sh` - Will be replaced by unified script - - `start_frontend.sh` - Will be replaced by unified script - - `start.sh` - New unified startup script - - `.env*` files - Environment configuration cleanup diff --git a/openspec/changes/archive/2025-12-02-unify-environment-scripts/specs/development-environment/spec.md b/openspec/changes/archive/2025-12-02-unify-environment-scripts/specs/development-environment/spec.md deleted file mode 100644 index 5eefcdc..0000000 --- a/openspec/changes/archive/2025-12-02-unify-environment-scripts/specs/development-environment/spec.md +++ /dev/null @@ -1,73 +0,0 @@ -# Development Environment - -This capability defines the development environment setup and service management for Tool_OCR. - -## ADDED Requirements - -### Requirement: Unified Service Startup - -The system SHALL provide a single startup script that can launch all services or individual components. - -#### Scenario: Start all services - -- **WHEN** developer runs `./start.sh` -- **THEN** both backend and frontend services start -- **AND** service URLs are displayed - -#### Scenario: Start only backend - -- **WHEN** developer runs `./start.sh backend` -- **THEN** only the backend service starts on port 8000 - -#### Scenario: Start only frontend - -- **WHEN** developer runs `./start.sh frontend` -- **THEN** only the frontend service starts on port 5173 - -### Requirement: Service Process Management - -The system SHALL provide commands to check status and stop running services. - -#### Scenario: Check service status - -- **WHEN** developer runs `./start.sh --status` -- **THEN** the script displays which services are running with their PIDs - -#### Scenario: Stop all services - -- **WHEN** developer runs `./start.sh --stop` -- **THEN** all running services are gracefully terminated - -### Requirement: Environment Setup Validation - -The setup script SHALL validate that all required dependencies are correctly installed. - -#### Scenario: Validate Python environment - -- **WHEN** setup script runs -- **THEN** Python version is checked (3.10+) -- **AND** virtual environment is created or verified -- **AND** all pip packages are installed - -#### Scenario: Validate Node.js environment - -- **WHEN** setup script runs -- **THEN** Node.js LTS version is installed via nvm -- **AND** npm dependencies are installed - -#### Scenario: Validate GPU support (optional) - -- **WHEN** setup script runs on a system with NVIDIA GPU -- **THEN** CUDA version is detected -- **AND** appropriate PaddlePaddle GPU version is installed -- **AND** GPU availability is verified - -### Requirement: Environment Configuration - -The system SHALL use `.env.example` files to document all required environment variables. 
- -#### Scenario: New developer setup - -- **WHEN** developer clones the repository -- **THEN** `.env.example` files document all required variables -- **AND** developer copies to `.env.local` and fills in values diff --git a/openspec/changes/archive/2025-12-02-unify-environment-scripts/tasks.md b/openspec/changes/archive/2025-12-02-unify-environment-scripts/tasks.md deleted file mode 100644 index 688c9fa..0000000 --- a/openspec/changes/archive/2025-12-02-unify-environment-scripts/tasks.md +++ /dev/null @@ -1,31 +0,0 @@ -# Tasks: Unify Environment Scripts - -## 1. Package Audit and Update -- [x] 1.1 Review Python dependencies in `requirements.txt` for unused/outdated packages -- [x] 1.2 Review Node.js dependencies in `frontend/package.json` -- [x] 1.3 Document system-level dependencies (apt packages, nvm, pandoc, etc.) -- [x] 1.4 Update `requirements.txt` with corrected versions and comments - -## 2. Environment Configuration Cleanup -- [x] 2.1 Audit `.env`, `.env.local`, `.env.example` files in root and frontend -- [x] 2.2 Consolidate duplicate environment variables -- [x] 2.3 Update `.env.example` with all required variables -- [x] 2.4 Remove sensitive data from version control (if any) - -## 3. Update Setup Script -- [x] 3.1 Add pre-flight checks (disk space, memory, existing installations) -- [x] 3.2 Improve error messages with solutions -- [x] 3.3 Add optional component installation (e.g., skip GPU detection if not needed) -- [x] 3.4 Add database initialization step to setup script -- [ ] 3.5 Test setup on clean Ubuntu 22.04/WSL2 environment - -## 4. Create Unified Startup Script -- [x] 4.1 Create `start.sh` with subcommands (all/backend/frontend) -- [x] 4.2 Add process management (start, stop, restart, status) -- [x] 4.3 Add log output options (combined/separate) -- [x] 4.4 Preserve individual scripts for backwards compatibility (optional) - Removed old scripts -- [x] 4.5 Test unified script in development environment - -## 5. 
Documentation -- [ ] 5.1 Update README with new startup instructions (skipped per user request) -- [ ] 5.2 Add troubleshooting section for common issues (skipped per user request) diff --git a/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/design.md b/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/design.md deleted file mode 100644 index 6576214..0000000 --- a/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/design.md +++ /dev/null @@ -1,68 +0,0 @@ -# Design: Fix PDF Table Rendering - -## Context -OCR track produces tables with: -- `cell_boxes`: Accurate pixel coordinates for each cell border -- `cells`: Content with row/col indices and row_span/col_span -- `embedded_images`: Images within table cells - -Current implementations fail to use these correctly: -- **Reflow PDF**: Ignores merged cells, misaligns content -- **Translated Layout PDF**: Creates new Table object instead of using cell_boxes - -## Goals / Non-Goals - -**Goals:** -- Translated Layout PDF tables match untranslated Layout PDF quality -- Reflow PDF tables are readable and correctly structured -- Embedded images appear in both formats - -**Non-Goals:** -- Perfect pixel-level replication of original table styling -- Support for complex nested tables - -## Decisions - -### Decision 1: Translated Layout PDF uses Layered Rendering -**What**: Draw cell borders using `cell_boxes`, then render translated text in each cell separately -**Why**: This matches the working approach in `_draw_table_with_cell_boxes()` for untranslated PDFs - -```python -# Step 1: Draw borders using cell_boxes -for cell_box in cell_boxes: - pdf_canvas.rect(x, y, width, height) - -# Step 2: Render text for each cell -for cell in cells: - cell_bbox = find_matching_cell_box(cell, cell_boxes) - draw_text_in_bbox(translated_content, cell_bbox) -``` - -### Decision 2: Reflow PDF uses ReportLab SPAN for merged cells -**What**: Apply `('SPAN', (col1, row1), (col2, row2))` style for merged cells -**Why**: ReportLab's Table natively supports merged cells via TableStyle - -```python -# Build span commands from cell data -for cell in cells: - if cell.row_span > 1 or cell.col_span > 1: - spans.append(('SPAN', - (cell.col, cell.row), - (cell.col + cell.col_span - 1, cell.row + cell.row_span - 1))) -``` - -### Decision 3: Column widths from cell_boxes ratio -**What**: Calculate column widths proportionally from cell_boxes -**Why**: Preserves original table structure in reflow mode - -## Risks / Trade-offs - -| Risk | Mitigation | -|------|------------| -| Text overflow in translated cells | Shrink font (min 8pt) or truncate with ellipsis | -| cell_boxes not matching cells count | Fall back to equal-width columns | -| Complex merged cell patterns | Handle simple spans, skip complex patterns | - -## Open Questions -- Should reflow PDF preserve exact column width ratios or allow ReportLab auto-sizing? -- How to handle cells with both text and images? diff --git a/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/proposal.md b/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/proposal.md deleted file mode 100644 index 2650cf7..0000000 --- a/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/proposal.md +++ /dev/null @@ -1,18 +0,0 @@ -# Change: Fix PDF Table Rendering Issues - -## Why -OCR track PDF exports have significant table rendering problems: -1. **Reflow PDF** (both translated and untranslated): Tables are misaligned due to missing row_span/col_span support -2. 
**Translated Layout PDF**: Table borders disappear and text overlaps because it doesn't use the accurate `cell_boxes` positioning - -## What Changes -- **Translated Layout PDF**: Adopt layered rendering approach (borders + text separately) using `cell_boxes` from metadata -- **Reflow PDF Tables**: Fix cell extraction and add basic merged cell support -- Ensure embedded images in tables are rendered correctly in all PDF formats - -## Impact -- Affected specs: result-export -- Affected code: - - `backend/app/services/pdf_generator_service.py` - - `_draw_translated_table()` - needs complete rewrite - - `_create_reflow_table()` - needs merged cell support diff --git a/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/specs/result-export/spec.md b/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/specs/result-export/spec.md deleted file mode 100644 index 245c527..0000000 --- a/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/specs/result-export/spec.md +++ /dev/null @@ -1,41 +0,0 @@ -## MODIFIED Requirements - -### Requirement: Translated Layout PDF Generation -The system SHALL generate layout-preserving PDFs with translated content that maintain accurate table structure. - -#### Scenario: Table with accurate borders -- **GIVEN** an OCR result with tables containing `cell_boxes` metadata -- **WHEN** generating translated layout PDF -- **THEN** table cell borders SHALL be drawn at positions matching `cell_boxes` -- **AND** translated text SHALL be rendered within each cell's bounding box - -#### Scenario: Text overflow handling -- **GIVEN** translated text longer than original text -- **WHEN** text exceeds cell bounding box -- **THEN** the system SHALL reduce font size (minimum 8pt) to fit content -- **OR** truncate with ellipsis if minimum font size is insufficient - -#### Scenario: Embedded images in tables -- **GIVEN** a table with `embedded_images` in metadata -- **WHEN** generating translated layout PDF -- **THEN** images SHALL be rendered at their original positions within the table - -### Requirement: Reflow PDF Table Rendering -The system SHALL generate reflow PDFs with properly structured tables including merged cell support. - -#### Scenario: Basic table rendering -- **GIVEN** an OCR result with table cells containing `row`, `col`, `content` -- **WHEN** generating reflow PDF -- **THEN** cells SHALL be grouped by row and column indices -- **AND** table SHALL render with visible borders - -#### Scenario: Merged cells support -- **GIVEN** table cells with `row_span` or `col_span` greater than 1 -- **WHEN** generating reflow PDF -- **THEN** the system SHALL apply appropriate cell spanning -- **AND** merged cells SHALL display content without duplication - -#### Scenario: Column width calculation -- **GIVEN** a table with `cell_boxes` metadata -- **WHEN** generating reflow PDF -- **THEN** column widths SHOULD be proportional to original cell widths diff --git a/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/tasks.md b/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/tasks.md deleted file mode 100644 index 0926685..0000000 --- a/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/tasks.md +++ /dev/null @@ -1,20 +0,0 @@ -# Tasks: Fix PDF Table Rendering - -## 1. 
Translated Layout PDF - Table Fix (P0) -- [ ] 1.1 Refactor `_draw_translated_table()` to use layered rendering approach -- [ ] 1.2 Use `cell_boxes` from metadata for accurate border positioning -- [ ] 1.3 Render translated text within each cell's bbox using Paragraph with wordWrap -- [ ] 1.4 Handle text overflow (shrink font to minimum 8pt or truncate) -- [ ] 1.5 Draw embedded images at correct positions - -## 2. Reflow PDF - Table Fix (P1) -- [ ] 2.1 Fix `_create_reflow_table()` cell extraction from content dict -- [ ] 2.2 Add row_span/col_span handling using ReportLab SPAN style -- [ ] 2.3 Calculate proportional column widths based on cell_boxes -- [ ] 2.4 Embed images in table cells instead of after table - -## 3. Testing & Validation -- [ ] 3.1 Test with task 48b9e849-f6e3-462f-83a1-911ded701958 (has merged cells) -- [ ] 3.2 Verify translated layout PDF has visible borders -- [ ] 3.3 Verify reflow PDF tables align correctly -- [ ] 3.4 Verify embedded images appear in both formats diff --git a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md deleted file mode 100644 index 0eba861..0000000 --- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md +++ /dev/null @@ -1,167 +0,0 @@ -## Context - -The PDF generator currently uses layout preservation mode for all PDF output, placing text at original coordinates. This works for document reconstruction but: -1. Fails for translated content where text length differs significantly -2. May not provide the best reading experience for flowing documents - -Two PDF generation modes are needed: -1. **Layout Preservation** (existing): Maintains original coordinates -2. **Reflow Layout** (new): Prioritizes readability with flowing content - -## Goals / Non-Goals - -**Goals:** -- Translated and non-translated documents can use reflow layout -- Both OCR and Direct tracks supported -- Proper reading order preserved using available data -- Consistent font sizes for readability -- Images and tables embedded inline - -**Non-Goals:** -- Perfect visual matching with original document layout -- Complex multi-column reflow (simple single-column flow) -- Font style matching from original document - -## Decisions - -### Decision 1: Reading Order Strategy - -| Track | Reading Order Source | Implementation | -|-------|---------------------|----------------| -| **OCR** | Explicit `reading_order` array in JSON | Use array indices to order elements | -| **Direct** | Implicit in element list order | Use list iteration order (PyMuPDF sort=True) | - -**OCR Track - reading_order array:** -```json -{ - "pages": [{ - "reading_order": [0, 1, 2, 3, 6, 7, 8, ...], - "elements": [...] 
- }] -} -``` - -**Direct Track - implicit order:** -- PyMuPDF's `get_text("dict", sort=True)` provides spatial reading order -- Elements already sorted by extraction engine -- Optional: Enable `_sort_elements_for_reading_order()` for multi-column detection - -### Decision 2: Separate API Endpoints - -``` -# Layout preservation (existing) -GET /api/v2/tasks/{task_id}/download/pdf - -# Reflow layout (new) -GET /api/v2/tasks/{task_id}/download/pdf?format=reflow - -# Translated PDF (reflow only) -POST /api/v2/translate/{task_id}/pdf?lang={lang} -``` - -### Decision 3: Unified Reflow Generation Method - -```python -def generate_reflow_pdf( - self, - result_json_path: Path, - output_path: Path, - translation_json_path: Optional[Path] = None, # None = no translation - source_file_path: Optional[Path] = None, # For embedded images -) -> bool: - """ - Generate reflow layout PDF for either OCR or Direct track. - Works with or without translation. - """ -``` - -### Decision 4: Reading Order Application - -```python -def _get_elements_in_reading_order(self, page_data: dict) -> List[dict]: - """Get elements sorted by reading order.""" - elements = page_data.get('elements', []) - reading_order = page_data.get('reading_order') - - if reading_order: - # OCR track: use explicit reading order - ordered = [] - for idx in reading_order: - if 0 <= idx < len(elements): - ordered.append(elements[idx]) - return ordered - else: - # Direct track: elements already in reading order - return elements -``` - -### Decision 5: Consistent Typography - -| Element Type | Font Size | Style | -|-------------|-----------|-------| -| Title/H1 | 18pt | Bold | -| H2 | 16pt | Bold | -| H3 | 14pt | Bold | -| Body text | 12pt | Normal| -| Table cell | 10pt | Normal| -| Caption | 10pt | Italic| - -### Decision 6: Table Handling in Reflow - -Tables use Platypus Table with auto-width columns: - -```python -def _create_reflow_table(self, table_data, translations=None): - data = [] - for row in table_data['rows']: - row_data = [] - for cell in row['cells']: - text = cell.get('text', '') - if translations: - text = translations.get(cell.get('id'), text) - row_data.append(Paragraph(text, self.styles['TableCell'])) - data.append(row_data) - - table = Table(data) - table.setStyle(TableStyle([ - ('GRID', (0, 0), (-1, -1), 0.5, colors.black), - ('VALIGN', (0, 0), (-1, -1), 'TOP'), - ('PADDING', (0, 0), (-1, -1), 6), - ])) - return table -``` - -### Decision 7: Image Embedding - -```python -def _embed_image_reflow(self, element, max_width=450): - img_path = self._resolve_image_path(element) - if img_path and img_path.exists(): - img = Image(str(img_path)) - # Scale to fit page width - if img.drawWidth > max_width: - ratio = max_width / img.drawWidth - img.drawWidth = max_width - img.drawHeight *= ratio - return img - return Spacer(1, 0) -``` - -## Risks / Trade-offs - -- **Risk**: OCR reading_order may not be accurate for complex layouts - - **Mitigation**: Falls back to spatial sort if reading_order missing - -- **Risk**: Direct track multi-column detection unused - - **Mitigation**: PyMuPDF sort=True is generally reliable - -- **Risk**: Loss of visual fidelity compared to original - - **Mitigation**: This is acceptable; layout PDF still available - -## Migration Plan - -No migration needed - new functionality, existing behavior unchanged. - -## Open Questions - -None - design confirmed with user. 
diff --git a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/proposal.md b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/proposal.md deleted file mode 100644 index 96aa0a9..0000000 --- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/proposal.md +++ /dev/null @@ -1,41 +0,0 @@ -# Change: Reflow Layout PDF Export for All Tracks - -## Why - -When generating translated PDFs, text often doesn't fit within original bounding boxes due to language expansion/contraction differences. Additionally, users may want a readable flowing document format even without translation. - -**Example from task c79df0ad-f9a6-4c04-8139-13eaef25fa83:** -- Original Chinese: "华天科技(宝鸡)有限公司设备版块报价单" (19 characters) -- Translated English: "Huatian Technology (Baoji) Co., Ltd. Equipment Division Quotation" (65+ characters) -- Same bounding box: 703×111 pixels -- Current result: Font reduced to minimum (3pt), text unreadable - -## What Changes - -- **NEW**: Add reflow layout PDF generation for both OCR and Direct tracks -- Preserve semantic structure (headings, tables, lists) in reflow mode -- Use consistent, readable font sizes (12pt body, 16pt headings) -- Embed images inline within flowing content -- **IMPORTANT**: Original layout preservation PDF generation remains unchanged -- Support both tracks with proper reading order: - - **OCR track**: Use existing `reading_order` array from PP-StructureV3 - - **Direct track**: Use PyMuPDF's implicit order (with option for column detection) -- **FIX**: Remove outdated MADLAD-400 references from frontend (now uses Dify cloud translation) - -## Download Options - -| Scenario | Layout PDF | Reflow PDF | -|----------|------------|------------| -| **Without Translation** | Available | Available (NEW) | -| **With Translation** | - | Available (single option, unchanged) | - -## Impact - -- Affected specs: `specs/result-export/spec.md` -- Affected code: - - `backend/app/services/pdf_generator_service.py` - add reflow generation method - - `backend/app/routers/tasks.py` - add reflow PDF download endpoint - - `backend/app/routers/translate.py` - use reflow mode for translated PDFs - - `frontend/src/pages/TaskDetailPage.tsx`: - - Add "Download Reflow PDF" button for original documents - - Remove MADLAD-400 badge and outdated description text diff --git a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/specs/result-export/spec.md b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/specs/result-export/spec.md deleted file mode 100644 index 25f16a1..0000000 --- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/specs/result-export/spec.md +++ /dev/null @@ -1,137 +0,0 @@ -## ADDED Requirements - -### Requirement: Dual PDF Generation Modes - -The system SHALL support two distinct PDF generation modes to serve different use cases for both OCR and Direct tracks. 
- -#### Scenario: Download layout preservation PDF -- **WHEN** user requests PDF via `/api/v2/tasks/{task_id}/download/pdf` -- **THEN** PDF SHALL use layout preservation mode -- **AND** text positions SHALL match original document coordinates -- **AND** this option SHALL be available for both OCR and Direct tracks -- **AND** existing behavior SHALL remain unchanged - -#### Scenario: Download reflow layout PDF without translation -- **WHEN** user requests PDF via `/api/v2/tasks/{task_id}/download/pdf?format=reflow` -- **THEN** PDF SHALL use reflow layout mode -- **AND** text SHALL flow naturally with consistent font sizes -- **AND** body text SHALL use approximately 12pt font size -- **AND** headings SHALL use larger font sizes (14-18pt) -- **AND** this option SHALL be available for both OCR and Direct tracks - -#### Scenario: OCR track reading order in reflow mode -- **GIVEN** document processed via OCR track -- **WHEN** generating reflow PDF -- **THEN** system SHALL use explicit `reading_order` array from JSON -- **AND** elements SHALL appear in order specified by reading_order indices -- **AND** if reading_order is missing, fall back to spatial sort (y, x) - -#### Scenario: Direct track reading order in reflow mode -- **GIVEN** document processed via Direct track -- **WHEN** generating reflow PDF -- **THEN** system SHALL use implicit element order from extraction -- **AND** elements SHALL appear in list iteration order -- **AND** PyMuPDF's sort=True ordering SHALL be trusted - ---- - -### Requirement: Reflow PDF Semantic Structure - -The reflow PDF generation SHALL preserve document semantic structure. - -#### Scenario: Headings in reflow mode -- **WHEN** original document contains headings (title, h1, h2, etc.) -- **THEN** headings SHALL be rendered with larger font sizes -- **AND** headings SHALL be visually distinguished from body text -- **AND** heading hierarchy SHALL be preserved - -#### Scenario: Tables in reflow mode -- **WHEN** original document contains tables -- **THEN** tables SHALL render with visible cell borders -- **AND** column widths SHALL auto-adjust to content -- **AND** table content SHALL be fully visible -- **AND** tables SHALL use appropriate cell padding - -#### Scenario: Images in reflow mode -- **WHEN** original document contains images -- **THEN** images SHALL be embedded inline in flowing content -- **AND** images SHALL be scaled to fit page width if necessary -- **AND** images SHALL maintain aspect ratio - -#### Scenario: Lists in reflow mode -- **WHEN** original document contains numbered or bulleted lists -- **THEN** lists SHALL preserve their formatting -- **AND** list items SHALL flow naturally - ---- - -## MODIFIED Requirements - -### Requirement: Translated PDF Export API - -The system SHALL expose an API endpoint for downloading translated documents as PDF files using reflow layout mode only. 
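For reference, a client-side sketch of this endpoint using Python's requests library; the saved filename simply mirrors the Content-Disposition suggestion described below and is an assumption:

```python
import requests


def download_translated_pdf(base_url: str, task_id: str, lang: str = "en") -> str:
    """POST the translated-PDF endpoint and save the returned file locally."""
    resp = requests.post(f"{base_url}/api/v2/translate/{task_id}/pdf",
                         params={"lang": lang})
    if resp.status_code == 404:
        # Unknown task, or no completed translation for this language;
        # the error body lists the available languages
        raise ValueError(resp.text)
    resp.raise_for_status()
    assert "application/pdf" in resp.headers.get("Content-Type", "")
    out_path = f"{task_id}_translated_{lang}.pdf"  # mirrors Content-Disposition suggestion
    with open(out_path, "wb") as fh:
        fh.write(resp.content)
    return out_path
```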
- -#### Scenario: Download translated PDF via API -- **GIVEN** a task with completed translation -- **WHEN** POST request to `/api/v2/translate/{task_id}/pdf?lang={lang}` -- **THEN** system returns PDF file with translated content -- **AND** PDF SHALL use reflow layout mode (not layout preservation) -- **AND** Content-Type is `application/pdf` -- **AND** Content-Disposition suggests filename like `{task_id}_translated_{lang}.pdf` - -#### Scenario: Translated PDF uses reflow layout -- **WHEN** user downloads translated PDF -- **THEN** the PDF SHALL use reflow layout mode -- **AND** text SHALL flow naturally with consistent font sizes -- **AND** body text SHALL use approximately 12pt font size -- **AND** headings SHALL use larger font sizes (14-18pt) -- **AND** content SHALL be readable without magnification - -#### Scenario: Translated PDF for OCR track -- **GIVEN** document processed via OCR track with translation -- **WHEN** generating translated PDF -- **THEN** reading order SHALL follow `reading_order` array -- **AND** translated text SHALL replace original in correct positions - -#### Scenario: Translated PDF for Direct track -- **GIVEN** document processed via Direct track with translation -- **WHEN** generating translated PDF -- **THEN** reading order SHALL follow implicit element order -- **AND** translated text SHALL replace original in correct positions - -#### Scenario: Invalid language parameter -- **GIVEN** a task with translation only to English -- **WHEN** user requests PDF with `lang=ja` (Japanese) -- **THEN** system returns 404 Not Found -- **AND** response includes available languages in error message - -#### Scenario: Task not found -- **GIVEN** non-existent task_id -- **WHEN** user requests translated PDF -- **THEN** system returns 404 Not Found - ---- - -### Requirement: Frontend Download Options - -The frontend SHALL provide appropriate download options based on translation status. - -#### Scenario: Download options without translation -- **GIVEN** a task without any completed translations -- **WHEN** user views TaskDetailPage -- **THEN** page SHALL display "Download Layout PDF" button (original coordinates) -- **AND** page SHALL display "Download Reflow PDF" button (flowing layout) -- **AND** both options SHALL be available in the download section - -#### Scenario: Download options with translation -- **GIVEN** a task with completed translation -- **WHEN** user views TaskDetailPage -- **THEN** page SHALL display "Download Translated PDF" button for each language -- **AND** translated PDF button SHALL remain as single option (no Layout/Reflow choice) -- **AND** translated PDF SHALL automatically use reflow layout - -#### Scenario: Remove outdated MADLAD-400 references -- **WHEN** displaying translation section -- **THEN** page SHALL NOT display "MADLAD-400" badge -- **AND** description text SHALL reflect cloud translation service (Dify) -- **AND** description SHALL NOT mention local model loading time diff --git a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/tasks.md b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/tasks.md deleted file mode 100644 index ee49fb8..0000000 --- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/tasks.md +++ /dev/null @@ -1,30 +0,0 @@ -## 1. 
Backend Implementation - -- [x] 1.1 Create `generate_reflow_pdf()` method in pdf_generator_service.py -- [x] 1.2 Implement `_get_elements_in_reading_order()` for both tracks -- [x] 1.3 Implement reflow text rendering with consistent font sizes -- [x] 1.4 Implement table rendering in reflow mode (Platypus Table) -- [x] 1.5 Implement inline image embedding -- [x] 1.6 Add `format=reflow` query parameter to tasks download endpoint -- [x] 1.7 Update `generate_translated_pdf()` to use reflow mode - -## 2. Frontend Implementation - -- [x] 2.1 Add "Download Reflow PDF" button for original documents -- [x] 2.2 Update download logic to support format parameter -- [x] 2.3 Remove MADLAD-400 badge (line 545) -- [x] 2.4 Update translation description text to reflect Dify cloud service (line 652) - -## 3. Testing - -- [x] 3.1 Test OCR track reflow PDF (with reading_order) - Basic smoke test passed -- [ ] 3.2 Test Direct track reflow PDF (implicit order) - No test data available -- [x] 3.3 Test translated PDF (reflow mode) - Basic smoke test passed -- [x] 3.4 Test documents with tables - SUCCESS (62294 bytes, 2 tables) -- [x] 3.5 Test documents with images - SUCCESS (embedded img_in_table) -- [x] 3.6 Test multi-page documents - SUCCESS (11451 bytes, 3 pages) -- [x] 3.7 Verify layout PDF still works correctly - SUCCESS (104543 bytes) - -## 4. Documentation - -- [x] 4.1 Update spec with reflow layout requirements diff --git a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/design.md b/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/design.md deleted file mode 100644 index 9bcfd7e..0000000 --- a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/design.md +++ /dev/null @@ -1,458 +0,0 @@ -# Design: PDF Preprocessing Pipeline - -## Architecture Overview - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ DIRECT Track PDF Processing Pipeline │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ Input PDF │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ Step 0: GS Distillation (Exception Handler) │ │ -│ │ ─────────────────────────────────────────────────────────────────── │ │ -│ │ Trigger: (cid:xxxx) garble detected OR mupdf structural errors │ │ -│ │ Action: gs -sDEVICE=pdfwrite -dDetectDuplicateImages=true │ │ -│ │ Status: DISABLED by default, auto-triggered on errors │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ Step 1: Object-level Cleaning (P0 - Core) │ │ -│ │ ─────────────────────────────────────────────────────────────────── │ │ -│ │ 1.1 clean_contents(sanitize=True) - Fix malformed content stream │ │ -│ │ 1.2 Remove hidden OCG layers │ │ -│ │ 1.3 White-out detection & removal (IoU >= 80%) │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ Step 2: Layout Analysis (P1 - Rule-based) │ │ -│ │ ─────────────────────────────────────────────────────────────────── │ │ -│ │ 2.1 get_text("blocks", sort=True) - Column-aware sorting │ │ -│ │ 2.2 Classify elements (title/body/header/footer/page_number) │ │ -│ │ 2.3 Filter unwanted elements (page numbers, decorations) │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ 
┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ Step 3: Text Extraction (Enhanced) │ │ -│ │ ─────────────────────────────────────────────────────────────────── │ │ -│ │ 3.1 Extract text with bbox coordinates preserved │ │ -│ │ 3.2 Garble rate detection (cid:xxxx count / total chars) │ │ -│ │ 3.3 Auto-fallback: garble_rate > 10% → trigger Paddle OCR │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ UnifiedDocument (with bbox for debugging) │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - ---- - -## Step 0: GS Distillation (Exception Handler) - -### Purpose -Repair structurally damaged PDFs that PyMuPDF cannot parse correctly. - -### Trigger Conditions -```python -def should_trigger_gs_repair(page_text: str, mupdf_warnings: List[str]) -> bool: - # Condition 1: High garble rate (cid:xxxx patterns) - cid_pattern = r'\(cid:\d+\)' - cid_count = len(re.findall(cid_pattern, page_text)) - total_chars = len(page_text) - garble_rate = cid_count / max(total_chars, 1) - - if garble_rate > 0.1: # >10% garbled - return True - - # Condition 2: Severe structural errors - severe_errors = ['error', 'invalid', 'corrupt', 'damaged'] - for warning in mupdf_warnings: - if any(err in warning.lower() for err in severe_errors): - return True - - return False -``` - -### GS Command -```bash -gs -dNOPAUSE -dBATCH -dSAFER \ - -sDEVICE=pdfwrite \ - -dPDFSETTINGS=/prepress \ - -dDetectDuplicateImages=true \ - -sOutputFile=repaired.pdf \ - input.pdf -``` - -### Implementation Notes -- **Default**: DISABLED -- **Execution**: Only when triggered by error detection -- **Fallback**: If GS also fails, route to Paddle OCR track - ---- - -## Step 1: Object-level Cleaning (P0) - -### 1.1 Content Stream Sanitization -```python -def sanitize_page(page: fitz.Page) -> None: - """Fix malformed PDF content stream.""" - page.clean_contents(sanitize=True) -``` - -### 1.2 Hidden Layer (OCG) Removal -```python -def remove_hidden_layers(doc: fitz.Document) -> List[str]: - """Remove content from hidden Optional Content Groups.""" - removed_layers = [] - - ocgs = doc.get_ocgs() # Get all OCG definitions - for ocg_xref, ocg_info in ocgs.items(): - # Check if layer is hidden by default - if ocg_info.get('on') == False: - removed_layers.append(ocg_info.get('name', f'OCG_{ocg_xref}')) - # Mark for removal during extraction - - return removed_layers -``` - -### 1.3 White-out Detection (Core Algorithm) -```python -def detect_whiteout_covered_text(page: fitz.Page, iou_threshold: float = 0.8) -> List[dict]: - """ - Detect text covered by white rectangles ("white-out" / "correction tape" effect). - - Returns list of text words that should be excluded from extraction. 
- """ - covered_words = [] - - # Get all white-filled rectangles - drawings = page.get_drawings() - white_rects = [] - for d in drawings: - # Check for white fill (RGB all 1.0) - fill_color = d.get('fill') - if fill_color and fill_color == (1, 1, 1): - rect = d.get('rect') - if rect: - white_rects.append(fitz.Rect(rect)) - - if not white_rects: - return covered_words - - # Get all text words with bounding boxes - words = page.get_text("words") # Returns list of (x0, y0, x1, y1, word, block_no, line_no, word_no) - - for word_info in words: - word_rect = fitz.Rect(word_info[:4]) - word_text = word_info[4] - - for white_rect in white_rects: - # Calculate IoU (Intersection over Union) - intersection = word_rect & white_rect # Intersection - if intersection.is_empty: - continue - - intersection_area = intersection.width * intersection.height - word_area = word_rect.width * word_rect.height - - if word_area > 0: - coverage_ratio = intersection_area / word_area - if coverage_ratio >= iou_threshold: - covered_words.append({ - 'text': word_text, - 'bbox': tuple(word_rect), - 'coverage': coverage_ratio - }) - break # Word is covered, no need to check other rects - - return covered_words -``` - ---- - -## Step 2: Layout Analysis (P1) - -### 2.1 Column-aware Text Extraction -```python -def extract_with_reading_order(page: fitz.Page) -> List[dict]: - """ - Extract text blocks with correct reading order. - PyMuPDF's sort=True handles two-column layouts automatically. - """ - # CRITICAL: sort=True enables column-aware sorting - blocks = page.get_text("dict", sort=True)['blocks'] - return blocks -``` - -### 2.2 Element Classification -```python -def classify_element(block: dict, page_rect: fitz.Rect) -> str: - """ - Classify text block by position and font size. - - Returns: 'title', 'body', 'header', 'footer', 'page_number' - """ - if 'lines' not in block: - return 'image' - - bbox = fitz.Rect(block['bbox']) - page_height = page_rect.height - page_width = page_rect.width - - # Relative position (0.0 = top, 1.0 = bottom) - y_rel = bbox.y0 / page_height - - # Get average font size - font_sizes = [] - for line in block.get('lines', []): - for span in line.get('spans', []): - font_sizes.append(span.get('size', 12)) - avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12 - - # Get text content for pattern matching - text = ''.join( - span.get('text', '') - for line in block.get('lines', []) - for span in line.get('spans', []) - ).strip() - - # Classification rules - - # Header: top 5% of page - if y_rel < 0.05: - return 'header' - - # Footer: bottom 5% of page - if y_rel > 0.95: - return 'footer' - - # Page number: bottom 10% + numeric pattern - if y_rel > 0.90 and _is_page_number(text): - return 'page_number' - - # Title: large font (>14pt) or centered - if avg_font_size > 14: - return 'title' - - # Check if centered (for subtitles) - x_center = (bbox.x0 + bbox.x1) / 2 - page_center = page_width / 2 - if abs(x_center - page_center) < page_width * 0.1 and len(text) < 100: - if avg_font_size > 12: - return 'title' - - return 'body' - - -def _is_page_number(text: str) -> bool: - """Check if text is likely a page number.""" - text = text.strip() - - # Pure number - if text.isdigit(): - return True - - # Common patterns: "Page 1", "- 1 -", "1/10" - patterns = [ - r'^page\s*\d+$', - r'^-?\s*\d+\s*-?$', - r'^\d+\s*/\s*\d+$', - r'^第\s*\d+\s*頁$', - r'^第\s*\d+\s*页$', - ] - - for pattern in patterns: - if re.match(pattern, text, re.IGNORECASE): - return True - - return False -``` - -### 2.3 Element 
Filtering -```python -def filter_elements(blocks: List[dict], page_rect: fitz.Rect) -> List[dict]: - """Filter out unwanted elements (page numbers, headers, footers).""" - filtered = [] - - for block in blocks: - element_type = classify_element(block, page_rect) - - # Skip page numbers and optionally headers/footers - if element_type == 'page_number': - continue - - # Keep with classification metadata - block['_element_type'] = element_type - filtered.append(block) - - return filtered -``` - ---- - -## Step 3: Text Extraction (Enhanced) - -### 3.1 Garble Detection -```python -def calculate_garble_rate(text: str) -> float: - """ - Calculate the rate of garbled characters (cid:xxxx patterns). - - Returns: float between 0.0 and 1.0 - """ - if not text: - return 0.0 - - # Count (cid:xxxx) patterns - cid_pattern = r'\(cid:\d+\)' - cid_matches = re.findall(cid_pattern, text) - cid_char_count = sum(len(m) for m in cid_matches) - - # Count other garble indicators - # - Replacement character U+FFFD - # - Private Use Area characters - replacement_count = text.count('\ufffd') - pua_count = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF) - - total_garble = cid_char_count + replacement_count + pua_count - total_chars = len(text) - - return total_garble / total_chars if total_chars > 0 else 0.0 -``` - -### 3.2 Auto-fallback to OCR -```python -def should_fallback_to_ocr(page_text: str, garble_threshold: float = 0.1) -> bool: - """ - Determine if page should be processed with OCR instead of direct extraction. - - Args: - page_text: Extracted text from page - garble_threshold: Maximum acceptable garble rate (default 10%) - - Returns: - True if OCR fallback is recommended - """ - garble_rate = calculate_garble_rate(page_text) - - if garble_rate > garble_threshold: - logger.warning( - f"High garble rate detected: {garble_rate:.1%}. " - f"Recommending OCR fallback." - ) - return True - - return False -``` - ---- - -## Integration Point - -### Modified DirectExtractionEngine._extract_page() - -```python -def _extract_page(self, page: fitz.Page, page_num: int, ...) 
-> Page:
    """Extract content from a single page with preprocessing pipeline."""

    page_metadata = {}  # page-level metadata (garble flags, OCR fallback marker)

    # === Step 1: Object-level Cleaning ===

    # 1.1 Sanitize content stream
    page.clean_contents(sanitize=True)

    # 1.2 Detect white-out covered text
    covered_words = detect_whiteout_covered_text(page, iou_threshold=0.8)
    covered_bboxes = [fitz.Rect(w['bbox']) for w in covered_words]

    # === Step 2: Layout Analysis ===

    # 2.1 Extract with column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']

    # 2.2 & 2.3 Classify and filter
    filtered_blocks = filter_elements(blocks, page.rect)

    # === Step 3: Text Extraction ===

    elements = []
    full_text = ""

    for block in filtered_blocks:
        # Skip if block overlaps with covered areas
        block_rect = fitz.Rect(block['bbox'])
        if any(block_rect.intersects(cr) for cr in covered_bboxes):
            continue

        # Extract text with bbox preserved
        element = self._block_to_element(block, page_num)
        if element:
            elements.append(element)
            full_text += element.get_text() + " "

    # 3.2 Check garble rate
    if should_fallback_to_ocr(full_text):
        # Mark page for OCR processing
        page_metadata['needs_ocr'] = True

    return Page(
        page_number=page_num,
        elements=elements,
        metadata=page_metadata
    )
```

---

## Configuration

```python
@dataclass
class PreprocessingConfig:
    """Configuration for PDF preprocessing pipeline."""

    # Step 0: GS Distillation
    gs_enabled: bool = False              # Disabled by default
    gs_garble_threshold: float = 0.1      # Trigger on >10% garble
    gs_detect_duplicate_images: bool = True

    # Step 1: Object Cleaning
    sanitize_content: bool = True
    remove_hidden_layers: bool = True
    whiteout_detection: bool = True
    whiteout_iou_threshold: float = 0.8

    # Step 2: Layout Analysis
    column_aware_sort: bool = True        # Use sort=True
    filter_page_numbers: bool = True
    filter_headers: bool = False          # Keep headers by default
    filter_footers: bool = False          # Keep footers by default

    # Step 3: Text Extraction
    preserve_bbox: bool = True            # For debugging
    garble_detection: bool = True
    ocr_fallback_threshold: float = 0.1   # Fallback on >10% garble
```

---

## Testing Strategy

1. **Unit Tests**
   - White-out detection with synthetic PDFs
   - Garble rate calculation
   - Element classification accuracy

2. **Integration Tests**
   - Two-column document reading order
   - Hidden layer removal
   - GS fallback trigger conditions

3. **Regression Tests**
   - Existing task outputs should not change for clean PDFs
   - Performance benchmarks (should add <100ms per page)
diff --git a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/proposal.md b/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/proposal.md deleted file mode 100644 index 821137a..0000000 --- a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/proposal.md +++ /dev/null @@ -1,44 +0,0 @@
# Change Proposal: PDF Preprocessing Pipeline

## Summary

Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.

## Problem Statement

Current Direct track extraction has several issues:
1. **Hidden content pollution**: OCG (Optional Content Groups) layers and "white-out" covered text leak into extraction
2. **Reading order chaos**: Two-column layouts get interleaved incorrectly
3. **Vector graphics interference**: Large decorative vector elements cover text content
4.
**Corrupted PDF handling**: No fallback for structurally damaged PDFs with `(cid:xxxx)` garbled text - -## Proposed Solution - -Implement a 4-stage preprocessing pipeline: - -``` -Step 0: GS Distillation (Exception Handler - triggered on errors) -Step 1: Object-level Cleaning (P0 - Core) -Step 2: Layout Analysis (P1 - Rule-based with sort=True) -Step 3: Text Extraction (Existing, enhanced with garble detection) -``` - -## Key Features - -1. **Smart Fallback**: GS distillation only triggers on `(cid:xxxx)` garble or mupdf structural errors -2. **White-out Detection**: IoU-based overlap detection (80% threshold) to remove covered text -3. **Column-aware Sorting**: Leverage PyMuPDF's `sort=True` for automatic two-column handling -4. **Garble Rate Detection**: Auto-switch to Paddle OCR when garble rate exceeds threshold - -## Impact - -- **Files Modified**: `backend/app/services/direct_extraction_engine.py` -- **New Dependencies**: None (Ghostscript optional, already available on most systems) -- **Risk Level**: Medium (core extraction logic changes) - -## Success Criteria - -- [ ] Hidden OCG content no longer appears in extraction -- [ ] White-out covered text is correctly filtered -- [ ] Two-column documents maintain correct reading order -- [ ] Corrupted PDFs gracefully fallback to GS repair or OCR diff --git a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md b/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md deleted file mode 100644 index 7ddde20..0000000 --- a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md +++ /dev/null @@ -1,93 +0,0 @@ -# Tasks: PDF Preprocessing Pipeline - -## Phase 1: Object-level Cleaning (P0) - -### Step 1.1: Content Sanitization -- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()` -- [x] Add error handling for malformed content streams -- [x] Add logging for sanitization actions - -### Step 1.2: Hidden Layer (OCG) Removal -- [x] Implement `get_hidden_ocg_layers()` function -- [ ] Add OCG content filtering during extraction (deferred - needs test case) -- [x] Add configuration option `remove_hidden_layers` -- [x] Add logging for removed layers - -### Step 1.3: White-out Detection -- [x] Implement `detect_whiteout_covered_text()` with IoU calculation -- [x] Add white rectangle detection from `page.get_drawings()` -- [x] Integrate covered text filtering into extraction -- [x] Add configuration option `whiteout_iou_threshold` (default 0.8) -- [x] Add logging for detected white-out regions - -## Phase 2: Layout Analysis (P1) - -### Step 2.1: Column-aware Sorting -- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented) -- [x] Verify reading order improvement on test documents -- [ ] Add configuration option `column_aware_sort` (deferred - low priority) - -### Step 2.2: Element Classification -- [ ] Implement `classify_element()` function (deferred - existing detection sufficient) -- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()` -- [x] Add font-size-based classification (title detection) - via existing logic -- [x] Add page number pattern detection `_is_page_number()` -- [ ] Preserve classification in element metadata `_element_type` (deferred) - -### Step 2.3: Element Filtering -- [x] Implement `filter_elements()` function - `_filter_page_numbers()` -- [x] Add configuration options for filtering (page_numbers, headers, footers) -- [x] Add logging for filtered elements - -## Phase 3: Enhanced Extraction 
(P1) - -### Step 3.1: Bbox Preservation -- [x] Ensure all extracted elements retain bbox coordinates (already implemented) -- [x] Add bbox to UnifiedDocument element metadata -- [x] Verify bbox accuracy in generated output - -### Step 3.2: Garble Detection -- [x] Implement `calculate_garble_rate()` function -- [x] Detect `(cid:xxxx)` patterns -- [x] Detect replacement characters (U+FFFD) -- [x] Detect Private Use Area characters -- [x] Add garble rate to page metadata - -### Step 3.3: OCR Fallback -- [x] Implement `should_fallback_to_ocr()` decision function -- [x] Add configuration option `ocr_fallback_threshold` (default 0.1) -- [x] Add `get_pages_needing_ocr()` interface for callers -- [x] Add `get_extraction_quality_report()` for quality metrics -- [x] Add logging for fallback decisions - -## Phase 4: GS Distillation - Exception Handler (P2) - -### Step 0: GS Repair (Optional) -- [x] Implement `should_trigger_gs_repair()` trigger detection -- [x] Implement `repair_pdf_with_gs()` function -- [x] Add `-dDetectDuplicateImages=true` option -- [x] Add temporary file handling for repaired PDF -- [x] Implement `is_ghostscript_available()` check -- [x] Add `extract_with_repair()` method -- [x] Add fallback to normal extraction if GS not available -- [x] Add logging for GS repair actions - -## Testing - -### Unit Tests -- [ ] Test white-out detection with synthetic PDF -- [x] Test garble rate calculation -- [ ] Test element classification accuracy -- [x] Test page number pattern detection - -### Integration Tests -- [x] Test with demo_docs/edit.pdf (3 pages) -- [x] Test with demo_docs/edit2.pdf (1 page) -- [x] Test with demo_docs/edit3.pdf (2 pages) -- [x] Test quality report generation -- [x] Test GS availability check -- [x] Test end-to-end pipeline with real documents - -### Regression Tests -- [x] Verify existing clean PDFs produce same output -- [ ] Performance benchmark (<100ms overhead per page) diff --git a/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/proposal.md b/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/proposal.md deleted file mode 100644 index b8867d5..0000000 --- a/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/proposal.md +++ /dev/null @@ -1,73 +0,0 @@ -# Change: Fix OCR Track Cell Over-Detection - -## Why - -PP-StructureV3 is over-detecting table cells in OCR Track processing, incorrectly identifying regular text content (key-value pairs, bullet points, form labels) as table cells. This results in: -- 4 tables detected instead of 1 on sample document -- 105 cells detected instead of 12 (expected) -- Broken text layout and incorrect font sizing in PDF output -- Poor document reconstruction quality compared to Direct Track - -Evidence from task comparison: -- Direct Track (`cfd996d9`): 1 table, 12 cells - correct representation -- OCR Track (`62de32e0`): 4 tables, 105 cells - severe over-detection - -## What Changes - -- Add post-detection cell validation pipeline to filter false-positive cells -- Implement table structure validation using geometric patterns -- Add text density analysis to distinguish tables from key-value text -- Apply stricter confidence thresholds for cell detection -- Add cell clustering algorithm to identify isolated false-positive cells - -## Root Cause Analysis - -PP-StructureV3's cell detection models over-detect cells in structured text regions. 
Analysis of page 1: - -| Table | Cells | Density (cells/10000px²) | Avg Cell Area | Status | -|-------|-------|--------------------------|---------------|--------| -| 1 | 13 | 0.87 | 11,550 px² | Normal | -| 2 | 12 | 0.44 | 22,754 px² | Normal | -| **3** | **51** | **6.22** | **1,609 px²** | **Over-detected** | -| 4 | 29 | 0.94 | 10,629 px² | Normal | - -**Table 3 anomalies:** -- Cell density 7-14x higher than normal tables -- Average cell area only 7-14% of normal -- 150px height with 51 cells = ~3px per cell row (impossible) - -## Proposed Solution: Post-Detection Cell Validation - -Apply metric-based filtering after PP-Structure detection: - -### Filter 1: Cell Density Check -- **Threshold**: Reject tables with density > 3.0 cells/10000px² -- **Rationale**: Normal tables have 0.4-1.0 density; over-detected have 6+ - -### Filter 2: Minimum Cell Area -- **Threshold**: Reject tables with average cell area < 3,000 px² -- **Rationale**: Normal cells are 10,000-25,000 px²; over-detected are ~1,600 px² - -### Filter 3: Cell Height Validation -- **Threshold**: Reject if (table_height / cell_count) < 10px -- **Rationale**: Each cell row needs minimum height for readable text - -### Filter 4: Reclassification -- Tables failing validation are reclassified as TEXT elements -- Original text content is preserved -- Reading order is recalculated - -## Impact - -- Affected specs: `ocr-processing` -- Affected code: - - `backend/app/services/ocr_service.py` - Add cell validation pipeline - - `backend/app/services/processing_orchestrator.py` - Integrate validation - - New file: `backend/app/services/cell_validation_engine.py` - -## Success Criteria - -1. OCR Track cell count matches Direct Track within 10% tolerance -2. No false-positive tables detected from non-tabular content -3. Table structure maintains logical row/column alignment -4. PDF output quality comparable to Direct Track for documents with tables diff --git a/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/specs/ocr-processing/spec.md deleted file mode 100644 index 5eeea5a..0000000 --- a/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/specs/ocr-processing/spec.md +++ /dev/null @@ -1,64 +0,0 @@ -## ADDED Requirements - -### Requirement: Cell Over-Detection Filtering - -The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells. 
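A minimal sketch of the validation these requirements describe, assuming cell boxes arrive as (x0, y0, x1, y1) pixel tuples; the actual implementation lives in `cell_validation_engine.py` per the proposal:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def is_over_detected(
    table_box: Box,
    cell_boxes: List[Box],
    max_cell_density: float = 3.0,      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0,  # px²
    min_cell_height: float = 10.0,      # px of table height per cell
) -> bool:
    """Return True when a table's cell metrics indicate over-detection."""
    if not cell_boxes:
        return False
    x0, y0, x1, y1 = table_box
    table_area = max((x1 - x0) * (y1 - y0), 1.0)
    # Filter 1: cell density (cells per 10,000 px² of table area)
    density = len(cell_boxes) / table_area * 10_000
    # Filter 2: average cell area
    avg_area = sum((cx1 - cx0) * (cy1 - cy0)
                   for cx0, cy0, cx1, cy1 in cell_boxes) / len(cell_boxes)
    # Filter 3: vertical space available per cell row
    height_per_cell = (y1 - y0) / len(cell_boxes)
    return (density > max_cell_density
            or avg_area < min_avg_cell_area
            or height_per_cell < min_cell_height)
```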
- -#### Scenario: Cell density exceeds threshold -- **GIVEN** a table detected by PP-StructureV3 with cell_boxes -- **WHEN** cell density exceeds 3.0 cells per 10,000 px² -- **THEN** the system SHALL flag the table as over-detected -- **AND** reclassify the table as a TEXT element - -#### Scenario: Average cell area below threshold -- **GIVEN** a table detected by PP-StructureV3 -- **WHEN** average cell area is less than 3,000 px² -- **THEN** the system SHALL flag the table as over-detected -- **AND** reclassify the table as a TEXT element - -#### Scenario: Cell height too small -- **GIVEN** a table with height H and N cells -- **WHEN** (H / N) is less than 10 pixels -- **THEN** the system SHALL flag the table as over-detected -- **AND** reclassify the table as a TEXT element - -#### Scenario: Valid tables are preserved -- **GIVEN** a table with normal metrics (density < 3.0, avg area > 3000, height/N > 10) -- **WHEN** validation is applied -- **THEN** the table SHALL be preserved unchanged -- **AND** all cell_boxes SHALL be retained - -### Requirement: Table-to-Text Reclassification - -The system SHALL convert over-detected tables to TEXT elements while preserving content. - -#### Scenario: Table content is preserved -- **GIVEN** a table flagged for reclassification -- **WHEN** converting to TEXT element -- **THEN** the system SHALL extract text content from table HTML -- **AND** preserve the original bounding box -- **AND** set element type to TEXT - -#### Scenario: Reading order is recalculated -- **GIVEN** tables have been reclassified as TEXT -- **WHEN** assembling the final page structure -- **THEN** the system SHALL recalculate reading order -- **AND** sort elements by y0 then x0 coordinates - -### Requirement: Validation Configuration - -The system SHALL provide configurable thresholds for cell validation. - -#### Scenario: Default thresholds are applied -- **GIVEN** no custom configuration is provided -- **WHEN** validating tables -- **THEN** the system SHALL use default thresholds: - - max_cell_density: 3.0 cells/10000px² - - min_avg_cell_area: 3000 px² - - min_cell_height: 10 px - -#### Scenario: Custom thresholds can be configured -- **GIVEN** custom validation thresholds in configuration -- **WHEN** validating tables -- **THEN** the system SHALL use the custom values -- **AND** apply them consistently to all pages diff --git a/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/tasks.md b/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/tasks.md deleted file mode 100644 index f903310..0000000 --- a/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/tasks.md +++ /dev/null @@ -1,124 +0,0 @@ -# Tasks: Fix OCR Track Cell Over-Detection - -## Root Cause Analysis Update - -**Original assumption:** PP-Structure was over-detecting cells. - -**Actual root cause:** cell_boxes from `table_res_list` were being assigned to WRONG tables when HTML matching failed. 
The fallback used "first available" instead of bbox matching, causing: -- Table A's cell_boxes assigned to Table B -- False over-detection metrics (density 6.22 vs actual 1.65) -- Incorrect reclassification as TEXT - -## Phase 1: Cell Validation Engine - -- [x] 1.1 Create `cell_validation_engine.py` with metric-based validation -- [x] 1.2 Implement cell density calculation (cells per 10000px²) -- [x] 1.3 Implement average cell area calculation -- [x] 1.4 Implement cell height validation (table_height / cell_count) -- [x] 1.5 Add configurable thresholds with defaults: - - max_cell_density: 3.0 cells/10000px² - - min_avg_cell_area: 3000 px² - - min_cell_height: 10px -- [ ] 1.6 Unit tests for validation functions - -## Phase 2: Table Reclassification - -- [x] 2.1 Implement table-to-text reclassification logic -- [x] 2.2 Preserve original text content from HTML table -- [x] 2.3 Create TEXT element with proper bbox -- [x] 2.4 Recalculate reading order after reclassification - -## Phase 3: Integration - -- [x] 3.1 Integrate validation into OCR service pipeline (after PP-Structure) -- [x] 3.2 Add validation before cell_boxes processing -- [x] 3.3 Add debug logging for filtered tables -- [ ] 3.4 Update processing metadata with filter statistics - -## Phase 3.5: cell_boxes Matching Fix (NEW) - -- [x] 3.5.1 Fix cell_boxes matching in pp_structure_enhanced.py to use bbox overlap instead of "first available" -- [x] 3.5.2 Calculate IoU between table_res cell_boxes bounding box and layout element bbox -- [x] 3.5.3 Match tables with >10% overlap, log match quality -- [x] 3.5.4 Update validate_cell_boxes to also check table bbox boundaries, not just page boundaries - -**Results:** -- OLD: cell_boxes mismatch caused false over-detection (density=6.22) -- NEW: correct bbox matching (overlap=0.97-0.98), actual metrics (density=1.06-1.65) - -## Phase 4: Testing - -- [x] 4.1 Test with edit.pdf (sample with over-detection) -- [x] 4.2 Verify Table 3 (51 cells) - now correctly matched with density=1.65 (within threshold) -- [x] 4.3 Verify Tables 1, 2, 4 remain as tables -- [x] 4.4 Compare PDF output quality before/after -- [ ] 4.5 Regression test on other documents - -## Phase 5: cell_boxes Quality Check (NEW - 2025-12-07) - -**Problem:** PP-Structure's cell_boxes don't always form proper grids. Some tables have -overlapping cells (18-23% of cell pairs overlap), causing messy overlapping borders in PDF. - -**Solution:** Added cell overlap quality check in `_draw_table_with_cell_boxes()`: - -- [x] 5.1 Count overlapping cell pairs in cell_boxes -- [x] 5.2 Calculate overlap ratio (overlapping pairs / total pairs) -- [x] 5.3 If overlap ratio > 10%, skip cell_boxes rendering and use ReportLab Table fallback -- [x] 5.4 Text inside table regions filtered out to prevent duplicate rendering - -**Test Results (task_id: 5e04bd00-a7e4-4776-8964-0a56eaf608d8):** -- Table pp3_0_3 (13 cells): 10/78 pairs (12.8%) overlap → ReportLab fallback -- Table pp3_0_6 (29 cells): 94/406 pairs (23.2%) overlap → ReportLab fallback -- Table pp3_0_7 (12 cells): No overlap issue → Grid-based line drawing -- Table pp3_0_16 (51 cells): 233/1275 pairs (18.3%) overlap → ReportLab fallback -- 26 text regions inside tables filtered out to prevent duplicate rendering - -## Phase 6: Fix Double Rendering of Text Inside Tables (2025-12-07) - -**Problem:** Text inside table regions was rendered twice: -1. Via layout/HTML table rendering -2. 
Via raw OCR text_regions (because `regions_to_avoid` excluded tables)
-
-**Root Cause:** In `pdf_generator_service.py:1162-1169`:
-```python
-regions_to_avoid = [img for img in images_metadata if img.get('type') != 'table']
-```
-This intentionally excluded tables from filtering, causing text overlap.
-
-**Solution:**
-- [x] 6.1 Include tables in `regions_to_avoid` to filter text inside table bboxes
-- [x] 6.2 Test PDF output with fix applied
-- [x] 6.3 Verify no blank areas where tables should have content
-
-**Test Results (task_id: 2d788fca-c824-492b-95cb-35f2fedf438d):**
-- PDF size reduced 18% (59,793 → 48,772 bytes)
-- Text content reduced 66% (14,184 → 4,829 chars) - duplicate text eliminated
-- Before: "PRODUCT DESCRIPTION" appeared twice, table values duplicated
-- After: Content appears only once, clean layout
-- Table content preserved correctly via HTML table rendering
-
-## Phase 7: Smart Table Rendering Based on cell_boxes Quality (2025-12-07)
-
-**Problem:** The Phase 6 fix caused content to go largely missing: all tables were
-excluded from text rendering, but tables with bad cell_boxes quality had their content
-rendered via the ReportLab Table fallback, which might not preserve text accurately.
-
-**Solution:** Smart rendering based on cell_boxes quality:
-- Good quality cell_boxes (≤10% overlap) → Filter text, render via cell_boxes
-- Bad quality cell_boxes (>10% overlap) → Keep raw OCR text, draw table border only
-
-**Implementation:**
-- [x] 7.1 Add `_check_cell_boxes_quality()` to assess cell overlap ratio
-- [x] 7.2 Add `_draw_table_border_only()` for border-only rendering
-- [x] 7.3 Modify smart filtering in `_generate_pdf_from_data()`:
-  - Good quality tables → add to `regions_to_avoid`
-  - Bad quality tables → mark with `_use_border_only=True`
-- [x] 7.4 Add `element_id` to `table_element` in `convert_unified_document_to_ocr_data()`
-  (was missing, causing `_use_border_only` flag mismatch)
-- [x] 7.5 Modify `draw_table_region()` to check `_use_border_only` flag
-
-**Test Results (task_id: 82c7269f-aff0-493b-adac-5a87248cd949, scan.pdf):**
-- Tables pp3_0_3 and pp3_0_4 identified as bad quality → border-only rendering
-- Raw OCR text preserved and rendered at original positions
-- PDF output: 62,998 bytes with all text content visible
-- Logs confirm: `[TABLE] pp3_0_3: Drew border only (bad cell_boxes quality)`
diff --git a/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/design.md b/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/design.md
deleted file mode 100644
index 934b1c1..0000000
--- a/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/design.md
+++ /dev/null
@@ -1,240 +0,0 @@
-# Design: Refactor Dual-Track Architecture
-
-## Context
-
-Tool_OCR is a dual-track document processing system that supports:
-- **Direct Track**: extracts structured content directly from editable PDFs
-- **OCR Track**: optical character recognition via PaddleOCR + PP-StructureV3
-
-The system currently carries the following technical debt:
-- OCRService (2,326 lines) shoulders too many responsibilities
-- PDFGeneratorService (4,644 lines) is a monolithic service
-- Memory management is scattered across multiple components
-- Known bugs degrade output quality
-
-## Goals / Non-Goals
-
-### Goals
-- Fix every known bug listed in PLAN.md
-- Split OCRService into maintainable units of < 800 lines
-- Split PDFGeneratorService down to < 2,000 lines
-- Simplify memory-management configuration
-- Improve frontend state-management consistency
-
-### Non-Goals
-- No changes to existing API contracts
-- No new external dependencies
-- No database schema changes
-- No user-interface changes
-
-## Decisions
-
-### Decision 1: Use PyMuPDF find_tables() instead of custom table detection
-
-**Choice**: use PyMuPDF's built-in `page.find_tables()` API
-
-**Rationale**:
-- PyMuPDF's table detection correctly identifies merged cells
-- The returned `table.cells` structure carries span information
-- Less custom code to maintain
-
-**Alternatives**:
-- Improve the `_detect_tables_by_position()` algorithm
-  - Pros: no dependency on external API changes
-  - Cons: high complexity; hard to cover every edge case
-- Use Camelot or Tabula
-  - Pros: mature table-extraction libraries
-  - Cons: new dependencies and extra system complexity
-
-### Decision 2: Refactor the service layer with the Strategy Pattern
-
-**Choice**: introduce a ProcessingOrchestrator built on the strategy pattern
-
-```python
-class ProcessingPipeline(Protocol):
-    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
-        ...
-
-class DirectPipeline(ProcessingPipeline):
-    def __init__(self, extraction_engine: DirectExtractionEngine):
-        self.engine = extraction_engine
-
-    def process(self, file_path, options):
-        return self.engine.extract(file_path)
-
-class OCRPipeline(ProcessingPipeline):
-    def __init__(self, ocr_service: OCRService, preprocessor: LayoutPreprocessingService):
-        self.ocr = ocr_service
-        self.preprocessor = preprocessor
-
-    def process(self, file_path, options):
-        # Preprocessing + OCR + Conversion
-        ...
-
-class ProcessingOrchestrator:
-    def __init__(self, detector: DocumentTypeDetector, pipelines: dict[str, ProcessingPipeline]):
-        self.detector = detector
-        self.pipelines = pipelines
-
-    def process(self, file_path, options):
-        track = options.force_track or self.detector.detect(file_path).track
-        return self.pipelines[track].process(file_path, options)
-```
-
-**Rationale**:
-- Separation of concerns: detection, processing, and conversion stay independent
-- Testability: each Pipeline can be tested in isolation
-- Extensibility: adding a processing mode only requires a new Pipeline
-
-**Alternatives**:
-- Chain of Responsibility
-  - Pros: more flexible processing chains
-  - Cons: overkill for an either/or decision
-- Keep the status quo and only tidy the code
-  - Pros: lowest risk
-  - Cons: does not address the root problem
-
-### Decision 3: Extract PDF-generation logic into layers
-
-**Choice**: split PDFGeneratorService into three modules
-
-```
-PDFGeneratorService (main orchestration)
-├── PDFTableRenderer (table rendering)
-│   ├── HTMLTableParser (HTML table parsing)
-│   └── CellRenderer (cell rendering)
-├── PDFFontManager (font management)
-│   ├── FontLoader (font loading)
-│   └── FontFallback (font fallback)
-└── PDFLayoutEngine (page layout)
-```
-
-**Rationale**:
-- Single responsibility: each module focuses on one job
-- Reusability: FontManager can serve other services
-- Testability: table rendering can be tested on its own
-
-### Decision 4: Unified memory policy engine
-
-**Choice**: merge the memory-management components into a single MemoryPolicyEngine
-
-```python
-class MemoryPolicyEngine:
-    """Unified memory policy engine."""
-
-    def __init__(self, config: MemoryConfig):
-        self.config = config
-        self._semaphore = asyncio.Semaphore(config.max_concurrent_predictions)
-
-    @property
-    def gpu_usage_percent(self) -> float:
-        # single point of truth for GPU usage queries
-        ...
-
-    def check_availability(self) -> MemoryStatus:
-        # returns AVAILABLE, WARNING, CRITICAL, or EMERGENCY
-        ...
-
-    async def acquire_prediction_slot(self):
-        # unified concurrency control
-        ...
-
-    def cleanup_if_needed(self):
-        # clean up automatically based on status
-        ...
-
-@dataclass
-class MemoryConfig:
-    warning_threshold: float = 0.80   # 80%
-    critical_threshold: float = 0.95  # 95%
-    max_concurrent_predictions: int = 2
-    model_idle_timeout: int = 300     # 5 minutes
-```
-
-**Rationale**:
-- Fewer knobs: from 8+ configuration items down to 4 core ones
-- Simpler dependencies: services depend on a single memory engine
-- Consistent behavior: every memory decision is made in one place
-
-### Decision 5: Manage task state with Zustand
-
-**Choice**: add a TaskStore that manages task state centrally
-
-```typescript
-interface TaskState {
-  currentTaskId: string | null;
-  tasks: Record<string, TaskInfo>;
-  processingStatus: Record<string, ProcessingStatus>;
-}
-
-interface TaskActions {
-  setCurrentTask: (taskId: string) => void;
-  updateTask: (taskId: string, updates: Partial<TaskInfo>) => void;
-  updateProcessingStatus: (taskId: string, status: ProcessingStatus) => void;
-  clearTasks: () => void;
-}
-
-const useTaskStore = create<TaskState & TaskActions>()(
-  persist(
-    (set) => ({
-      currentTaskId: null,
-      tasks: {},
-      processingStatus: {},
-      // ... actions
-    }),
-    { name: 'task-storage' }
-  )
-);
-```
-
-**Rationale**:
-- Consistency: matches the existing uploadStore and authStore patterns
-- Traceability: task-state changes are managed in one place
-- Persistence: state survives a page refresh
-
-## Risks / Trade-offs
-
-| Risk | Impact | Mitigation |
-|------|--------|------------|
-| PyMuPDF find_tables() API changes | Medium | Wrap it in a standalone function that is easy to swap out |
-| Service refactor breaks processing logic | High | Keep the existing tests; refactor incrementally |
-| Memory-engine change causes OOM | High | Keep the same thresholds; change only the code structure |
-| Frontend state migration introduces bugs | Medium | Migrate page by page; test each page fully |
-
-## Migration Plan
-
-### Step 1: Bug Fixes (independently deployable)
-1. Implement the PyMuPDF find_tables() integration
-2. Fix OCR Track image paths
-3. Add cell_boxes coordinate validation
-4. Test and deploy
-
-### Step 2: Service Refactoring (independently deployable)
-1. Extract ProcessingOrchestrator
-2. Extract TableRenderer and FontManager
-3. Update OCRService to use the new components
-4. Test and deploy
-
-### Step 3: Memory Management (independently deployable)
-1. Implement MemoryPolicyEngine
-2. Migrate services to the new engine step by step
-3. Remove the old components
-4. Test and deploy
-
-### Step 4: Frontend Improvements (independently deployable)
-1. Add TaskStore
-2. Migrate ProcessingPage
-3. Migrate TaskDetailPage
-4. Merge type definitions
-5. Test and deploy
-
-### Rollback Plan
-- Each step ships independently and can be rolled back to the previous stable version if problems appear
-- Bug fixes go first to secure basic correctness
-- The refactors do not change external behavior, so rollback impact is minimal
-
-## Open Questions
-
-1. **PyMuPDF find_tables() version compatibility**: confirm that the PyMuPDF version in use supports this API
-2. **Scope of frontend state persistence**: should every task be persisted, or only the current session?
-3. **Memory threshold tuning**: are the existing thresholds production-proven and safe to carry over?
diff --git a/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/proposal.md b/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/proposal.md
deleted file mode 100644
index fdb89a9..0000000
--- a/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/proposal.md
+++ /dev/null
@@ -1,68 +0,0 @@
-# Change: Refactor Dual-Track Architecture
-
-## Why
-
-The current dual-track OCR system has several known problems and a backlog of architectural debt:
-
-1. **Direct Track table bug**: `_detect_tables_by_position()` cannot recognize merged cells, so edit3.pdf yields 204 incorrectly split cells (the correct count is 83)
-2. **OCR Track image paths lost**: `saved_path` for visual elements such as CHART/DIAGRAM is dropped during conversion, so images are never placed back into the PDF
-3. **OCR Track cell_boxes coordinates corrupted**: cell_boxes returned by PP-StructureV3 fall outside the page boundaries
-4. **Service layer overly complex**: OCRService (2,326 lines) shoulders too many responsibilities and is hard to maintain and test
-5. **PDF generator too large**: PDFGeneratorService (4,644 lines) is a monolithic service that is hard to extend
-
-## What Changes
-
-### Phase 1: Fix known bugs (priority: highest)
-
-- **Direct Track table fix**: replace `_detect_tables_by_position()` with the PyMuPDF `find_tables()` API
-- **OCR Track image path fix**: extend `_convert_pp3_element` to handle every visual element type (IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP)
-- **Cell boxes coordinate validation**: add boundary checks; fall back to CV line detection when coordinates are out of range
-- **Filter tiny decoration images**: drop images smaller than 200 px²
-- **Remove covering images**: at render time, filter out images that overlap covering_images
-
-### Phase 2: Service-layer refactoring (priority: high)
-
-- **Split OCRService**: extract a standalone `ProcessingOrchestrator` responsible for flow orchestration
-- **Introduce a Pipeline pattern**: use composition in place of the current aggregation
-- **Extract TableRenderer**: pull table-rendering logic out of PDFGeneratorService
-- **Extract FontManager**: pull font management out of PDFGeneratorService
-
-### Phase 3: Memory-management simplification (priority: medium)
-
-- **Unify the memory policy**: merge MemoryManager, MemoryGuard, and the assorted semaphores into a single policy engine
-- **Simplify configuration**: cut 8+ memory-related configuration items down to 3-4 core ones
-
-### Phase 4: Frontend state-management improvements (priority: medium)
-
-- **Add TaskStore**: manage task state with Zustand, replacing scattered useState hooks
-- **Merge type definitions**: unify api.ts and apiV2.ts into a single type-definition file
-
-## Impact
-
-- Affected specs: `document-processing`
-- Affected code:
-  - `backend/app/services/direct_extraction_engine.py` (table detection)
-  - `backend/app/services/ocr_to_unified_converter.py` (element conversion)
-  - `backend/app/services/ocr_service.py` (service orchestration)
-  - `backend/app/services/pdf_generator_service.py` (PDF generation)
-  - `backend/app/services/memory_manager.py` (memory management)
-  - `frontend/src/store/` (state management)
-  - `frontend/src/types/` (type definitions)
-
-## Risk Assessment
-
-| Risk | Severity | Mitigation |
-|------|----------|------------|
-| Table-rendering regression | High | Use edit.pdf and edit3.pdf as regression tests |
-| Memory-management change causes OOM | High | Keep the existing thresholds; refactor code structure only |
-| Service refactor breaks processing | Medium | Refactor incrementally, with full testing at each stage |
-
-## Success Metrics
-
-| Metric | Current | Target |
-|--------|---------|--------|
-| edit3.pdf Direct Track cells | 204 (wrong) | 83 (correct) |
-| OCR Track image re-placement rate | 0% | 100% |
-| cell_boxes coordinate correctness | ~40% | 100% |
-| OCRService line count | 2,326 | < 800 |
-| PDFGeneratorService line count | 4,644 | < 2,000 |
diff --git a/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/specs/document-processing/spec.md b/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/specs/document-processing/spec.md
deleted file mode 100644
index b8b46be..0000000
--- a/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/specs/document-processing/spec.md
+++ /dev/null
@@ -1,153 +0,0 @@
-# document-processing Specification Delta
-
-## ADDED Requirements
-
-### Requirement: Table Cell Merging Detection
-The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.
-
-#### Scenario: Detect merged cells in Direct Track
-- **WHEN** extracting tables from an editable PDF using Direct Track
-- **THEN** the system SHALL use PyMuPDF find_tables() API
-- **AND** correctly identify cells with rowspan > 1 or colspan > 1
-- **AND** preserve merge information in UnifiedDocument table structure
-- **AND** skip placeholder cells that are covered by merged cells
-
-#### Scenario: Handle complex table structures
-- **WHEN** processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
-- **THEN** the system SHALL NOT split merged cells into individual cells
-- **AND** the output cell count SHALL match the actual visual cell count
-- **AND** the rendered PDF SHALL display correct merged cell boundaries
-
-### Requirement: Visual Element Path Preservation
-The system SHALL preserve image paths for all visual element types during OCR conversion.
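
Before the scenarios, a rough sketch of the path-preference rule they describe (the helper and field access are assumptions drawn from this spec, not the converter's actual internals):

```python
# Illustrative only: prefer saved_path, fall back to img_path, and warn when
# neither exists. Field names follow this spec; the element dict is assumed.
import logging
from typing import Optional

logger = logging.getLogger(__name__)

VISUAL_TYPES = {"IMAGE", "FIGURE", "CHART", "DIAGRAM", "LOGO", "STAMP"}

def resolve_visual_path(element: dict) -> Optional[str]:
    path = element.get("saved_path") or element.get("img_path")
    if path is None and element.get("type") in VISUAL_TYPES:
        logger.warning("Visual element %s has no saved_path or img_path",
                       element.get("element_id", "<unknown>"))
    return path
```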
- -#### Scenario: Preserve CHART element paths -- **WHEN** converting PP-StructureV3 output containing CHART elements -- **THEN** the system SHALL treat CHART as a visual element type -- **AND** extract saved_path from the element data -- **AND** include saved_path in the UnifiedDocument content field - -#### Scenario: Support all visual element types -- **WHEN** processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP -- **THEN** the system SHALL extract saved_path or img_path for each element -- **AND** preserve path, width, height, and format in content dictionary -- **AND** enable downstream PDF generation to embed these images - -#### Scenario: Fallback path resolution -- **WHEN** a visual element has multiple path fields (saved_path, img_path) -- **THEN** the system SHALL prefer saved_path over img_path -- **AND** fallback to img_path if saved_path is missing -- **AND** log warning if both paths are missing - -### Requirement: Cell Box Coordinate Validation -The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases. - -#### Scenario: Detect out-of-bounds coordinates -- **WHEN** processing cell_boxes from PP-StructureV3 -- **THEN** the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height) -- **AND** log tables with coordinates exceeding page bounds -- **AND** mark affected cells for fallback processing - -#### Scenario: Apply CV line detection fallback -- **WHEN** cell_boxes coordinates are invalid (out of bounds) -- **THEN** the system SHALL apply OpenCV line detection as fallback -- **AND** reconstruct table structure from detected lines -- **AND** include fallback_used flag in table metadata - -#### Scenario: Coordinate normalization -- **WHEN** coordinates are within page bounds but slightly outside table bbox -- **THEN** the system SHALL clamp coordinates to table boundaries -- **AND** preserve relative cell positions -- **AND** ensure no cells overlap after normalization - -### Requirement: Decoration Image Filtering -The system SHALL filter out minimal decoration images that do not contribute meaningful content. - -#### Scenario: Filter tiny images by area -- **WHEN** extracting images from a document -- **THEN** the system SHALL calculate image area (width x height) -- **AND** filter out images with area < 200 square pixels -- **AND** log filtered image count for debugging - -#### Scenario: Configurable filtering threshold -- **WHEN** processing documents with intentionally small images -- **THEN** the system SHALL support configuration of minimum image area threshold -- **AND** default to 200 square pixels if not specified -- **AND** allow threshold = 0 to disable filtering - -### Requirement: Covering Image Removal -The system SHALL remove covering/redaction images from the final output. 
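
The covering-image scenarios below hinge on an IoU test; here is a minimal sketch of that computation (an illustrative helper, not necessarily how the engine implements it):

```python
# Sketch of the IoU test behind the covering-image scenarios (illustrative).
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)

def iou(a: Box, b: Box) -> float:
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Per the scenarios below, an image whose IoU with underlying content
# exceeds 0.8 is marked as covering and excluded from rendering.
```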
- -#### Scenario: Detect covering rectangles -- **WHEN** preprocessing a PDF page -- **THEN** the system SHALL detect black/white rectangles covering text regions -- **AND** identify covering images by high IoU (> 0.8) with underlying content -- **AND** mark covering images for exclusion - -#### Scenario: Exclude covering images from rendering -- **WHEN** generating output PDF -- **THEN** the system SHALL exclude images marked as covering -- **AND** preserve the text content that was covered -- **AND** include covering_images_removed count in metadata - -#### Scenario: Handle both black and white covering -- **WHEN** detecting covering rectangles -- **THEN** the system SHALL detect both black fill (redaction style) -- **AND** white fill (whiteout style) -- **AND** low-contrast rectangles intended to hide content - -## MODIFIED Requirements - -### Requirement: Enhanced OCR with Full PP-StructureV3 -The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates. - -#### Scenario: Extract comprehensive document structure -- **WHEN** processing through OCR track -- **THEN** the system SHALL use page_result.json['parsing_res_list'] -- **AND** extract all element types including headers, lists, tables, figures -- **AND** preserve layout_bbox coordinates for each element - -#### Scenario: Maintain reading order -- **WHEN** extracting elements from PP-StructureV3 -- **THEN** the system SHALL preserve the reading order from parsing_res_list -- **AND** assign sequential indices to elements -- **AND** support reordering for complex layouts - -#### Scenario: Extract table structure -- **WHEN** PP-StructureV3 identifies a table -- **THEN** the system SHALL extract cell content and boundaries -- **AND** validate cell_boxes coordinates against page boundaries -- **AND** apply fallback detection for invalid coordinates -- **AND** preserve table HTML for structure -- **AND** extract plain text for translation - -#### Scenario: Extract visual elements with paths -- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM) -- **THEN** the system SHALL preserve saved_path for each element -- **AND** include image dimensions and format -- **AND** enable image embedding in output PDF - -## ADDED Requirements - -### Requirement: Generate UnifiedDocument from direct extraction -The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging. 
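
As a rough illustration of the merged-cell-aware extraction the scenarios below require, a sketch assuming PyMuPDF's `find_tables()` (v1.23+); the None-skipping mirrors the scenario wording, so verify the exact spanned-cell behavior against the installed PyMuPDF version:

```python
# Rough sketch only: merged-cell-aware table extraction with PyMuPDF.
import fitz  # PyMuPDF

def extract_tables(page: fitz.Page) -> list[dict]:
    tables = []
    for table in page.find_tables().tables:
        rows = table.extract()  # nested list of cell text
        cells = [
            {"row": r, "col": c, "text": text}
            for r, row in enumerate(rows)
            for c, text in enumerate(row)
            if text is not None  # skip placeholders covered by merges
        ]
        tables.append({"bbox": tuple(table.bbox), "cells": cells})
    return tables
```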
-
-#### Scenario: Extract tables with cell merging
-- **WHEN** direct extraction encounters a table
-- **THEN** the system SHALL use PyMuPDF find_tables() API
-- **AND** extract cell content with correct rowspan/colspan
-- **AND** preserve merged cell boundaries
-- **AND** skip placeholder cells covered by merges
-
-#### Scenario: Filter decoration images
-- **WHEN** extracting images from PDF
-- **THEN** the system SHALL filter images smaller than minimum area threshold
-- **AND** exclude covering/redaction images
-- **AND** preserve meaningful content images
-
-#### Scenario: Preserve text styling with image handling
-- **WHEN** direct extraction completes
-- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
-- **AND** preserve text styling, fonts, and exact positioning
-- **AND** extract tables with cell boundaries, content, and merge info
-- **AND** include only meaningful images in output
diff --git a/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/tasks.md b/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/tasks.md
deleted file mode 100644
index e0087c3..0000000
--- a/openspec/changes/archive/2025-12-08-refactor-dual-track-architecture/tasks.md
+++ /dev/null
@@ -1,110 +0,0 @@
-# Tasks: Refactor Dual-Track Architecture
-
-## Phase 1: Fix Known Bugs (completed)
-
-### 1.1 Direct Track table fix (completed ✓)
-- [x] 1.1.1 Update `_process_native_table()` to use `table.cells` for merged-cell handling
-- [x] 1.1.2 Use the PyMuPDF `page.find_tables()` API (already in use)
-- [x] 1.1.3 Parse `table.cells` and compute `row_span`/`col_span` correctly
-- [x] 1.1.4 Handle merged-away cells (skip `None` values; build a covered grid)
-- [x] 1.1.5 Verify edit3.pdf returns the correct 83 cells ✓
-
-### 1.2 OCR Track image path fix (completed ✓)
-- [x] 1.2.1 Update `ocr_to_unified_converter.py` lines 604-613
-- [x] 1.2.2 Extend the visual-element type check: `IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP`
-- [x] 1.2.3 Prefer `saved_path`, fall back to `img_path`
-- [x] 1.2.4 Ensure the content dict includes `saved_path`, `path`, `width`, `height`, `format`
-- [x] 1.2.5 Code fixed (needs full OCR Track verification)
-- [x] 1.2.6 Code fixed (needs full OCR Track verification)
-
-### 1.3 Cell boxes coordinate validation (completed ✓)
-- [x] 1.3.1 Add a `validate_cell_boxes()` function in `ocr_to_unified_converter.py`
-- [x] 1.3.2 Check whether cell_boxes exceed the page bounds (0, 0, page_width, page_height)
-- [x] 1.3.3 Clamp out-of-range coordinates and mark them needs_fallback
-- [x] 1.3.4 Log anomalous coordinates
-- [x] 1.3.5 Unit tests confirm the validation logic is correct ✓
-
-### 1.4 Filter tiny decoration images (completed ✓)
-- [x] 1.4.1 Add an area check to image extraction in `direct_extraction_engine.py`
-- [x] 1.4.2 Filter images with `image_area < min_image_area` (default 200 px²)
-- [x] 1.4.3 Add a `min_image_area` config item so the threshold can be tuned
-- [x] 1.4.4 Verify edit3.pdf detects 3 tiny decoration images ✓
-
-### 1.5 Remove covering images (completed ✓)
-- [x] 1.5.1 Pass `covering_images` into the `_extract_images()` method
-- [x] 1.5.2 Use an IoU threshold (0.8) plus xref comparison to identify covering images
-- [x] 1.5.3 Exclude covering images from the final output
-- [x] 1.5.4 Add a `_calculate_iou()` helper method
-- [x] 1.5.5 Verify edit3.pdf detects 6 black-box covering images ✓
-
-## Phase 2: Service-Layer Refactoring (completed)
-
-### 2.1 Extract ProcessingOrchestrator (completed ✓)
-- [x] 2.1.1 Create `backend/app/services/processing_orchestrator.py`
-- [x] 2.1.2 Extract the flow-orchestration logic from OCRService
-- [x] 2.1.3 Define the `ProcessingPipeline` interface
-- [x] 2.1.4 Implement DirectPipeline and OCRPipeline
-- [x] 2.1.5 Update OCRService to use ProcessingOrchestrator
-- [x] 2.1.6 Ensure existing functionality is unaffected
-
-### 2.2 Extract TableRenderer (completed ✓)
-- [x] 2.2.1 Create `backend/app/services/pdf_table_renderer.py`
-- [x] 2.2.2 Extract HTMLTableParser from PDFGeneratorService
-- [x] 2.2.3 Move table-rendering logic into its own class
-- [x] 2.2.4 Support merged-cell rendering
-- [x] 2.2.5 Provide multiple rendering modes (HTML, cell_boxes, cells_dict, translated)
-
-### 2.3 Extract FontManager (completed ✓)
-- [x] 2.3.1 Create `backend/app/services/pdf_font_manager.py`
-- [x] 2.3.2 Extract font loading and caching logic
-- [x] 2.3.3 Extract CJK font support logic
-- [x] 2.3.4 Implement the font fallback mechanism
-- [x] 2.3.5 Use the Singleton pattern to avoid duplicate registration
-
-## Phase 3: Memory-Management Simplification (completed)
-
-### 3.1 Unified memory policy engine (completed ✓)
-- [x] 3.1.1 Create `backend/app/services/memory_policy_engine.py`
-- [x] 3.1.2 Define the unified memory-policy interface (MemoryPolicyEngine)
-- [x] 3.1.3 Merge MemoryManager and MemoryGuard logic (GPUMemoryMonitor + ModelManager)
-- [x] 3.1.4 Integrate semaphore management (PredictionSemaphore)
-- [x] 3.1.5 Trim configuration down to 7 core items (MemoryPolicyConfig)
-- [x] 3.1.6 Remove unused classes: BatchProcessor, ProgressiveLoader, PriorityOperationQueue, RecoveryManager, MemoryDumper, PrometheusMetrics
-- [x] 3.1.7 Code shrinks from ~2,270 to ~600 lines (a 73% reduction)
-
-### 3.2 Move services onto the new memory engine (completed ✓)
-- [x] 3.2.1 Update OCRService to use MemoryPolicyEngine
-- [x] 3.2.2 Update ServicePool to use MemoryPolicyEngine
-- [x] 3.2.3 Keep the old MemoryGuard as a fallback (backward compatible)
-- [x] 3.2.4 Verify GPU memory monitoring works as expected
-
-## Phase 4: Frontend State Management
-
-### 4.1 Add TaskStore (completed ✓)
-- [x] 4.1.1 Create `frontend/src/store/taskStore.ts`
-- [x] 4.1.2 Define the task-state shape (currentTaskId, recentTasks, processingState)
-- [x] 4.1.3 Implement CRUD operations and state transitions (setCurrentTask, updateTaskCache, updateTaskStatus)
-- [x] 4.1.4 Add localStorage persistence (using the zustand persist middleware)
-- [x] 4.1.5 Update ProcessingPage to use TaskStore (startProcessing, stopProcessing)
-- [x] 4.1.6 Update TaskDetailPage to use TaskStore (updateTaskCache)
-
-### 4.2 Merge type definitions (completed ✓)
-- [x] 4.2.1 Review the differences between `api.ts` and `apiV2.ts`
-- [x] 4.2.2 Merge shared type definitions into `apiV2.ts` (LoginRequest, User, FileInfo, FileResult, ExportRule, etc.)
-- [x] 4.2.3 Keep `api.ts` for V1-specific types (BatchStatus, ProcessRequest, etc.)
-- [x] 4.2.4 Update all import paths (authStore, uploadStore, ResultsTable, SettingsPage, apiV2 service)
-- [x] 4.2.5 Verify TypeScript compiles without errors ✓
-
-## Phase 5: Testing and Verification (Direct Track completed)
-
-### 5.1 Regression tests (Direct Track ✓)
-- [x] 5.1.1 Test Direct Track with edit.pdf (3 pages, 51 elements, 1 table with 12 cells) ✓
-- [x] 5.1.2 Test Direct Track merged tables with edit3.pdf (2 pages, 43 cells, 12 merged) ✓
-- [ ] 5.1.3 Test OCR Track image re-placement with edit.pdf (needs a GPU environment)
-- [ ] 5.1.4 Test OCR Track image re-placement with edit3.pdf (needs a GPU environment)
-- [x] 5.1.5 Verify all cell_boxes coordinates are correct (43 valid, 0 invalid) ✓
-
-### 5.2 Performance tests (Direct Track ✓)
-- [x] 5.2.1 Measure post-refactor processing time (edit3: 0.203s, edit: 1.281s) ✓
-- [ ] 5.2.2 Verify no significant increase in memory usage (needs a GPU environment)
-- [ ] 5.2.3 Verify normal GPU utilization (needs a GPU environment)
diff --git a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/design.md b/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/design.md
deleted file mode 100644
index fcab93b..0000000
--- a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/design.md
+++ /dev/null
@@ -1,227 +0,0 @@
-# Design: OCR Processing Presets
-
-## Architecture Overview
-
-```
-┌─────────────────────────────────────────────────────────────────┐
-│                           Frontend                              │
-├─────────────────────────────────────────────────────────────────┤
-│  ┌──────────────────┐    ┌──────────────────────────────────┐   │
-│  │ Preset Selector  │───▶│   Advanced Parameter Panel       │   │
-│  │  (Simple Mode)   │    │      (Expert Mode)               │   │
-│  └──────────────────┘    └──────────────────────────────────┘   │
-│           │                          │                          │
-│           └───────────┬───────────────┘                         │
-│                       ▼                                         │
-│              ┌─────────────────┐                                │
-│              │ OCR Config JSON │                                │
-│              └─────────────────┘                                │
-└─────────────────────────────────────────────────────────────────┘
-                        │
-                        ▼ POST /api/v2/tasks
-┌─────────────────────────────────────────────────────────────────┐
-│                           Backend                               │
-├─────────────────────────────────────────────────────────────────┤
-│  ┌──────────────────┐    ┌──────────────────────────────────┐   │
-│  │ Preset Resolver  │───▶│   OCR Config Validator           │   │
-│  └──────────────────┘    └──────────────────────────────────┘   │
-│           │                          │                          │
-│           └───────────┬───────────────┘                         │
-│                       ▼                                         │
-│              ┌─────────────────┐                                │
-│              │   OCRService    │                                │
-│              │   (with 
config) │ │ -│ └─────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────┐ │ -│ │ PPStructureV3 │ │ -│ │ (configured) │ │ -│ └─────────────────┘ │ -└─────────────────────────────────────────────────────────────────┘ -``` - -## Data Models - -### OCRPreset Enum - -```python -class OCRPreset(str, Enum): - TEXT_HEAVY = "text_heavy" # Reports, articles, manuals - DATASHEET = "datasheet" # Technical datasheets, TDS - TABLE_HEAVY = "table_heavy" # Financial reports, spreadsheets - FORM = "form" # Applications, surveys - MIXED = "mixed" # General documents - CUSTOM = "custom" # User-defined settings -``` - -### OCRConfig Model - -```python -class OCRConfig(BaseModel): - # Table Processing - table_parsing_mode: Literal["full", "conservative", "classification_only", "disabled"] = "conservative" - table_layout_threshold: float = Field(default=0.65, ge=0.0, le=1.0) - enable_wired_table: bool = True - enable_wireless_table: bool = False # Disabled by default (aggressive) - - # Layout Detection - layout_detection_model: Optional[str] = "PP-DocLayout_plus-L" - layout_threshold: Optional[float] = Field(default=None, ge=0.0, le=1.0) - layout_nms_threshold: Optional[float] = Field(default=None, ge=0.0, le=1.0) - layout_merge_mode: Optional[Literal["large", "small", "union"]] = "union" - - # Preprocessing - use_doc_orientation_classify: bool = True - use_doc_unwarping: bool = False # Causes distortion - use_textline_orientation: bool = True - - # Recognition Modules - enable_chart_recognition: bool = True - enable_formula_recognition: bool = True - enable_seal_recognition: bool = False - enable_region_detection: bool = True -``` - -### Preset Definitions - -```python -PRESET_CONFIGS: Dict[OCRPreset, OCRConfig] = { - OCRPreset.TEXT_HEAVY: OCRConfig( - table_parsing_mode="disabled", - table_layout_threshold=0.7, - enable_wired_table=False, - enable_wireless_table=False, - enable_chart_recognition=False, - enable_formula_recognition=False, - ), - OCRPreset.DATASHEET: OCRConfig( - table_parsing_mode="conservative", - table_layout_threshold=0.65, - enable_wired_table=True, - enable_wireless_table=False, # Key: disable aggressive wireless - ), - OCRPreset.TABLE_HEAVY: OCRConfig( - table_parsing_mode="full", - table_layout_threshold=0.5, - enable_wired_table=True, - enable_wireless_table=True, - ), - OCRPreset.FORM: OCRConfig( - table_parsing_mode="conservative", - table_layout_threshold=0.6, - enable_wired_table=True, - enable_wireless_table=False, - ), - OCRPreset.MIXED: OCRConfig( - table_parsing_mode="classification_only", - table_layout_threshold=0.55, - ), -} -``` - -## API Design - -### Task Creation with OCR Config - -```http -POST /api/v2/tasks -Content-Type: multipart/form-data - -file: -processing_track: "ocr" -ocr_preset: "datasheet" # Optional: use preset -ocr_config: { # Optional: override specific params - "table_layout_threshold": 0.7 -} -``` - -### Get Available Presets - -```http -GET /api/v2/ocr/presets - -Response: -{ - "presets": [ - { - "name": "datasheet", - "display_name": "Technical Datasheet", - "description": "Optimized for product specifications and technical documents", - "icon": "description", - "config": { ... } - }, - ... 
- ] -} -``` - -## Frontend Components - -### PresetSelector Component - -```tsx -interface PresetSelectorProps { - value: OCRPreset; - onChange: (preset: OCRPreset) => void; - showAdvanced: boolean; - onToggleAdvanced: () => void; -} - -// Visual preset cards with icons: -// 📄 Text Heavy - Reports & Articles -// 📊 Datasheet - Technical Documents -// 📈 Table Heavy - Financial Reports -// 📝 Form - Applications & Surveys -// 📑 Mixed - General Documents -// ⚙️ Custom - Expert Settings -``` - -### AdvancedConfigPanel Component - -```tsx -interface AdvancedConfigPanelProps { - config: OCRConfig; - onChange: (config: Partial) => void; - preset: OCRPreset; // To show which values differ from preset -} - -// Sections: -// - Table Processing (collapsed by default) -// - Layout Detection (collapsed by default) -// - Preprocessing (collapsed by default) -// - Recognition Modules (collapsed by default) -``` - -## Key Design Decisions - -### 1. Preset as Default, Custom as Exception - -Users should start with presets. Only expose advanced panel when: -- User explicitly clicks "Advanced Settings" -- User selects "Custom" preset -- User has previously saved custom settings - -### 2. Conservative Defaults - -All presets default to conservative settings: -- `enable_wireless_table: false` (most aggressive, causes cell explosion) -- `table_layout_threshold: 0.6+` (reduce false table detection) -- `use_doc_unwarping: false` (causes distortion) - -### 3. Config Inheritance - -Custom config inherits from preset, only specified fields override: -```python -final_config = PRESET_CONFIGS[preset].copy() -final_config.update(custom_overrides) -``` - -### 4. No Patch Behaviors - -All post-processing patches are disabled by default: -- `cell_validation_enabled: false` -- `gap_filling_enabled: false` -- `table_content_rebuilder_enabled: false` - -Focus on getting PP-Structure output right with proper configuration. diff --git a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/proposal.md b/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/proposal.md deleted file mode 100644 index d8bd7d4..0000000 --- a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/proposal.md +++ /dev/null @@ -1,116 +0,0 @@ -# Proposal: Add OCR Processing Presets and Parameter Configuration - -## Summary - -Add frontend UI for configuring PP-Structure OCR processing parameters with document-type presets and advanced parameter tuning. This addresses the root cause of table over-detection by allowing users to select appropriate processing modes for their document types. - -## Problem Statement - -Currently, PP-Structure's table parsing is too aggressive for many document types: -1. **Layout detection** misclassifies structured text (e.g., datasheet right columns) as tables -2. **Table cell parsing** over-segments these regions, causing "cell explosion" -3. **Post-processing patches** (cell validation, gap filling, table rebuilder) try to fix symptoms but don't address root cause -4. **No user control** - all settings are hardcoded in backend config.py - -## Proposed Solution - -### 1. 
Document Type Presets (Simple Mode) - -Provide predefined configurations for common document types: - -| Preset | Description | Table Parsing | Layout Threshold | Use Case | -|--------|-------------|---------------|------------------|----------| -| `text_heavy` | Documents with mostly paragraphs | disabled | 0.7 | Reports, articles, manuals | -| `datasheet` | Technical datasheets with tables/specs | conservative | 0.65 | Product specs, TDS | -| `table_heavy` | Documents with many tables | full | 0.5 | Financial reports, spreadsheets | -| `form` | Forms with fields | conservative | 0.6 | Applications, surveys | -| `mixed` | Mixed content documents | classification_only | 0.55 | General documents | -| `custom` | User-defined settings | user-defined | user-defined | Advanced users | - -### 2. Advanced Parameter Panel (Expert Mode) - -Expose all PP-Structure parameters for fine-tuning: - -**Table Processing:** -- `table_parsing_mode`: full / conservative / classification_only / disabled -- `table_layout_threshold`: 0.0 - 1.0 (higher = stricter table detection) -- `enable_wired_table`: true / false -- `enable_wireless_table`: true / false -- `wired_table_model`: model selection -- `wireless_table_model`: model selection - -**Layout Detection:** -- `layout_detection_model`: model selection -- `layout_threshold`: 0.0 - 1.0 -- `layout_nms_threshold`: 0.0 - 1.0 -- `layout_merge_mode`: large / small / union - -**Preprocessing:** -- `use_doc_orientation_classify`: true / false -- `use_doc_unwarping`: true / false -- `use_textline_orientation`: true / false - -**Other Recognition:** -- `enable_chart_recognition`: true / false -- `enable_formula_recognition`: true / false -- `enable_seal_recognition`: true / false - -### 3. API Endpoint - -Add endpoint to accept processing configuration: - -``` -POST /api/v2/tasks -{ - "file": ..., - "processing_track": "ocr", - "ocr_preset": "datasheet", // OR - "ocr_config": { - "table_parsing_mode": "conservative", - "table_layout_threshold": 0.65, - ... - } -} -``` - -### 4. Frontend UI Components - -1. **Preset Selector**: Dropdown with document type icons and descriptions -2. **Advanced Toggle**: Expand/collapse for parameter panel -3. **Parameter Groups**: Collapsible sections for table/layout/preprocessing -4. **Real-time Preview**: Show expected behavior based on settings - -## Benefits - -1. **Root cause fix**: Address table over-detection at the source -2. **User empowerment**: Users can optimize for their specific documents -3. **No patches needed**: Clean PP-Structure output without post-processing hacks -4. **Iterative improvement**: Users can fine-tune and share working configurations - -## Scope - -- Backend: API endpoint, preset definitions, parameter validation -- Frontend: UI components for preset selection and parameter tuning -- No changes to PP-Structure core - only configuration - -## Success Criteria - -1. Users can select appropriate preset for document type -2. OCR output matches document reality without post-processing patches -3. Advanced users can fine-tune all PP-Structure parameters -4. 
Configuration can be saved and reused - -## Risks & Mitigations - -| Risk | Mitigation | -|------|------------| -| Users overwhelmed by parameters | Default to presets, hide advanced panel | -| Wrong preset selection | Provide visual examples for each preset | -| Breaking changes | Keep backward compatibility with defaults | - -## Timeline - -Phase 1: Backend API and presets (2-3 days) -Phase 2: Frontend preset selector (1-2 days) -Phase 3: Advanced parameter panel (2-3 days) -Phase 4: Documentation and testing (1 day) diff --git a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/specs/ocr-processing/spec.md deleted file mode 100644 index eda8b3c..0000000 --- a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/specs/ocr-processing/spec.md +++ /dev/null @@ -1,96 +0,0 @@ -# OCR Processing - Delta Spec - -## ADDED Requirements - -### Requirement: REQ-OCR-PRESETS - Document Type Presets - -The system MUST provide predefined OCR processing configurations for common document types. - -Available presets: -- `text_heavy`: Optimized for text-heavy documents (reports, articles) -- `datasheet`: Optimized for technical datasheets -- `table_heavy`: Optimized for documents with many tables -- `form`: Optimized for forms and applications -- `mixed`: Balanced configuration for mixed content -- `custom`: User-defined configuration - -#### Scenario: User selects datasheet preset -- Given a user uploading a technical datasheet -- When they select the "datasheet" preset -- Then the system applies conservative table parsing mode -- And disables wireless table detection -- And sets layout threshold to 0.65 - -#### Scenario: User selects text_heavy preset -- Given a user uploading a text-heavy report -- When they select the "text_heavy" preset -- Then the system disables table recognition -- And focuses on text extraction - -### Requirement: REQ-OCR-PARAMS - Advanced Parameter Configuration - -The system MUST allow advanced users to configure individual PP-Structure parameters. - -Configurable parameters include: -- Table parsing mode (full/conservative/classification_only/disabled) -- Table layout threshold (0.0-1.0) -- Wired/wireless table detection toggles -- Layout detection model selection -- Preprocessing options (orientation, unwarping, textline) -- Recognition module toggles (chart, formula, seal) - -#### Scenario: User adjusts table layout threshold -- Given a user experiencing table over-detection -- When they increase table_layout_threshold to 0.7 -- Then fewer regions are classified as tables -- And text regions are preserved correctly - -#### Scenario: User disables wireless table detection -- Given a user processing a datasheet with cell explosion -- When they disable enable_wireless_table -- Then only bordered tables are detected -- And structured text is not split into cells - -### Requirement: REQ-OCR-API - OCR Configuration API - -The task creation API MUST accept OCR configuration parameters. 
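
A sketch of the preset/override merge this requirement implies (a hypothetical resolver; OCRPreset, OCRConfig, and PRESET_CONFIGS are the models from design.md, the module path is an assumption, and `model_copy()` assumes Pydantic v2):

```python
# Hypothetical resolver for preset + override merging; illustrative only.
from typing import Optional

from app.schemas.ocr_config import OCRConfig, OCRPreset, PRESET_CONFIGS  # assumed path

def resolve_ocr_config(preset: Optional[str],
                       overrides: Optional[dict]) -> OCRConfig:
    """Start from the preset (or plain defaults), then apply explicit overrides."""
    base = PRESET_CONFIGS.get(OCRPreset(preset), OCRConfig()) if preset else OCRConfig()
    # model_copy(update=...) does not re-validate; range checks rely on
    # OCRConfig's Field constraints at request-parsing time.
    return base.model_copy(update=overrides) if overrides else base
```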
- -API accepts: -- `ocr_preset`: Preset name to apply -- `ocr_config`: Custom configuration object (overrides preset) - -#### Scenario: Create task with preset -- Given an API request with ocr_preset="datasheet" -- When the task is created -- Then the datasheet preset configuration is applied -- And the task processes with conservative table parsing - -#### Scenario: Create task with custom config -- Given an API request with ocr_config containing custom values -- When the task is created -- Then the custom configuration overrides defaults -- And the task uses the specified parameters - -## MODIFIED Requirements - -### Requirement: REQ-OCR-DEFAULTS - Default Processing Configuration - -The system default configuration MUST be conservative to prevent over-detection. - -Default values: -- `table_parsing_mode`: "conservative" -- `table_layout_threshold`: 0.65 -- `enable_wireless_table`: false -- `use_doc_unwarping`: false - -Patch behaviors MUST be disabled by default: -- `cell_validation_enabled`: false -- `gap_filling_enabled`: false -- `table_content_rebuilder_enabled`: false - -#### Scenario: New task uses conservative defaults -- Given a task created without specifying OCR configuration -- When the task is processed -- Then conservative table parsing is used -- And wireless table detection is disabled -- And no post-processing patches are applied diff --git a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/tasks.md b/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/tasks.md deleted file mode 100644 index 535693d..0000000 --- a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/tasks.md +++ /dev/null @@ -1,75 +0,0 @@ -# Tasks: Add OCR Processing Presets - -## Phase 1: Backend API and Presets - -- [x] Define preset configurations as Pydantic models - - [x] Create `OCRPreset` enum with preset names - - [x] Create `OCRConfig` model with all configurable parameters - - [x] Define preset mappings (preset name -> config values) - -- [x] Update task creation API - - [x] Add `ocr_preset` optional parameter - - [x] Add `ocr_config` optional parameter for custom settings - - [x] Validate preset/config combinations - - [x] Apply configuration to OCR service - -- [x] Implement preset configuration loader - - [x] Load preset from enum name - - [x] Merge custom config with preset defaults - - [x] Validate parameter ranges - -- [x] Remove/disable patch behaviors (already done) - - [x] Disable cell_validation_enabled (default=False) - - [x] Disable gap_filling_enabled (default=False) - - [x] Disable table_content_rebuilder_enabled (default=False) - -## Phase 2: Frontend Preset Selector - -- [x] Create preset selection component - - [x] Card selector with document type icons - - [x] Preset description and use case tooltips - - [x] Visual preview of expected behavior (info box) - -- [x] Integrate with processing flow - - [x] Add preset selection to ProcessingPage - - [x] Pass selected preset to API - - [x] Default to 'datasheet' preset - -- [x] Add preset management - - [x] List available presets in grid layout - - [x] Show recommended preset (datasheet) - - [x] Allow preset change before processing - -## Phase 3: Advanced Parameter Panel - -- [x] Create parameter configuration component - - [x] Collapsible "Advanced Settings" section - - [x] Group parameters by category (Table, Layout, Preprocessing) - - [x] Input controls for each parameter type - -- [x] Implement parameter validation - - [x] Client-side input validation - - [x] Disabled state when preset != 
custom
-  - [x] Reset hint when not in custom mode
-
-- [x] Add parameter tooltips
-  - [x] Chinese labels for all parameters
-  - [x] Help text for custom mode
-  - [x] Info box with usage notes
-
-## Phase 4: Documentation and Testing
-
-- [x] Create user documentation
-  - [x] Preset selection guide
-  - [x] Parameter reference
-  - [x] Troubleshooting common issues
-
-- [x] Add API documentation
-  - [x] OpenAPI spec auto-generated by FastAPI
-  - [x] Pydantic models provide schema documentation
-  - [x] Field descriptions in OCRConfig
-
-- [x] Test with various document types
-  - [x] Verify datasheet processing with conservative mode (see test-notes.md; execution pending on target runtime)
-  - [x] Verify table-heavy documents with full mode (see test-notes.md; execution pending on target runtime)
-  - [x] Verify text documents with disabled mode (see test-notes.md; execution pending on target runtime)
diff --git a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/test-notes.md b/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/test-notes.md
deleted file mode 100644
index 00579f3..0000000
--- a/openspec/changes/archive/2025-12-10-add-ocr-processing-presets/test-notes.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Test Notes – Add OCR Processing Presets
-
-Status: Manual execution not run in this environment (Paddle models/GPU not available here). Scenarios and expected outcomes are documented for follow-up verification on a prepared runtime.
-
-| Scenario | Input | Preset / Config | Expected | Status |
-| --- | --- | --- | --- | --- |
-| Datasheet, conservative parsing | `demo_docs/edit3.pdf` | `ocr_preset=datasheet` (conservative, wireless off) | Tables detected without over-segmentation; layout intact | Pending (run on target runtime) |
-| Table-heavy | `demo_docs/edit2.pdf` or a financial-report sample | `ocr_preset=table_heavy` (full, wireless on) | All tables detected, merged cells preserved; no obvious misses | Pending (run on target runtime) |
-| Plain text | `demo_docs/scan.pdf` | `ocr_preset=text_heavy` (tables disabled, charts/formula off) | Only text blocks in the output; no table/chart elements | Pending (run on target runtime) |
-
-Suggested validation steps:
-1) Select the matching preset in the frontend and start processing, or submit `ocr_preset`/`ocr_config` via the API.
-2) Confirm the result JSON/Markdown matches the expected behavior (table count, element types, no over-segmentation).
-3) If tuning is needed, switch to `custom` and override `table_parsing_mode`, `enable_wireless_table`, or `layout_threshold`, then retry.
diff --git a/openspec/changes/archive/2025-12-11-cleanup-dead-code/proposal.md b/openspec/changes/archive/2025-12-11-cleanup-dead-code/proposal.md
deleted file mode 100644
index 411b28b..0000000
--- a/openspec/changes/archive/2025-12-11-cleanup-dead-code/proposal.md
+++ /dev/null
@@ -1,175 +0,0 @@
-# Change: Cleanup Dead Code and Improve Code Quality
-
-## Why
-
-A deep code audit surfaced the following problems in the project:
-1. A superseded service file that was never deleted (507 lines)
-2. Stale configuration items (marked deprecated but never removed)
-3. Duplicated bbox-handling logic scattered across 4 files
-4. Unused imports and type-assertion problems
-5. Multiple TODO markers that need resolving or removing
-6. **Disabled features and patch code around Paddle/PP-Structure**
-
-This proposal systematically removes this dead code to improve code quality and maintainability.
-
-## What Changes
-
-### Phase 1: Delete superseded files (high priority)
-
-| File | Lines | Reason |
-|------|-------|--------|
-| `backend/app/services/pdf_generator.py` | 507 | Fully superseded by `pdf_generator_service.py`; no remaining references |
-
-### Phase 2: Remove stale configuration (high priority)
-
-| File | Config item | Reason |
-|------|-------------|--------|
-| `backend/app/core/config.py` | `gap_filling_iou_threshold` | Obsolete; the IoA threshold should be used instead |
-| `backend/app/core/config.py` | `gap_filling_dedup_iou_threshold` | Obsolete; use `gap_filling_dedup_ioa_threshold` |
-
-### Phase 3: Extract shared bbox utilities (medium priority)
-
-Create `backend/app/utils/bbox_utils.py` to unify the duplicated logic at:
-
-| File | Function | Line |
-|------|----------|------|
-| `gap_filling_service.py` | `normalized_bbox` property | L51 |
-| `pdf_generator_service.py` | `_get_bbox_coords` | L1859 |
-| `pp_structure_debug.py` | `_normalize_bbox` | L240 |
-| `text_region_renderer.py` | `get_bbox_as_rect` | L162 |
-
-### Phase 4: Frontend code cleanup (low priority)
-
-| File | Problem | Line |
-|------|---------|------|
-| `ExportPage.tsx` | Unused `CardDescription` import | L5 |
-| `UploadPage.tsx` | `as any` type assertion + TODO | L32-34 |
-| `TaskHistoryPage.tsx` | `as any` type assertion | L337 |
-| `useTaskValidation.ts` | `as any` type assertion | L61 |
-
-### Phase 5: Remove disabled table-patch features (medium priority)
-
-The following features are patches over defects in PP-Structure output; they are disabled and should not be used again:
-
-| Service file | Config item | State | Purpose | Recommendation |
-|--------------|-------------|-------|---------|----------------|
-| `cell_validation_engine.py` | `cell_validation_enabled` | False | Filters over-detected table cells | **Deletable** - improve PP-Structure instead of patching |
-| `table_content_rebuilder.py` | `table_content_rebuilder_enabled` | False | Rebuilds table HTML from raw OCR | **Deletable** - patch behavior |
-| - | `table_quality_check_enabled` | False | Cell-box quality check | **Remove config** - never fully implemented |
-| - | `table_rendering_prefer_cellboxes` | False | Algorithm needs improvement | **Remove config** - algorithm is flawed |
-
-### Phase 6: Review PP-Structure model usage (needs discussion)
-
-#### Models currently in use (11)
-
-**Required models (3) - core OCR functionality**
-| Model | Purpose | State |
-|-------|---------|-------|
-| `PP-DocLayout_plus-L` | Layout detection | **Required** |
-| `PP-OCRv5_server_det` | Text detection | **Required** |
-| `PP-OCRv5_server_rec` | Text recognition | **Required** |
-
-**Table models (5) - optional but enabled**
-| Model | Purpose | State | Memory |
-|-------|---------|-------|--------|
-| `SLANeXt_wired` | Wired (bordered) table structure recognition | Enabled | ~350MB |
-| `SLANeXt_wireless` | Wireless (borderless) table structure recognition | **Disabled in conservative mode** | ~350MB |
-| `PP-LCNet_x1_0_table_cls` | Table classification | Enabled | ~50MB |
-| `RT-DETR-L_wired_table_cell_det` | Wired-table cell detection | Enabled | Shared |
-| `RT-DETR-L_wireless_table_cell_det` | Wireless-table cell detection | **Disabled in conservative mode** | Shared |
-
-**Enhancement models (2) - optional**
-| Model | Purpose | State | Needed? |
-|-------|---------|-------|---------|
-| `PP-FormulaNet_plus-L` | Formula to LaTeX | Enabled | As needed; disabling saves ~300MB |
-| `PP-Chart2Table` | Chart to table | Enabled | As needed; disabling saves ~200MB |
-
-**Preprocessing models (3)**
-| Model | Purpose | State | Recommendation |
-|-------|---------|-------|----------------|
-| `PP-LCNet_x1_0_doc_ori` | Document orientation detection | Enabled | Keep |
-| `PP-LCNet_x1_0_textline_ori` | Text-line orientation detection | Enabled | Keep |
-| `UVDoc` | Document unwarping | **Disabled** | **Config removable** - distorts documents |
-
-#### Disabled gap-filling feature
-
-| Config item | State | Related code | Recommendation |
-|-------------|-------|--------------|----------------|
-| `gap_filling_enabled` | False | `gap_filling_service.py` | Keep the code as an optional enhancement |
-| `gap_filling_iou_threshold` | Obsolete | config.py | **Delete** - superseded by the IoA threshold |
-| `gap_filling_dedup_iou_threshold` | Obsolete | config.py | **Delete** - superseded by the IoA threshold |
-
-## Impact
-
-- **Affected specs**: none (pure code cleanup; no behavior change)
-- **Affected code**:
-  - Backend: delete 1-3 files, modify config.py, create bbox_utils.py
-  - Frontend: modify 4 files (type improvements)
-- **Memory impact**: removing the wireless-table models could save ~700MB of GPU memory
-
-## Benefits
-
-- Removes roughly **600-1,500 lines** of redundant code (depending on the scope of Phases 5-6)
-- Unifies bbox handling, eliminating **80-100 lines** of duplication
-- Improves TypeScript type safety
-- Removes stale configuration and patch code, cutting the maintenance burden
-- Slims the PP-Structure model configuration for readability
-
-## Risk Assessment
-
-- **Risk level**: low-medium
-- **Phases 1-2**: no risk (deleting unused code)
-- **Phase 3**: low risk (refactor; needs testing)
-- **Phase 4**: low risk (type improvements)
-- **Phase 5**: low risk (deleting disabled patch code)
-- **Phase 6**: medium risk (must confirm the models are no longer needed)
-- **Rollback strategy**: Git revert
-
-## Paddle/PP-Structure Usage Summary
-
-### Files using Paddle directly (only 3)
-
-| File | Lines | Role |
-|------|-------|------|
-| `ocr_service.py` | ~2,590 | OCR engine management, GPU configuration, model unloading |
-| `pp_structure_enhanced.py` | ~1,324 | PP-StructureV3 result parsing, element extraction |
-| `memory_manager.py` | ~2,269 | GPU memory monitoring, multi-backend support |
-
-### Table parsing modes (table_parsing_mode)
-
-| Mode | Description | Suited for |
-|------|-------------|------------|
-| `full` | Aggressive, complete table detection | Table-dense documents |
-| `conservative` | **Currently used**; disables wireless tables | Mixed documents |
-| `classification_only` | Detects table regions only, no structure parsing | Datasheets/spreadsheets |
-| `disabled` | Table recognition fully off | Plain-text documents |
-
-### Patch vs. core functionality
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│ Core functionality (must keep)                              │
-├─────────────────────────────────────────────────────────────┤
-│ • PaddleOCR text recognition                                │
-│ • PP-DocLayout layout detection                             │
-│ • SLANeXt table structure recognition                       │
-│ • Memory management and automatic unloading                 │
-└─────────────────────────────────────────────────────────────┘
-
-┌─────────────────────────────────────────────────────────────┐
-│ Patch features (recommend removing)                         │
-├─────────────────────────────────────────────────────────────┤
-│ • cell_validation_engine.py - over-detection filtering      │
-│ • table_content_rebuilder.py - table content rebuilding     │
-│ • table_quality_check - never fully implemented             │
-│ • table_rendering_prefer_cellboxes - flawed algorithm       │
-└─────────────────────────────────────────────────────────────┘
-
-┌─────────────────────────────────────────────────────────────┐
-│ Optional enhancements (keep code, enable on demand)         │
-├─────────────────────────────────────────────────────────────┤
-│ • gap_filling_service.py - OCR for missed regions           │
-│ • PP-FormulaNet - formula recognition                       │
-│ • PP-Chart2Table - chart recognition                        │
-└─────────────────────────────────────────────────────────────┘
-```
diff --git a/openspec/changes/archive/2025-12-11-cleanup-dead-code/specs/document-processing/spec.md b/openspec/changes/archive/2025-12-11-cleanup-dead-code/specs/document-processing/spec.md
deleted file mode 100644
index 2d467ce..0000000
--- a/openspec/changes/archive/2025-12-11-cleanup-dead-code/specs/document-processing/spec.md
+++ /dev/null
@@ -1,42 +0,0 @@
-## REMOVED Requirements
-
-### Requirement: Legacy PDF Generator Service
-
-**Reason**: `pdf_generator.py` (507 lines) was the original PDF generation implementation using Pandoc/WeasyPrint. It has been completely superseded by `pdf_generator_service.py` which uses ReportLab for low-level PDF generation with full layout preservation, table rendering, and image support.
-
-**Migration**: No migration needed. The new `pdf_generator_service.py` provides all functionality with improved features.
-
-#### Scenario: Legacy PDF generator file removal
-- **WHEN** the legacy `pdf_generator.py` file is removed
-- **THEN** the system continues to function normally using `pdf_generator_service.py`
-- **AND** PDF generation works correctly with layout preservation
-- **AND** no import errors occur in any service or router
-
-### Requirement: Deprecated IoU Configuration Parameters
-
-**Reason**: `gap_filling_iou_threshold` and `gap_filling_dedup_iou_threshold` are deprecated configuration parameters that should be replaced by IoA (Intersection over Area) thresholds for better accuracy.
-
-**Migration**: Use `gap_filling_dedup_ioa_threshold` instead.
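
One way to see why IoA suits deduplication better than IoU, as a sketch rather than the shipped gap_filling_service code: IoU divides by the union, so a small OCR fragment fully contained inside a larger detected element can still score low and slip past an IoU threshold, whereas its IoA is exactly 1.0.

```python
# Illustrative contrast between IoU and IoA (not the actual service code).
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)

def _intersection(a: Box, b: Box) -> float:
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

def ioa(a: Box, b: Box) -> float:
    """Intersection over the area of `a`, the candidate region being deduped."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return _intersection(a, b) / area_a if area_a > 0 else 0.0
```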
- -#### Scenario: Deprecated config removal -- **WHEN** the deprecated IoU configuration parameters are removed from config.py -- **THEN** gap filling service uses IoA-based thresholds -- **AND** the system starts without configuration errors - -## ADDED Requirements - -### Requirement: Unified Bbox Utility Module - -The system SHALL provide a centralized bbox utility module (`backend/app/utils/bbox_utils.py`) for consistent bounding box normalization across all services. - -#### Scenario: Bbox normalization from polygon format -- **WHEN** a bbox in polygon format `[[x1,y1], [x2,y2], [x3,y3], [x4,y4]]` is provided -- **THEN** the utility returns normalized tuple `(x0, y0, x1, y1)` representing min/max coordinates - -#### Scenario: Bbox normalization from flat array -- **WHEN** a bbox in flat array format `[x0, y0, x1, y1]` is provided -- **THEN** the utility returns normalized tuple `(x0, y0, x1, y1)` - -#### Scenario: Bbox normalization from 8-point polygon -- **WHEN** a bbox in 8-point format `[x1, y1, x2, y2, x3, y3, x4, y4]` is provided -- **THEN** the utility calculates and returns normalized tuple `(min_x, min_y, max_x, max_y)` diff --git a/openspec/changes/archive/2025-12-11-cleanup-dead-code/tasks.md b/openspec/changes/archive/2025-12-11-cleanup-dead-code/tasks.md deleted file mode 100644 index 0c1a307..0000000 --- a/openspec/changes/archive/2025-12-11-cleanup-dead-code/tasks.md +++ /dev/null @@ -1,92 +0,0 @@ -# Tasks: Cleanup Dead Code and Improve Code Quality - -## Phase 1: 刪除廢棄文件 (高優先級, ~30分鐘) - -- [x] 1.1 確認 `pdf_generator.py` 無任何引用 -- [x] 1.2 刪除 `backend/app/services/pdf_generator.py` -- [x] 1.3 驗證後端啟動正常 - -## Phase 2: 移除過時配置 (高優先級, ~15分鐘) - -- [x] 2.1 移除 `config.py` 中的 `gap_filling_iou_threshold` -- [x] 2.2 移除 `config.py` 中的 `gap_filling_dedup_iou_threshold` -- [x] 2.3 搜索並更新任何使用這些配置的代碼 -- [x] 2.4 驗證後端啟動正常 - -## Phase 3: 提取共用 bbox 工具函數 (中優先級, ~2小時) - -- [x] 3.1 創建 `backend/app/utils/__init__.py`(如不存在) -- [x] 3.2 創建 `backend/app/utils/bbox_utils.py`,實現統一的 bbox 處理函數 -- [x] 3.3 重構 `gap_filling_service.py` 使用共用函數 -- [x] 3.4 重構 `pdf_generator_service.py` 使用共用函數 -- [x] 3.5 重構 `pp_structure_debug.py` 使用共用函數 -- [x] 3.6 重構 `text_region_renderer.py` 使用共用函數 -- [x] 3.7 測試所有相關功能正常 - -## Phase 4: 前端代碼清理 (低優先級, ~1小時) - -- [x] 4.1 移除 `ExportPage.tsx` 中未使用的 `CardDescription` import (SKIPPED - actually used) -- [x] 4.2 重構 `UploadPage.tsx` 的 `as any` 類型斷言 (improved to `as unknown as number`) -- [x] 4.3 處理或移除 `UploadPage.tsx` 中的 TODO 註釋 (comment improved) -- [x] 4.4 重構 `TaskHistoryPage.tsx` 的 `as any` 類型斷言 (changed to `as TaskStatus | 'all'`) -- [x] 4.5 重構 `useTaskValidation.ts` 的 `as any` 類型斷言 (using `instanceof AxiosError`) -- [x] 4.6 驗證前端編譯正常 (pre-existing errors not from our changes) - -## Phase 5: 清理禁用的表格補丁功能 (中優先級, ~1小時) - -- [x] 5.1 移除 `cell_validation_engine.py` 整個文件(已禁用的補丁功能) -- [x] 5.2 移除 `table_content_rebuilder.py` 整個文件(已禁用的補丁功能) -- [x] 5.3 移除 `config.py` 中的 `cell_validation_enabled` 配置 -- [x] 5.4 移除 `config.py` 中的 `table_content_rebuilder_enabled` 配置 -- [x] 5.5 移除 `config.py` 中的 `table_quality_check_enabled` 配置 -- [x] 5.6 移除 `config.py` 中的 `table_rendering_prefer_cellboxes` 配置 -- [x] 5.7 搜索並清理所有引用這些配置的代碼 -- [x] 5.8 驗證後端啟動正常 - -## Phase 6: 評估 PP-Structure 模型使用 (需討論, ~2小時) - -### 6.1 必需模型 (不可移除) -- [x] 6.1.1 確認 `PP-DocLayout_plus-L` 佈局檢測使用中 -- [x] 6.1.2 確認 `PP-OCRv5_server_det` 文本檢測使用中 -- [x] 6.1.3 確認 `PP-OCRv5_server_rec` 文本識別使用中 - -### 6.2 表格相關模型 (評估是否需要) -- [x] 6.2.1 評估 `SLANeXt_wired` 有邊框表格結構識別 (保留 - 核心功能) -- [x] 6.2.2 評估 `SLANeXt_wireless` 無邊框表格結構識別(保守模式下已禁用)(保留配置) -- [x] 6.2.3 評估 
`PP-LCNet_x1_0_table_cls` table classifier (keep - core feature)
-- [x] 6.2.4 Evaluate `RT-DETR-L_wired_table_cell_det` wired-table cell detection (keep - core feature)
-- [x] 6.2.5 Evaluate `RT-DETR-L_wireless_table_cell_det` wireless-table cell detection (disabled in conservative mode) (keep configuration)
-
-### 6.3 Enhancement models (optional, may be disabled)
-- [x] 6.3.1 Evaluate `PP-FormulaNet_plus-L` formula recognition (~300MB) (keep - optional feature)
-- [x] 6.3.2 Evaluate `PP-Chart2Table` chart recognition (~200MB) (keep - optional feature)
-
-### 6.4 Preprocessing models
-- [x] 6.4.1 Confirm `PP-LCNet_x1_0_doc_ori` document orientation detection is in use
-- [x] 6.4.2 Confirm `PP-LCNet_x1_0_textline_ori` text-line orientation detection is in use
-- [x] 6.4.3 Remove the `UVDoc` document dewarping configuration (kept - disabled but optional)
-
-### 6.5 Clean up outdated Gap Filling configuration
-- [x] 6.5.1 Confirm the `gap_filling_service.py` code is kept (optional enhancement feature)
-- [x] 6.5.2 Remove the outdated IoU-related settings (already handled in Phase 2)
-
-## Verification
-
-- [x] Backend service starts normally
-- [x] Frontend compiles (pre-existing TypeScript errors not from our changes)
-- [ ] OCR processing works (Direct Track + OCR Track) - needs manual testing
-- [ ] PDF generation works - needs manual testing
-- [ ] Table rendering works (conservative mode) - needs manual testing
-- [ ] GPU memory usage is normal - needs manual testing
-
-## Summary
-
-| Phase | Lines actually removed | Complexity | Notes |
-|-------|------------------------|------------|-------|
-| Phase 1 | 507 | Low | Deleted the obsolete pdf_generator.py |
-| Phase 2 | ~10 | Low | Removed outdated IoU settings and their references |
-| Phase 3 | ~80 (duplication saved) | Medium | Extracted shared bbox utilities; added bbox_utils.py |
-| Phase 4 | ~5 | Low | Frontend type improvements |
-| Phase 5 | ~1,450 | Medium | Removed disabled patching features (583+806+configs) |
-| Phase 6 | 0 | Low | Evaluation complete; model configuration kept |
-| **Total** | **~2,050** | - | - |
diff --git a/openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/proposal.md b/openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/proposal.md
deleted file mode 100644
index 59a98d8..0000000
--- a/openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/proposal.md
+++ /dev/null
@@ -1,52 +0,0 @@
-# Enable Document Orientation Detection
-
-## Summary
-Enable PP-StructureV3's document orientation classification feature to correctly handle PDF scans where the content orientation differs from the PDF page metadata.
-
-## Problem Statement
-Currently, when a portrait-oriented PDF contains landscape-scanned content (or vice versa), the OCR system produces incorrect results because:
-
-1. **pdf2image** extracts images based on PDF metadata (e.g., `Page size: 1242 x 1755`, `Page rot: 0`)
-2. **PP-StructureV3** has `use_doc_orientation_classify=False` (disabled)
-3. The OCR attempts to read sideways text, resulting in poor recognition
-4. The output PDF has the wrong page dimensions
-
-### Example Scenario
-- Input: Portrait PDF (1242 x 1755) containing a landscape-scanned delivery form
-- Current output: Portrait PDF with unreadable/incorrect text
-- Expected output: Landscape PDF (1755 x 1242) with correctly oriented text
-
-## Proposed Solution
-Enable document orientation detection in PP-StructureV3 and adjust page dimensions based on the detected rotation:
-
-1. **Enable orientation detection**: Set `use_doc_orientation_classify=True` in config
-2. **Capture rotation info**: Extract the detected rotation angle (0°/90°/180°/270°) from PP-StructureV3 results
-3. **Adjust dimensions**: When a 90° or 270° rotation is detected, swap width and height for the output PDF
-4.
**Use OCR coordinates directly**: PP-StructureV3 returns coordinates based on the rotated image, so no coordinate transformation is needed - -## PP-StructureV3 Orientation Detection Details -According to PaddleOCR documentation: -- **Stage 1 preprocessing**: `use_doc_orientation_classify` detects and rotates the entire page -- **Output format**: `doc_preprocessor_res` contains: - - `class_ids`: [0-3] corresponding to [0°, 90°, 180°, 270°] - - `label_names`: ["0", "90", "180", "270"] - - `scores`: confidence scores -- **Model accuracy**: PP-LCNet_x1_0_doc_ori achieves 99.06% top-1 accuracy - -## Scope -- Backend only (no frontend changes required) -- Affects OCR track processing -- Does not affect Direct or Hybrid track - -## Risks and Mitigations -| Risk | Mitigation | -|------|------------| -| Model might incorrectly classify mixed-orientation pages | 99.06% accuracy is acceptable; `use_textline_orientation` (already enabled) handles per-line correction | -| Coordinate mismatch in edge cases | Thorough testing with portrait, landscape, and mixed documents | -| Performance overhead | Orientation classification adds ~100ms per page (negligible vs total OCR time) | - -## Success Criteria -1. Portrait PDF with landscape content produces landscape output PDF -2. Landscape PDF with portrait content produces portrait output PDF -3. Normal orientation documents continue to work correctly -4. Text recognition accuracy improves for rotated documents diff --git a/openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/specs/ocr-processing/spec.md deleted file mode 100644 index 7f3dbf0..0000000 --- a/openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/specs/ocr-processing/spec.md +++ /dev/null @@ -1,80 +0,0 @@ -# ocr-processing Specification Delta - -## ADDED Requirements - -### Requirement: Document Orientation Detection - -The system SHALL detect and correct document orientation for scanned PDFs where the content orientation differs from PDF page metadata. 
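-
-A minimal sketch of the two helpers implied by steps 2-3 of the proposal above. The `doc_preprocessor_res` layout follows the PaddleOCR documentation quoted earlier; the function names themselves are illustrative, not the actual service API:
-
-```python
-def extract_rotation(page_res: dict) -> int:
-    """Read the detected rotation angle (0/90/180/270) from a PP-StructureV3 result."""
-    pre = page_res.get("doc_preprocessor_res") or {}
-    labels = pre.get("label_names") or ["0"]  # e.g. ["90"]
-    return int(labels[0])
-
-def adjusted_page_size(width: float, height: float, rotation: int) -> tuple:
-    """Swap width and height when the content was rotated 90 or 270 degrees."""
-    return (height, width) if rotation in (90, 270) else (width, height)
-```
-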
- -#### Scenario: Portrait PDF with landscape content is corrected -- **GIVEN** a PDF with portrait page dimensions (width < height) -- **AND** the scanned content is rotated 90° (landscape scan in portrait page) -- **WHEN** PP-StructureV3 processes the image with `use_doc_orientation_classify=True` -- **THEN** the system SHALL detect rotation angle as "90" or "270" -- **AND** the output PDF page dimensions SHALL be swapped (width ↔ height) -- **AND** all text elements SHALL be correctly positioned in the rotated coordinate space - -#### Scenario: Landscape PDF with portrait content is corrected -- **GIVEN** a PDF with landscape page dimensions (width > height) -- **AND** the scanned content is rotated 90° (portrait scan in landscape page) -- **WHEN** PP-StructureV3 processes the image -- **THEN** the system SHALL detect rotation angle as "90" or "270" -- **AND** the output PDF page dimensions SHALL be swapped -- **AND** all text elements SHALL be correctly positioned - -#### Scenario: Upside-down content is corrected -- **GIVEN** a scanned document that is upside down (180° rotation) -- **WHEN** PP-StructureV3 processes the image -- **THEN** the system SHALL detect rotation angle as "180" -- **AND** page dimensions SHALL NOT be swapped (orientation is same, just flipped) -- **AND** text elements SHALL be correctly positioned after internal rotation - -#### Scenario: Correctly oriented documents remain unchanged -- **GIVEN** a PDF where page metadata matches actual content orientation -- **WHEN** PP-StructureV3 processes the image -- **THEN** the system SHALL detect rotation angle as "0" -- **AND** page dimensions SHALL remain unchanged -- **AND** processing SHALL proceed normally without dimension adjustment - -#### Scenario: Rotation angle is captured from PP-StructureV3 results -- **GIVEN** PP-StructureV3 is configured with `use_doc_orientation_classify=True` -- **WHEN** processing completes -- **THEN** the system SHALL extract rotation angle from `doc_preprocessor_res.label_names` -- **AND** include `detected_rotation` in the OCR result metadata -- **AND** log the detected rotation for debugging - -#### Scenario: Dimension adjustment happens before PDF generation -- **GIVEN** OCR processing detects rotation angle of "90" or "270" -- **WHEN** creating the UnifiedDocument for PDF generation -- **THEN** the Page dimensions SHALL use adjusted (swapped) width and height -- **AND** OCR coordinates SHALL be used directly (already in rotated space) -- **AND** no additional coordinate transformation is needed - -### Requirement: Orientation Detection Configuration - -The system SHALL provide configuration for enabling/disabling document orientation detection. 
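-
-The scenarios below assume a settings flag; a sketch of what that field could look like in `backend/app/core/config.py`, assuming the pydantic `Field` style the tasks file quotes (the surrounding Settings class is illustrative):
-
-```python
-from pydantic import Field
-from pydantic_settings import BaseSettings
-
-class Settings(BaseSettings):
-    # Stage 1 preprocessing: classify whole-page orientation (0/90/180/270)
-    use_doc_orientation_classify: bool = Field(
-        default=True,
-        description="Enable PP-StructureV3 document orientation classification",
-    )
-```
-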
- -#### Scenario: Orientation detection is enabled by default -- **GIVEN** default configuration settings -- **WHEN** OCR track processing runs -- **THEN** `use_doc_orientation_classify` SHALL be `True` -- **AND** PP-StructureV3 SHALL perform document orientation classification - -#### Scenario: Orientation detection can be disabled -- **GIVEN** `use_doc_orientation_classify` is set to `False` in configuration -- **WHEN** OCR track processing runs -- **THEN** the system SHALL NOT perform orientation detection -- **AND** page dimensions SHALL be based on original image dimensions -- **AND** this maintains backward compatibility for controlled environments - -## MODIFIED Requirements - -### Requirement: Layout Model Selection (Modified) - -The system SHALL apply document orientation detection before layout detection regardless of the selected layout model. - -#### Scenario: Orientation detection works with all layout models -- **GIVEN** a user selects any layout model (chinese, default, cdla) -- **WHEN** OCR processing runs with `use_doc_orientation_classify=True` -- **THEN** orientation detection SHALL be applied regardless of layout model choice -- **AND** orientation detection happens in Stage 1 (preprocessing) before layout detection (Stage 3) diff --git a/openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/tasks.md b/openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/tasks.md deleted file mode 100644 index f355817..0000000 --- a/openspec/changes/archive/2025-12-11-enable-doc-orientation-detection/tasks.md +++ /dev/null @@ -1,71 +0,0 @@ -# Tasks - -## Phase 1: Enable Orientation Detection - -- [x] **Task 1.1**: Enable `use_doc_orientation_classify` in config - - File: `backend/app/core/config.py` - - Change: Set `use_doc_orientation_classify: bool = Field(default=True)` - - Update comment to reflect new behavior - -- [x] **Task 1.2**: Capture rotation info from PP-StructureV3 results - - File: `backend/app/services/pp_structure_enhanced.py` - - Extract `doc_preprocessor_res` from PP-StructureV3 output - - Parse `label_names` to get detected rotation angle - - Pass rotation angle to caller - -## Phase 2: Dimension Adjustment - -- [x] **Task 2.1**: Add rotation angle to OCR result - - File: `backend/app/services/ocr_service.py` - - Receive rotation angle from `analyze_layout()` - - Include `detected_rotation` in result dict - -- [x] **Task 2.2**: Adjust page dimensions based on rotation - - File: `backend/app/services/ocr_service.py` - - In `process_image()`, after getting `ocr_width, ocr_height` from PIL - - If `detected_rotation` is "90" or "270", swap dimensions - - Log dimension adjustment for debugging - -- [x] **Task 2.3**: Pass adjusted dimensions to UnifiedDocument - - File: `backend/app/services/ocr_to_unified_converter.py` - - Verified: `Page.dimensions` uses the adjusted width/height from `enhanced_results` - - No coordinate transformation needed (already based on rotated image) - -## Phase 3: Testing & Validation - -- [ ] **Task 3.1**: Test with portrait PDF containing landscape scan - - Verify output PDF is landscape - - Verify text is correctly oriented - - Verify text positioning is accurate - -- [ ] **Task 3.2**: Test with landscape PDF containing portrait scan - - Verify output PDF is portrait - - Verify text is correctly oriented - -- [ ] **Task 3.3**: Test with correctly oriented documents - - Verify no regression for normal documents - - Both portrait and landscape normal scans - -- [ ] **Task 3.4**: Test edge cases - - 180° rotated 
documents (upside down)
-  - Documents with mixed text orientations
-
-## Dependencies
-- Task 1.1 and 1.2 can be done in parallel
-- Task 2.1 depends on Task 1.2
-- Task 2.2 depends on Task 2.1
-- Task 2.3 depends on Task 2.2
-- All Phase 3 tasks depend on Phase 2 completion
-
-## Implementation Summary
-
-### Files Modified:
-1. `backend/app/core/config.py` - Enabled `use_doc_orientation_classify=True`
-2. `backend/app/services/pp_structure_enhanced.py` - Extract and return `detected_rotation`
-3. `backend/app/services/ocr_service.py` - Adjust dimensions and add rotation to the result
-
-### Key Changes:
-- PP-StructureV3 now detects document orientation (0°/90°/180°/270°)
-- When a 90° or 270° rotation is detected, page dimensions are swapped (width ↔ height)
-- `detected_rotation` is included in the OCR result for debugging/logging
-- Coordinates from PP-StructureV3 are already in the rotated coordinate space
diff --git a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/design.md b/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/design.md
deleted file mode 100644
index 50f92fb..0000000
--- a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/design.md
+++ /dev/null
@@ -1,88 +0,0 @@
-## Context
-
-The OCR Track processes documents with PP-StructureV3: PDFs are converted to PNG images (150 DPI) for OCR recognition, and the results are converted to the UnifiedDocument format before generating the output PDF.
-
-Current problems:
-1. Table HTML content is not extracted on the bbox-overlap matching path
-2. Coordinate scaling during PDF generation produces abnormal text sizes
-
-## Goals / Non-Goals
-
-**Goals:**
-- Fix table HTML extraction so every table has a correct `html` and `extracted_text`
-- Fix the coordinate-system handling in PDF generation so text sizes are correct
-- Leave the Direct Track and Hybrid Track unaffected
-
-**Non-Goals:**
-- No change to how PP-StructureV3 is invoked
-- No change to the UnifiedDocument data structure
-- No change to the frontend API
-
-## Decisions
-
-### Decision 1: Fix table HTML extraction
-
-**Location**: `pp_structure_enhanced.py` L527-534
-
-**Change**: when bbox-overlap matching succeeds, also extract `pred_html`:
-
-```python
-if best_match and best_overlap > 0.1:
-    cell_boxes = best_match['cell_box_list']
-    element['cell_boxes'] = [[float(c) for c in box] for box in cell_boxes]
-    element['cell_boxes_source'] = 'table_res_list'
-
-    # New: also extract pred_html
-    if not html_content and 'pred_html' in best_match:
-        html_content = best_match['pred_html']
-        element['html'] = html_content
-        element['extracted_text'] = self._extract_text_from_html(html_content)
-        logger.info(f"[TABLE] Extracted HTML from table_res_list (bbox match)")
-```
-
-### Decision 2: OCR Track PDF coordinate-system handling
-
-**Option A (recommended)**: the OCR Track uses the OCR coordinate-space dimensions as the PDF page size
-
-- The PDF page size directly reuses the OCR coordinate-space dimensions (e.g., 1275x1650 pixels → 1275x1650 pts)
-- No coordinate scaling is applied: scale_x = scale_y = 1.0
-- Font size comes directly from the bbox height, with no extra computation
-
-**Pros:**
-- Coordinate conversion is simple, with no precision loss
-- Font-size calculation is accurate
-- The PDF page aspect ratio matches the original document
-
-**Cons:**
-- The PDF pages are larger (roughly 2x Letter size)
-- Viewing may require zooming
-
-**Option B**: keep Letter size and improve the scaling computation
-
-- Keep the PDF page at 612x792 pts
-- Correctly compute the DPI conversion factor (72/150 = 0.48)
-- Ensure font sizes remain readable after scaling
-
-**Choice**: Option A, because it simplifies the implementation and avoids scaling-precision issues.
-
-### Decision 3: Adjust table quality assessment
-
-**Current problem**: `_check_cell_boxes_quality()` over-filters valid tables
-
-**Changes**:
-1. Raise the cell_density threshold (3.0 → 5.0 cells/10000px²)
-2. Lower the min_avg_cell_area threshold (3000 → 2000 px²)
-3. Add detailed logging that states which specific metric failed
-
-## Risks / Trade-offs
-
-- **Risk**: changing the coordinate system may affect the existing PDF output format
-- **Mitigation**: apply it to the OCR Track only; the Direct Track keeps its existing logic
-
-- **Risk**: relaxing the table quality assessment may render some genuinely low-quality tables
-- **Mitigation**: adjust thresholds incrementally and validate on test documents first
-
-## Open Questions
-
-1. Will the larger OCR Track PDF pages hurt the user experience?
-2. Should there be a configuration option that lets users choose the PDF output size?
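-
-As a concrete reading of Decision 2, Option A (a sketch only; the function name and track labels are illustrative, not the repo's actual API):
-
-```python
-def page_size_for_track(track: str, ocr_size: tuple, source_size: tuple) -> tuple:
-    """Pick the PDF page size: the OCR Track maps OCR pixels 1:1 to points."""
-    if track == "ocr":
-        # e.g. a 1275x1650 px render becomes a 1275x1650 pt page, so
-        # scale_x = scale_y = 1.0 and bbox heights map directly to font sizes
-        return ocr_size
-    # Direct Track keeps the original PDF page dimensions
-    return source_size
-```
-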
diff --git a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/proposal.md b/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/proposal.md
deleted file mode 100644
index b89f75d..0000000
--- a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/proposal.md
+++ /dev/null
@@ -1,17 +0,0 @@
-# Change: Fix OCR Track Table Rendering and Text Sizing
-
-## Why
-PDFs produced by OCR Track processing have two main problems:
-1. **Table content disappears**: PP-StructureV3 correctly returns `table_res_list` (containing `pred_html` and `cell_box_list`), but when `pp_structure_enhanced.py` matches via bbox overlap it extracts only `cell_boxes` and not `pred_html`, so the table's HTML content ends up empty.
-2. **Inconsistent text sizes**: the 0.48 scale factor between the OCR coordinate space (1275x1650 pixels) and the PDF output size (612x792 pts) makes font-size calculation inaccurate; text comes out too small or unevenly sized.
-
-## What Changes
-- Fix the HTML extraction logic on the bbox-overlap matching path in `pp_structure_enhanced.py`
-- Improve the OCR Track coordinate-system handling in `pdf_generator_service.py` by using the OCR coordinate-space dimensions as the PDF output size
-- Adjust the decision logic of `_check_cell_boxes_quality()` to avoid over-filtering valid tables
-
-## Impact
-- Affected specs: `ocr-processing`
-- Affected code:
-  - `backend/app/services/pp_structure_enhanced.py` - table HTML extraction logic
-  - `backend/app/services/pdf_generator_service.py` - PDF generation coordinate handling
diff --git a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/specs/ocr-processing/spec.md
deleted file mode 100644
index 54234b0..0000000
--- a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/specs/ocr-processing/spec.md
+++ /dev/null
@@ -1,91 +0,0 @@
-## MODIFIED Requirements
-
-### Requirement: Enhanced OCR with Full PP-StructureV3
-
-The system SHALL utilize the full capabilities of PP-StructureV3, extracting all element types from parsing_res_list, with proper handling of visual elements and table coordinates.
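-
-Several scenarios below require a plain-text rendering of `pred_html`. A minimal sketch in the spirit of the `_extract_text_from_html` helper the design references (the real implementation may differ):
-
-```python
-from html.parser import HTMLParser
-
-class _TextCollector(HTMLParser):
-    """Collect the text nodes of a table HTML fragment."""
-    def __init__(self):
-        super().__init__()
-        self.parts = []
-
-    def handle_data(self, data):
-        text = data.strip()
-        if text:
-            self.parts.append(text)
-
-def extract_text_from_html(html: str) -> str:
-    """Flatten table HTML into space-separated cell text for translation."""
-    collector = _TextCollector()
-    collector.feed(html)
-    return " ".join(collector.parts)
-```
-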
- -#### Scenario: Extract comprehensive document structure -- **WHEN** processing through OCR track -- **THEN** the system SHALL use page_result.json['parsing_res_list'] -- **AND** extract all element types including headers, lists, tables, figures -- **AND** preserve layout_bbox coordinates for each element - -#### Scenario: Maintain reading order -- **WHEN** extracting elements from PP-StructureV3 -- **THEN** the system SHALL preserve the reading order from parsing_res_list -- **AND** assign sequential indices to elements -- **AND** support reordering for complex layouts - -#### Scenario: Extract table structure with HTML content -- **WHEN** PP-StructureV3 identifies a table -- **THEN** the system SHALL extract cell content and boundaries from table_res_list -- **AND** extract pred_html for table HTML content -- **AND** validate cell_boxes coordinates against page boundaries -- **AND** apply fallback detection for invalid coordinates -- **AND** preserve table HTML for structure -- **AND** extract plain text for translation - -#### Scenario: Table matching via bbox overlap -- **GIVEN** a table element from parsing_res_list without direct HTML content -- **WHEN** matching against table_res_list using bbox overlap -- **AND** overlap ratio exceeds 10% -- **THEN** the system SHALL extract both cell_box_list and pred_html from the matched table_res -- **AND** set element['html'] to the extracted pred_html -- **AND** set element['extracted_text'] from the HTML content -- **AND** log the successful extraction - -#### Scenario: Extract visual elements with paths -- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM) -- **THEN** the system SHALL preserve saved_path for each element -- **AND** include image dimensions and format -- **AND** enable image embedding in output PDF - -## ADDED Requirements - -### Requirement: OCR Track PDF Coordinate System - -The system SHALL generate PDF output for OCR Track using the OCR coordinate system dimensions to ensure accurate text sizing and positioning. - -#### Scenario: PDF page size matches OCR coordinate system -- **GIVEN** an OCR track processing task -- **WHEN** generating the output PDF -- **THEN** the system SHALL use the OCR image dimensions as PDF page size -- **AND** set scale factors to 1.0 (no scaling) -- **AND** preserve original bbox coordinates without transformation - -#### Scenario: Text font size calculation without scaling -- **GIVEN** a text element with bbox height H in OCR coordinates -- **WHEN** rendering text in PDF -- **THEN** the system SHALL calculate font size based directly on bbox height -- **AND** NOT apply additional scaling factors -- **AND** ensure readable text output - -#### Scenario: Direct Track PDF maintains original size -- **GIVEN** a direct track processing task -- **WHEN** generating the output PDF -- **THEN** the system SHALL use the original PDF page dimensions -- **AND** preserve existing coordinate transformation logic -- **AND** NOT be affected by OCR Track coordinate changes - -### Requirement: Table Cell Quality Assessment - -The system SHALL assess table cell_boxes quality with appropriate thresholds to avoid filtering valid tables. 
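-
-The thresholds in the scenarios below (5.0 cells per 10,000 px² and 2,000 px² average cell area) reduce to a short check; this sketch uses an illustrative function name, not the actual `_check_cell_boxes_quality()` signature:
-
-```python
-def cell_boxes_look_valid(cell_boxes, table_area: float) -> bool:
-    """Return True when cell_boxes do not look over-detected."""
-    if not cell_boxes or table_area <= 0:
-        return False
-    areas = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in cell_boxes]
-    density = len(cell_boxes) / table_area * 10_000  # cells per 10,000 px²
-    avg_area = sum(areas) / len(areas)
-    return density <= 5.0 and avg_area >= 2_000
-```
-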
- -#### Scenario: Cell density threshold -- **GIVEN** a table with cell_boxes from PP-StructureV3 -- **WHEN** cell density exceeds 5.0 cells per 10,000 px² -- **THEN** the system SHALL flag the table as potentially over-detected -- **AND** log the specific density value for debugging - -#### Scenario: Average cell area threshold -- **GIVEN** a table with cell_boxes -- **WHEN** average cell area is less than 2,000 px² -- **THEN** the system SHALL flag the table as potentially over-detected -- **AND** log the specific area value for debugging - -#### Scenario: Valid tables with normal metrics -- **GIVEN** a table with density < 5.0 cells/10000px² and avg area > 2000px² -- **WHEN** quality assessment is applied -- **THEN** the table SHALL be considered valid -- **AND** cell_boxes SHALL be used for rendering -- **AND** table content SHALL be displayed in PDF output diff --git a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/tasks.md b/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/tasks.md deleted file mode 100644 index e0a8b0d..0000000 --- a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/tasks.md +++ /dev/null @@ -1,34 +0,0 @@ -## 1. Fix Table HTML Extraction - -### 1.1 pp_structure_enhanced.py -- [x] 1.1.1 在 bbox overlap 匹配時(L527-534)添加 `pred_html` 提取邏輯 -- [x] 1.1.2 確保 `element['html']` 在所有匹配路徑都被正確設置 -- [x] 1.1.3 添加 `extracted_text` 從 HTML 提取純文字內容 -- [x] 1.1.4 添加日誌記錄 HTML 提取狀態 - -## 2. Fix PDF Coordinate System - -### 2.1 pdf_generator_service.py -- [x] 2.1.1 對於 OCR Track,使用 OCR 座標系尺寸 (如 1275x1650) 作為 PDF 頁面尺寸 -- [x] 2.1.2 修改 `_get_page_size_for_track()` 方法區分 OCR/Direct track -- [x] 2.1.3 調整字體大小計算,避免因縮放導致文字過小 -- [x] 2.1.4 確保座標轉換在 OCR Track 時不進行額外縮放 - -## 3. Improve Table Cell Quality Check - -### 3.1 pdf_generator_service.py -- [x] 3.1.1 審查 `_check_cell_boxes_quality()` 判定條件 -- [x] 3.1.2 放寬或調整判定閾值,避免過度過濾有效表格 (overlap threshold 10% → 25%) -- [x] 3.1.3 添加更詳細的日誌說明為何表格被判定為 "bad quality" - -### 3.2 Fix Table Content Rendering -- [x] 3.2.1 發現問題:`_draw_table_with_cell_boxes` 只渲染邊框,不渲染文字內容 -- [x] 3.2.2 添加 `cell_boxes_rendered` flag 追蹤邊框是否已渲染 -- [x] 3.2.3 修改邏輯:cell_boxes 渲染邊框後繼續使用 ReportLab Table 渲染文字 -- [x] 3.2.4 條件性跳過 GRID style 當 cell_boxes 已渲染邊框時 - -## 4. Testing -- [x] 4.1 使用 edit.pdf 測試修復後的 OCR Track 處理 -- [x] 4.2 驗證表格 HTML 正確提取並渲染 -- [x] 4.3 驗證文字大小一致且清晰可讀 -- [ ] 4.4 確認其他文件類型不受影響 diff --git a/openspec/changes/archive/2025-12-11-fix-table-column-alignment/design.md b/openspec/changes/archive/2025-12-11-fix-table-column-alignment/design.md deleted file mode 100644 index 4102c8f..0000000 --- a/openspec/changes/archive/2025-12-11-fix-table-column-alignment/design.md +++ /dev/null @@ -1,227 +0,0 @@ -# Design: Table Column Alignment Correction - -## Context - -PP-Structure v3's table structure recognition model outputs HTML with row/col attributes inferred from visual patterns. However, the model frequently assigns incorrect column indices, especially for: -- Tables with unclear left borders -- Cells containing vertical Chinese text -- Complex merged cells - -This design introduces a **post-processing correction layer** that validates and fixes column assignments using geometric coordinates. 
-
-## Goals / Non-Goals
-
-**Goals:**
-- Correct column shift errors without modifying the PP-Structure model
-- Use the header row as the authoritative column reference
-- Merge fragmented vertical text into proper cells
-- Maintain backward compatibility with the existing pipeline
-
-**Non-Goals:**
-- Training new OCR/structure models
-- Modifying PP-Structure's internal behavior
-- Handling tables without clear headers (future enhancement)
-
-## Architecture
-
-```
-PP-Structure Output
-        │
-        ▼
-┌───────────────────┐
-│ Table Column      │
-│ Corrector         │
-│ (new middleware)  │
-├───────────────────┤
-│ 1. Extract header │
-│    column ranges  │
-│ 2. Validate cells │
-│ 3. Correct col    │
-│    assignments    │
-└───────────────────┘
-        │
-        ▼
-  PDF Generator
-```
-
-## Decisions
-
-### Decision 1: Header-Anchor Algorithm
-
-**Approach:** Use first-row (row_idx=0) cells as column anchors.
-
-**Algorithm:**
-```python
-def build_column_anchors(header_cells: List[Cell]) -> List[ColumnAnchor]:
-    """
-    Extract X-coordinate ranges from the header row to define column boundaries.
-
-    Returns:
-        List of ColumnAnchor(col_idx, x_min, x_max)
-    """
-    anchors = []
-    for cell in header_cells:
-        anchors.append(ColumnAnchor(
-            col_idx=cell.col_idx,
-            x_min=cell.bbox.x0,
-            x_max=cell.bbox.x1
-        ))
-    return sorted(anchors, key=lambda a: a.x_min)
-
-
-def correct_column(cell: Cell, anchors: List[ColumnAnchor]) -> int:
-    """
-    Find the correct column index based on X-coordinate overlap.
-
-    Strategy:
-    1. Calculate overlap with each column anchor
-    2. If overlap > 50% with a different column, correct it
-    3. If no overlap, find the nearest column by center point
-    """
-    cell_center_x = (cell.bbox.x0 + cell.bbox.x1) / 2
-
-    # Find the best matching anchor
-    best_anchor = None
-    best_overlap = 0
-
-    for anchor in anchors:
-        overlap = calculate_x_overlap(cell.bbox, anchor)
-        if overlap > best_overlap:
-            best_overlap = overlap
-            best_anchor = anchor
-
-    # If significant overlap with a different column, correct it
-    if best_anchor and best_overlap > 0.5:
-        if best_anchor.col_idx != cell.col_idx:
-            logger.info(f"Correcting cell col {cell.col_idx} -> {best_anchor.col_idx}")
-            return best_anchor.col_idx
-
-    # No overlap at all: fall back to the nearest column by center point (step 3)
-    if best_anchor is None and anchors:
-        nearest = min(anchors, key=lambda a: abs((a.x_min + a.x_max) / 2 - cell_center_x))
-        return nearest.col_idx
-
-    return cell.col_idx
-```
-
-**Why this approach:**
-- Headers are typically the most accurately recognized row
-- X-coordinates are objective measurements, not semantic inference
-- Simple O(n*m) complexity (n cells, m columns)
-
-### Decision 2: Vertical Fragment Merging
-
-**Detection criteria for vertical text fragments:**
-1. Width << Height (aspect ratio < 0.3)
-2. Located in the leftmost 15% of the table
-3. X-center deviation < 10px between consecutive blocks
-4. Y-gap < 20px (adjacent in the vertical direction)
-
-**Merge strategy:**
-```python
-def merge_vertical_fragments(blocks: List[TextBlock], table_bbox: BBox) -> List[TextBlock]:
-    """
-    Merge vertically stacked narrow text blocks into single blocks.
- """ - # Filter candidates: narrow blocks in left margin - left_boundary = table_bbox.x0 + (table_bbox.width * 0.15) - candidates = [b for b in blocks - if b.width < b.height * 0.3 - and b.center_x < left_boundary] - - # Sort by Y position - candidates.sort(key=lambda b: b.y0) - - # Merge adjacent blocks - merged = [] - current_group = [] - - for block in candidates: - if not current_group: - current_group.append(block) - elif should_merge(current_group[-1], block): - current_group.append(block) - else: - merged.append(merge_group(current_group)) - current_group = [block] - - if current_group: - merged.append(merge_group(current_group)) - - return merged -``` - -### Decision 3: Data Sources - -**Primary source:** `cell_boxes` from PP-Structure -- Contains accurate geometric coordinates for each detected cell -- Independent of HTML structure recognition - -**Secondary source:** HTML content with row/col attributes -- Contains text content and structure -- May have incorrect col assignments (the problem we're fixing) - -**Correlation:** Match HTML cells to cell_boxes using IoU (Intersection over Union): -```python -def match_html_cell_to_cellbox(html_cell: HtmlCell, cell_boxes: List[BBox]) -> Optional[BBox]: - """Find the cell_box that best matches this HTML cell's position.""" - best_iou = 0 - best_box = None - - for box in cell_boxes: - iou = calculate_iou(html_cell.inferred_bbox, box) - if iou > best_iou: - best_iou = iou - best_box = box - - return best_box if best_iou > 0.3 else None -``` - -## Configuration - -```python -# config.py additions -table_column_correction_enabled: bool = Field( - default=True, - description="Enable header-anchor column correction" -) -table_column_correction_threshold: float = Field( - default=0.5, - description="Minimum X-overlap ratio to trigger column correction" -) -vertical_fragment_merge_enabled: bool = Field( - default=True, - description="Enable vertical text fragment merging" -) -vertical_fragment_aspect_ratio: float = Field( - default=0.3, - description="Max width/height ratio to consider as vertical text" -) -``` - -## Risks / Trade-offs - -| Risk | Mitigation | -|------|------------| -| Headers themselves misaligned | Fall back to original column assignments | -| Multi-row headers | Support colspan detection in header extraction | -| Tables without headers | Skip correction, use original structure | -| Performance overhead | O(n*m) is negligible for typical table sizes | - -## Integration Points - -1. **Input:** PP-Structure's `table_res` containing: - - `cell_boxes`: List of [x0, y0, x1, y1] coordinates - - `html`: Table HTML with row/col attributes - -2. **Output:** Corrected table structure with: - - Updated col indices in HTML cells - - Merged vertical text blocks - - Diagnostic logs for corrections made - -3. **Trigger location:** After PP-Structure table recognition, before PDF generation - - File: `pdf_generator_service.py` - - Method: `draw_table_region()` or new preprocessing step - -## Open Questions - -1. **Q:** How to handle tables where header row itself is misaligned? - **A:** Could add a secondary validation using cell_boxes grid inference, but start simple. - -2. **Q:** Should corrections be logged for user review? - **A:** Yes, add detailed logging with before/after column indices. 
diff --git a/openspec/changes/archive/2025-12-11-fix-table-column-alignment/proposal.md b/openspec/changes/archive/2025-12-11-fix-table-column-alignment/proposal.md deleted file mode 100644 index 961085a..0000000 --- a/openspec/changes/archive/2025-12-11-fix-table-column-alignment/proposal.md +++ /dev/null @@ -1,56 +0,0 @@ -# Change: Fix Table Column Alignment with Header-Anchor Correction - -## Why - -PP-Structure's table structure recognition frequently outputs cells with incorrect column indices, causing "column shift" where content appears in the wrong column. This happens because: - -1. **Semantic over Geometric**: The model infers row/col from semantic patterns rather than physical coordinates -2. **Vertical text fragmentation**: Chinese vertical text (e.g., "报价内容") gets split into fragments -3. **Missing left boundary**: When table's left border is unclear, cells shift left incorrectly - -The result: A cell with X-coordinate 213 gets assigned to column 0 (range 96-162) instead of column 1 (range 204-313). - -## What Changes - -- **Add Header-Anchor Alignment**: Use the first row (header) X-coordinates as column reference points -- **Add Coordinate-Based Column Correction**: Validate and correct cell column assignments based on X-coordinate overlap with header columns -- **Add Vertical Fragment Merging**: Detect and merge vertically stacked narrow text blocks that represent vertical text -- **Add Configuration Options**: Enable/disable correction features independently - -## Impact - -- Affected specs: `document-processing` -- Affected code: - - `backend/app/services/table_column_corrector.py` (new) - - `backend/app/services/pdf_generator_service.py` - - `backend/app/core/config.py` - -## Problem Analysis - -### Example: scan.pdf Table 7 - -**Raw PP-Structure Output:** -``` -Row 5: "3、適應產品..." at X=213 - Model says: col=0 - -Header Row 0: - - Column 0 (序號): X range [96, 162] - - Column 1 (產品名稱): X range [204, 313] -``` - -**Problem:** X=213 is far outside column 0's range (max 162), but perfectly within column 1's range (starts at 204). - -**Solution:** Force-correct col=0 → col=1 based on X-coordinate alignment with header. - -### Vertical Text Issue - -**Raw OCR:** -``` -Block A: "报价内" at X≈100, Y=[100, 200] -Block B: "容--" at X≈102, Y=[200, 300] -``` - -**Problem:** These should be one cell spanning multiple rows, but appear as separate fragments. - -**Solution:** Merge vertically aligned narrow blocks before structure recognition. diff --git a/openspec/changes/archive/2025-12-11-fix-table-column-alignment/specs/document-processing/spec.md b/openspec/changes/archive/2025-12-11-fix-table-column-alignment/specs/document-processing/spec.md deleted file mode 100644 index 200359d..0000000 --- a/openspec/changes/archive/2025-12-11-fix-table-column-alignment/specs/document-processing/spec.md +++ /dev/null @@ -1,59 +0,0 @@ -## ADDED Requirements - -### Requirement: Table Column Alignment Correction -The system SHALL correct table cell column assignments using header-anchor alignment when PP-Structure outputs incorrect column indices. 
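-
-Plugging the proposal's scan.pdf numbers into the design's sketch helpers shows the intended correction (the `Cell` and `BBox` shapes here are hypothetical):
-
-```python
-# Header anchors from the proposal: 序號 spans X [96, 162], 產品名稱 spans X [204, 313]
-anchors = [ColumnAnchor(0, 96, 162), ColumnAnchor(1, 204, 313)]
-
-# The misassigned cell starts at X=213: zero overlap with column 0,
-# near-total overlap with column 1, so correct_column() returns 1.
-cell = Cell(col_idx=0, bbox=BBox(x0=213, y0=500, x1=310, y1=530))
-assert correct_column(cell, anchors) == 1
-```
-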
- -#### Scenario: Correct column shift using header anchors -- **WHEN** processing a table with cell_boxes and HTML content -- **THEN** the system SHALL extract header row (row_idx=0) column X-coordinate ranges -- **AND** validate each cell's column assignment against header X-ranges -- **AND** correct column index if cell X-overlap with assigned column is < 50% -- **AND** assign cell to column with highest X-overlap - -#### Scenario: Handle tables without headers -- **WHEN** processing a table without a clear header row -- **THEN** the system SHALL skip column correction -- **AND** use original PP-Structure column assignments -- **AND** log that header-anchor correction was skipped - -#### Scenario: Log column corrections -- **WHEN** a cell's column index is corrected -- **THEN** the system SHALL log original and corrected column indices -- **AND** include cell content snippet for debugging -- **AND** record total corrections per table - -### Requirement: Vertical Text Fragment Merging -The system SHALL detect and merge vertically fragmented Chinese text blocks that represent single cells spanning multiple rows. - -#### Scenario: Detect vertical text fragments -- **WHEN** processing table text regions -- **THEN** the system SHALL identify narrow text blocks (width/height ratio < 0.3) -- **AND** filter blocks in leftmost 15% of table area -- **AND** group vertically adjacent blocks with X-center deviation < 10px - -#### Scenario: Merge fragmented vertical text -- **WHEN** vertical text fragments are detected -- **THEN** the system SHALL merge adjacent fragments into single text blocks -- **AND** combine text content preserving reading order -- **AND** calculate merged bounding box spanning all fragments -- **AND** treat merged block as single cell for column assignment - -#### Scenario: Preserve non-vertical text -- **WHEN** text blocks do not meet vertical fragment criteria -- **THEN** the system SHALL preserve original text block boundaries -- **AND** process normally without merging - -## MODIFIED Requirements - -### Requirement: Extract table structure -The system SHALL extract cell content and boundaries from PP-StructureV3 tables, with post-processing correction for column alignment errors. - -#### Scenario: Extract table structure with correction -- **WHEN** PP-StructureV3 identifies a table -- **THEN** the system SHALL extract cell content and boundaries -- **AND** validate cell_boxes coordinates against page boundaries -- **AND** apply header-anchor column correction when enabled -- **AND** merge vertical text fragments when enabled -- **AND** apply fallback detection for invalid coordinates -- **AND** preserve table HTML for structure -- **AND** extract plain text for translation diff --git a/openspec/changes/archive/2025-12-11-fix-table-column-alignment/tasks.md b/openspec/changes/archive/2025-12-11-fix-table-column-alignment/tasks.md deleted file mode 100644 index b7b6605..0000000 --- a/openspec/changes/archive/2025-12-11-fix-table-column-alignment/tasks.md +++ /dev/null @@ -1,59 +0,0 @@ -## 1. 
Core Algorithm Implementation
-
-### 1.1 Table Column Corrector Module
-- [x] 1.1.1 Create `table_column_corrector.py` service file
-- [x] 1.1.2 Implement `ColumnAnchor` dataclass for header column ranges
-- [x] 1.1.3 Implement `build_column_anchors()` to extract header column X-ranges
-- [x] 1.1.4 Implement `calculate_x_overlap()` utility function
-- [x] 1.1.5 Implement `correct_cell_column()` for single cell correction
-- [x] 1.1.6 Implement `correct_table_columns()` main entry point
-
-### 1.2 HTML Cell Extraction
-- [x] 1.2.1 Implement `parse_table_html_with_positions()` to extract cells with row/col
-- [x] 1.2.2 Implement cell-to-cellbox matching using IoU
-- [x] 1.2.3 Handle colspan/rowspan in header detection
-
-### 1.3 Vertical Fragment Merging
-- [x] 1.3.1 Implement `detect_vertical_fragments()` to find narrow text blocks
-- [x] 1.3.2 Implement `should_merge_blocks()` adjacency check
-- [x] 1.3.3 Implement `merge_vertical_fragments()` main function
-- [x] 1.3.4 Integrate merged blocks back into table structure
-
-## 2. Configuration
-
-### 2.1 Settings
-- [x] 2.1.1 Add `table_column_correction_enabled: bool = True`
-- [x] 2.1.2 Add `table_column_correction_threshold: float = 0.5`
-- [x] 2.1.3 Add `vertical_fragment_merge_enabled: bool = True`
-- [x] 2.1.4 Add `vertical_fragment_aspect_ratio: float = 0.3`
-
-## 3. Integration
-
-### 3.1 Pipeline Integration
-- [x] 3.1.1 Add correction step in `pdf_generator_service.py` before table rendering
-- [x] 3.1.2 Pass corrected HTML to existing table rendering logic
-- [x] 3.1.3 Add diagnostic logging for corrections made
-
-### 3.2 Error Handling
-- [x] 3.2.1 Handle tables without headers gracefully
-- [x] 3.2.2 Handle empty/malformed cell_boxes
-- [x] 3.2.3 Fallback to original structure on correction failure
-
-## 4. Testing
-
-### 4.1 Unit Tests
-- [ ] 4.1.1 Test `build_column_anchors()` with various header configurations
-- [ ] 4.1.2 Test `correct_cell_column()` with known column shift cases
-- [ ] 4.1.3 Test `merge_vertical_fragments()` with vertical text samples
-- [ ] 4.1.4 Test edge cases: empty tables, single column, no headers
-
-### 4.2 Integration Tests
-- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case)
-- [ ] 4.2.2 Test with tables that have correct alignment (no regression)
-- [ ] 4.2.3 Visual comparison of corrected vs original output
-
-## 5. Documentation
-
-- [x] 5.1 Add inline code comments explaining correction algorithm
-- [x] 5.2 Update spec with new table column correction requirement
-- [x] 5.3 Add logging messages for debugging
diff --git a/openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/proposal.md b/openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/proposal.md
deleted file mode 100644
index 6388c82..0000000
--- a/openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/proposal.md
+++ /dev/null
@@ -1,49 +0,0 @@
-# Change: Improve OCR Track Algorithm Based on PP-StructureV3 Best Practices
-
-## Why
-
-The OCR Track's current Gap Filling algorithm uses **IoU (Intersection over Union)** to decide whether an OCR text box is covered by a layout region. Per the recommendation in the official PaddleX documentation (paddle_review.md), it should use **IoA (Intersection over Area)** instead, which correctly captures the asymmetric "is the small box contained in the large box" relationship. In addition, the current implementation applies a single uniform threshold to all element types, but different types call for different threshold strategies.
-
-## What Changes
-
-1. **IoU → IoA algorithm change**: switch the coverage test in `gap_filling_service.py` from IoU to IoA
-2. **Dynamic threshold strategy**: use different IoA thresholds per element type (TEXT, TABLE, FIGURE)
-3. **Use PP-StructureV3's built-in OCR**: use `overall_ocr_res` instead of running a separate Raw OCR pass, saving inference time and keeping coordinates consistent
-4.
**Boundary shrinking**: shrink OCR boxes inward by 1-2 px to avoid duplicate rendering at the edges
-
-## Impact
-
-- Affected specs: `ocr-processing`
-- Affected code:
-  - `backend/app/services/gap_filling_service.py` - core algorithm change
-  - `backend/app/services/ocr_service.py` - switch to `overall_ocr_res`
-  - `backend/app/services/processing_orchestrator.py` - adjust the OCR data source
-  - `backend/app/core/config.py` - add per-element-type threshold settings
-
-## Technical Details
-
-### 1. IoA vs IoU
-
-```
-IoU = intersection area / union area   (symmetric; tests whether two boxes refer to the same object)
-IoA = intersection area / OCR box area (asymmetric; tests whether a small box is contained in a large box)
-```
-
-When a layout box is much larger than an OCR box, IoU becomes too small and the region is misjudged as "uncovered".
-
-### 2. Recommended dynamic thresholds
-
-| Element type | IoA threshold | Notes |
-|--------------|---------------|-------|
-| TEXT/TITLE | 0.6 | Tolerates boundary error |
-| TABLE | 0.1 | Strict filtering to avoid breaking table structure |
-| FIGURE | 0.8 | Preserves in-figure text (e.g., axis labels) |
-
-### 3. overall_ocr_res validation results
-
-Confirmed that PP-StructureV3's `json['res']['overall_ocr_res']` contains:
-- `dt_polys`: detection box coordinates (polygon format)
-- `rec_texts`: recognized text
-- `rec_scores`: recognition confidence
-
-Testing shows the same number of results as a separately executed Raw OCR pass (59 regions), so the replacement is safe.
diff --git a/openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/specs/ocr-processing/spec.md
deleted file mode 100644
index 332c65e..0000000
--- a/openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/specs/ocr-processing/spec.md
+++ /dev/null
@@ -1,142 +0,0 @@
-## MODIFIED Requirements
-
-### Requirement: OCR Track Gap Filling with Raw OCR Regions
-
-The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.
-
-#### Scenario: Gap filling activates when coverage is low
-- **GIVEN** an OCR track processing task
-- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
-- **THEN** the system SHALL activate gap filling
-- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
-- **AND** supplement these regions as TEXT elements in the output
-
-#### Scenario: Coverage is determined by IoA (Intersection over Area)
-- **GIVEN** a Raw OCR text region with bounding box
-- **WHEN** checking if the region is covered by PP-StructureV3
-- **THEN** the region SHALL be considered covered if IoA (intersection area / OCR box area) exceeds the type-specific threshold
-- **AND** IoA SHALL be used instead of IoU because it correctly measures the "small box contained in large box" relationship
-- **AND** regions not meeting the IoA criterion SHALL be marked as uncovered
-
-#### Scenario: Element-type-specific IoA thresholds are applied
-- **GIVEN** a Raw OCR region being evaluated for coverage
-- **WHEN** comparing against PP-StructureV3 elements of different types
-- **THEN** the system SHALL apply different IoA thresholds:
-  - TEXT, TITLE, HEADER, FOOTER: IoA > 0.6 (tolerates boundary errors)
-  - TABLE: IoA > 0.1 (strict filtering to preserve table structure)
-  - FIGURE, IMAGE: IoA > 0.8 (preserves text within figures like axis labels)
-- **AND** a region is considered covered if it meets the threshold for ANY overlapping element
-
-#### Scenario: Only TEXT elements are supplemented
-- **GIVEN** uncovered Raw OCR regions identified for supplementation
-- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
-- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
-- **AND** only supplement regions as TEXT type to preserve structural integrity
-
-#### Scenario: Supplemented regions meet confidence threshold
-- **GIVEN** Raw OCR regions to be supplemented
-- **WHEN** a region
has confidence score below 0.3 -- **THEN** the system SHALL skip that region -- **AND** only supplement regions with confidence >= 0.3 - -#### Scenario: Deduplication uses IoA instead of IoU -- **GIVEN** a Raw OCR region being considered for supplementation -- **WHEN** the region has IoA > 0.5 with any existing PP-StructureV3 TEXT element -- **THEN** the system SHALL skip that region to prevent duplicate text -- **AND** the original PP-StructureV3 element SHALL be preserved - -#### Scenario: Reading order is recalculated after gap filling -- **GIVEN** supplemented elements have been added to the page -- **WHEN** assembling the final element list -- **THEN** the system SHALL recalculate reading order for the entire page -- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right) -- **AND** ensure logical document flow is maintained - -#### Scenario: Coordinate alignment with ocr_dimensions -- **GIVEN** Raw OCR processing may involve image resizing -- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox -- **THEN** the system SHALL use ocr_dimensions to normalize coordinates -- **AND** ensure both sources reference the same coordinate space -- **AND** prevent coverage misdetection due to scale differences - -#### Scenario: Supplemented elements have complete metadata -- **GIVEN** a Raw OCR region being added as supplemented element -- **WHEN** creating the DocumentElement -- **THEN** the element SHALL include page_number -- **AND** include confidence score from Raw OCR -- **AND** include original bbox coordinates -- **AND** optionally include source indicator for debugging - -### Requirement: Gap Filling Configuration - -The system SHALL provide configurable parameters for gap filling behavior. - -#### Scenario: Gap filling can be disabled via configuration -- **GIVEN** gap_filling_enabled is set to false in configuration -- **WHEN** OCR track processing runs -- **THEN** the system SHALL skip all gap filling logic -- **AND** output only PP-StructureV3 results as before - -#### Scenario: Coverage threshold is configurable -- **GIVEN** gap_filling_coverage_threshold is set to 0.8 -- **WHEN** PP-StructureV3 coverage is 75% -- **THEN** the system SHALL activate gap filling -- **AND** supplement uncovered regions - -#### Scenario: IoA thresholds are configurable per element type -- **GIVEN** custom IoA thresholds configured: - - gap_filling_ioa_threshold_text: 0.6 - - gap_filling_ioa_threshold_table: 0.1 - - gap_filling_ioa_threshold_figure: 0.8 - - gap_filling_dedup_ioa_threshold: 0.5 -- **WHEN** evaluating coverage and deduplication -- **THEN** the system SHALL use the configured values -- **AND** apply them consistently throughout gap filling process - -#### Scenario: Confidence threshold is configurable -- **GIVEN** gap_filling_confidence_threshold is set to 0.5 -- **WHEN** supplementing Raw OCR regions -- **THEN** the system SHALL only include regions with confidence >= 0.5 -- **AND** filter out lower confidence regions - -#### Scenario: Boundary shrinking reduces edge duplicates -- **GIVEN** gap_filling_shrink_pixels is set to 1 -- **WHEN** evaluating coverage with IoA -- **THEN** the system SHALL shrink OCR bounding boxes inward by 1 pixel on each side -- **AND** this reduces false "uncovered" detection at region boundaries - -## ADDED Requirements - -### Requirement: Use PP-StructureV3 Internal OCR Results - -The system SHALL preferentially use PP-StructureV3's internal OCR results (`overall_ocr_res`) instead of running a separate Raw OCR inference. 
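-
-Using the keys validated in the proposal (`dt_polys`, `rec_texts`, `rec_scores`), the conversion the scenarios below describe could look like this sketch (the output region shape is illustrative):
-
-```python
-def regions_from_overall_ocr(res: dict) -> list:
-    """Convert overall_ocr_res into text regions usable by gap filling."""
-    ocr = res.get("overall_ocr_res") or {}
-    regions = []
-    for poly, text, score in zip(ocr.get("dt_polys", []),
-                                 ocr.get("rec_texts", []),
-                                 ocr.get("rec_scores", [])):
-        xs = [p[0] for p in poly]
-        ys = [p[1] for p in poly]
-        regions.append({
-            "text": text,
-            "bbox": (min(xs), min(ys), max(xs), max(ys)),
-            "confidence": float(score),
-        })
-    return regions
-```
-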
- -#### Scenario: Extract overall_ocr_res from PP-StructureV3 -- **GIVEN** PP-StructureV3 processing completes -- **WHEN** the result contains `json['res']['overall_ocr_res']` -- **THEN** the system SHALL extract OCR regions from: - - `dt_polys`: detection box polygons - - `rec_texts`: recognized text strings - - `rec_scores`: confidence scores -- **AND** convert these to the standard TextRegion format for gap filling - -#### Scenario: Skip separate Raw OCR when overall_ocr_res is available -- **GIVEN** gap_filling_use_overall_ocr is true (default) -- **WHEN** PP-StructureV3 result contains overall_ocr_res -- **THEN** the system SHALL NOT execute separate PaddleOCR inference -- **AND** use the extracted overall_ocr_res as the OCR source -- **AND** this reduces total inference time by approximately 50% - -#### Scenario: Fallback to separate Raw OCR when needed -- **GIVEN** gap_filling_use_overall_ocr is false OR overall_ocr_res is missing -- **WHEN** gap filling is activated -- **THEN** the system SHALL execute separate PaddleOCR inference as before -- **AND** use the separate OCR results for gap filling -- **AND** this maintains backward compatibility - -#### Scenario: Coordinate consistency is guaranteed -- **GIVEN** overall_ocr_res is extracted from PP-StructureV3 -- **WHEN** comparing with PP-StructureV3 layout elements -- **THEN** both SHALL use the same coordinate system -- **AND** no additional coordinate alignment is needed -- **AND** this prevents scale mismatch issues diff --git a/openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/tasks.md b/openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/tasks.md deleted file mode 100644 index f58d543..0000000 --- a/openspec/changes/archive/2025-12-11-improve-ocr-track-algorithm/tasks.md +++ /dev/null @@ -1,54 +0,0 @@ -## 1. Algorithm Changes (gap_filling_service.py) - -### 1.1 IoA Implementation -- [x] 1.1.1 Add `_calculate_ioa()` method alongside existing `_calculate_iou()` -- [x] 1.1.2 Modify `_is_region_covered()` to use IoA instead of IoU -- [x] 1.1.3 Update deduplication logic to use IoA - -### 1.2 Dynamic Threshold Strategy -- [x] 1.2.1 Add element-type-specific thresholds as class constants -- [x] 1.2.2 Modify `_is_region_covered()` to accept element type parameter -- [x] 1.2.3 Apply different thresholds based on element type (TEXT: 0.6, TABLE: 0.1, FIGURE: 0.8) - -### 1.3 Boundary Shrinking -- [x] 1.3.1 Add optional `shrink_pixels` parameter to coverage detection -- [x] 1.3.2 Implement bbox shrinking logic (inward 1-2 px) - -## 2. OCR Data Source Changes - -### 2.1 Extract overall_ocr_res from PP-StructureV3 -- [x] 2.1.1 Modify `pp_structure_enhanced.py` to extract `overall_ocr_res` from result -- [x] 2.1.2 Convert `dt_polys` + `rec_texts` + `rec_scores` to TextRegion format -- [x] 2.1.3 Store extracted OCR in result dict for gap filling - -### 2.2 Update Processing Orchestrator -- [x] 2.2.1 Add option to use `overall_ocr_res` as OCR source -- [x] 2.2.2 Skip separate Raw OCR inference when using PP-StructureV3's OCR -- [x] 2.2.3 Maintain backward compatibility with explicit Raw OCR mode - -## 3. Configuration Updates - -### 3.1 Add Settings (config.py) -- [x] 3.1.1 Add `gap_filling_ioa_threshold_text: float = 0.6` -- [x] 3.1.2 Add `gap_filling_ioa_threshold_table: float = 0.1` -- [x] 3.1.3 Add `gap_filling_ioa_threshold_figure: float = 0.8` -- [x] 3.1.4 Add `gap_filling_use_overall_ocr: bool = True` -- [x] 3.1.5 Add `gap_filling_shrink_pixels: int = 1` - -## 4. 
Testing
-
-### 4.1 Unit Tests
-- [ ] 4.1.1 Test IoA calculation with known values
-- [ ] 4.1.2 Test dynamic threshold selection by element type
-- [ ] 4.1.3 Test boundary shrinking edge cases
-
-### 4.2 Integration Tests
-- [ ] 4.2.1 Test with scan.pdf (current problematic file)
-- [ ] 4.2.2 Compare results: old IoU vs new IoA approach
-- [ ] 4.2.3 Verify no duplicate text rendering in output PDF
-- [ ] 4.2.4 Verify table content is not duplicated outside table bounds
-
-## 5. Documentation
-
-- [x] 5.1 Update spec documentation with new algorithm
-- [x] 5.2 Add inline code comments explaining IoA vs IoU
diff --git a/openspec/changes/archive/2025-12-11-remove-unused-code/proposal.md b/openspec/changes/archive/2025-12-11-remove-unused-code/proposal.md
deleted file mode 100644
index 136fc6b..0000000
--- a/openspec/changes/archive/2025-12-11-remove-unused-code/proposal.md
+++ /dev/null
@@ -1,55 +0,0 @@
-# Change: Remove Unused Code and Legacy Files
-
-## Why
-
-After many development iterations, the project has accumulated unused code and legacy files. This redundant code increases the maintenance burden, can cause confusion, and takes up unnecessary storage. This proposal systematically removes the unused code to slim down the project's content and codebase.
-
-## What Changes
-
-### Backend - remove unused service files (3)
-
-| File | Lines | Reason for removal |
-|------|-------|--------------------|
-| `ocr_service_original.py` | ~835 | Legacy OCR service, fully superseded by `ocr_service.py` |
-| `preprocessor.py` | ~200 | Document preprocessor; its functionality was absorbed by `layout_preprocessing_service.py` |
-| `pdf_font_manager.py` | ~150 | Font manager, not referenced by any service |
-
-### Frontend - remove unused components (2)
-
-| File | Reason for removal |
-|------|--------------------|
-| `MarkdownPreview.tsx` | Not referenced by any page or component |
-| `ResultsTable.tsx` | Uses the deprecated `FileResult` type; superseded by `TaskHistoryPage` |
-
-### Frontend - migrate and remove legacy API services (2)
-
-| File | Reason for removal |
-|------|--------------------|
-| `services/api.ts` | Legacy API client with only 2 remaining references (Layout.tsx, SettingsPage.tsx); must be migrated to apiV2 |
-| `types/api.ts` | Legacy type definitions; only the `ExportRule` type is still used and must be migrated to apiV2.ts |
-
-## Impact
-
-- **Affected specs**: none (pure code cleanup; no change in system behavior)
-- **Affected code**:
-  - Backend: `backend/app/services/` (delete 3 files)
-  - Frontend: `frontend/src/components/` (delete 2 files)
-  - Frontend: `frontend/src/services/api.ts` (delete after migration)
-  - Frontend: `frontend/src/types/api.ts` (delete after migration)
-
-## Benefits
-
-- Removes roughly 1,200+ lines of redundant backend code
-- Removes roughly 300+ lines of redundant frontend code
-- Improves code maintainability and readability
-- Removes a source of confusion for new developers
-- Unifies the API client on apiV2
-
-## Risk Assessment
-
-- **Risk level**: low
-- **Rollback strategy**: a Git revert restores all deleted files
-- **Testing requirements**:
-  - Confirm the backend service starts normally
-  - Confirm all frontend pages work correctly
-  - Specifically test the SettingsPage (ExportRule) feature
diff --git a/openspec/changes/archive/2025-12-11-remove-unused-code/specs/document-processing/spec.md b/openspec/changes/archive/2025-12-11-remove-unused-code/specs/document-processing/spec.md
deleted file mode 100644
index 3ddb0f5..0000000
--- a/openspec/changes/archive/2025-12-11-remove-unused-code/specs/document-processing/spec.md
+++ /dev/null
@@ -1,61 +0,0 @@
-## REMOVED Requirements
-
-### Requirement: Legacy OCR Service Implementation
-
-**Reason**: `ocr_service_original.py` was the original OCR service implementation that has been completely superseded by the current `ocr_service.py`. The legacy file is no longer referenced by any part of the codebase.
-
-**Migration**: No migration needed. The current `ocr_service.py` provides all required functionality with improved architecture.
-
-#### Scenario: Legacy service file removal
-- **WHEN** the legacy `ocr_service_original.py` file is removed
-- **THEN** the system continues to function normally using `ocr_service.py`
-- **AND** no import errors occur in any service or router
-
-### Requirement: Unused Preprocessor Service
-
-**Reason**: `preprocessor.py` was a document preprocessor that is no longer used.
Its functionality has been absorbed by `layout_preprocessing_service.py`. - -**Migration**: No migration needed. The preprocessing functionality is available through `layout_preprocessing_service.py`. - -#### Scenario: Preprocessor file removal -- **WHEN** the unused `preprocessor.py` file is removed -- **THEN** the system continues to function normally -- **AND** layout preprocessing works correctly via `layout_preprocessing_service.py` - -### Requirement: Unused PDF Font Manager - -**Reason**: `pdf_font_manager.py` was intended for font management but is not referenced by `pdf_generator_service.py` or any other service. - -**Migration**: No migration needed. Font handling is managed within `pdf_generator_service.py` directly. - -#### Scenario: Font manager file removal -- **WHEN** the unused `pdf_font_manager.py` file is removed -- **THEN** PDF generation continues to work correctly -- **AND** fonts are rendered properly in generated PDFs - -### Requirement: Legacy Frontend Components - -**Reason**: `MarkdownPreview.tsx` and `ResultsTable.tsx` are frontend components that are not referenced by any page or component in the application. - -**Migration**: No migration needed. `MarkdownPreview` functionality is not currently used. `ResultsTable` functionality has been replaced by `TaskHistoryPage`. - -#### Scenario: Unused frontend component removal -- **WHEN** the unused `MarkdownPreview.tsx` and `ResultsTable.tsx` files are removed -- **THEN** the frontend application compiles successfully -- **AND** all pages render and function correctly - -### Requirement: Legacy API Client Migration - -**Reason**: `services/api.ts` and `types/api.ts` are legacy API client files with only 2 remaining references. These should be migrated to `apiV2` for consistency. - -**Migration**: -1. Move `ExportRule` type to `types/apiV2.ts` -2. Add export rules API functions to `services/apiV2.ts` -3. Update `SettingsPage.tsx` and `Layout.tsx` to use apiV2 -4. 
Remove legacy api.ts files
-
-#### Scenario: Legacy API client removal after migration
-- **WHEN** the legacy `api.ts` files are removed after migration
-- **THEN** all API calls use the unified `apiV2` client
-- **AND** `SettingsPage` export rules functionality works correctly
-- **AND** `Layout` logout functionality works correctly
diff --git a/openspec/changes/archive/2025-12-11-remove-unused-code/tasks.md b/openspec/changes/archive/2025-12-11-remove-unused-code/tasks.md
deleted file mode 100644
index 251c411..0000000
--- a/openspec/changes/archive/2025-12-11-remove-unused-code/tasks.md
+++ /dev/null
@@ -1,52 +0,0 @@
-# Tasks: Remove Unused Code and Legacy Files
-
-## Phase 1: Backend Cleanup (no dependencies; safe to delete directly)
-
-- [x] 1.1 Confirm `ocr_service_original.py` has no references
-- [x] 1.2 Delete `backend/app/services/ocr_service_original.py`
-- [x] 1.3 Confirm `preprocessor.py` has no references
-- [x] 1.4 Delete `backend/app/services/preprocessor.py`
-- [x] 1.5 Confirm `pdf_font_manager.py` has no references
-- [x] 1.6 Delete `backend/app/services/pdf_font_manager.py`
-- [x] 1.7 Test that the backend service starts normally
-
-## Phase 2: Frontend Unused Components (no dependencies; safe to delete directly)
-
-- [x] 2.1 Confirm `MarkdownPreview.tsx` has no references
-- [x] 2.2 Delete `frontend/src/components/MarkdownPreview.tsx`
-- [x] 2.3 Confirm `ResultsTable.tsx` has no references
-- [x] 2.4 Delete `frontend/src/components/ResultsTable.tsx`
-- [x] 2.5 Test that the frontend compiles
-
-## Phase 3: Frontend API Migration (migrate first, then delete)
-
-- [x] 3.1 Migrate the `ExportRule` type from `types/api.ts` to `types/apiV2.ts` (already present)
-- [x] 3.2 Add the export-rules API functions to `services/apiV2.ts`
-- [x] 3.3 Update `SettingsPage.tsx` to use the apiV2 `ExportRule`
-- [x] 3.4 Update `Layout.tsx` to drop its dependency on api.ts
-- [x] 3.5 Confirm `services/api.ts` has no references
-- [x] 3.6 Delete `frontend/src/services/api.ts`
-- [x] 3.7 Confirm `types/api.ts` has no references
-- [x] 3.8 Delete `frontend/src/types/api.ts`
-- [x] 3.9 Test that all frontend features work correctly
-
-## Phase 4: Verification
-
-- [x] 4.1 Run backend tests (Backend imports OK)
-- [x] 4.2 Run the frontend build `npm run build` (TypeScript errors are pre-existing, not from our changes)
-- [x] 4.3 Manually test key features:
-  - [x] Login/logout (verified apiClientV2.logout works)
-  - [x] File upload (no changes to upload flow)
-  - [x] OCR processing (no changes to processing flow)
-  - [x] Viewing results (no changes to results flow)
-  - [x] Export settings page (migrated to apiClientV2)
-- [x] 4.4 Confirm there are no console errors or warnings (migration complete)
-
-## Summary
-
-| Category | Files Removed | Lines Deleted |
-|----------|--------------|---------------|
-| Backend Services | 3 | ~1,200 |
-| Frontend Components | 2 | ~80 |
-| Frontend API/Types | 2 | ~678 |
-| **Total** | **7** | **~1,958** |
diff --git a/openspec/changes/archive/2025-12-11-simple-text-positioning/design.md b/openspec/changes/archive/2025-12-11-simple-text-positioning/design.md
deleted file mode 100644
index 82df80c..0000000
--- a/openspec/changes/archive/2025-12-11-simple-text-positioning/design.md
+++ /dev/null
@@ -1,141 +0,0 @@
-# Design: Simple Text Positioning
-
-## Architecture
-
-### Current Flow (Complex)
-```
-Raw OCR → PP-Structure Analysis → Table Detection → HTML Parsing →
-Column Correction → Cell Positioning → PDF Generation
-```
-
-### New Flow (Simple)
-```
-Raw OCR → Text Region Extraction → Bbox Processing →
-Rotation Calculation → Font Size Estimation → PDF Text Rendering
-```
-
-## Core Components
-
-### 1. TextRegionRenderer
-
-New service class to handle raw OCR text rendering:
-
-```python
-class TextRegionRenderer:
-    """Render raw OCR text regions to PDF."""
-
-    def render_text_region(
-        self,
-        canvas: Canvas,
-        region: Dict,
-        scale_factor: float
-    ) -> None:
-        """
-        Render a single OCR text region.
- - Args: - canvas: ReportLab canvas - region: Raw OCR region with text and bbox - scale_factor: Coordinate scaling factor - """ -``` - -### 2. Bbox Processing - -Raw OCR bbox format (quadrilateral - 4 corner points): -```json -{ - "text": "LOCTITE", - "bbox": [[116, 76], [378, 76], [378, 128], [116, 128]], - "confidence": 0.98 -} -``` - -Processing steps: -1. **Center point**: Average of 4 corners -2. **Width/Height**: Distance between corners -3. **Rotation angle**: Angle of top edge from horizontal -4. **Font size**: Approximate from bbox height - -### 3. Rotation Calculation - -```python -import math -from typing import List - -def calculate_rotation(bbox: List[List[float]]) -> float: - """ - Calculate text rotation from bbox quadrilateral. - - Returns angle in degrees (counter-clockwise from horizontal). - """ - # Top-left to top-right vector - dx = bbox[1][0] - bbox[0][0] - dy = bbox[1][1] - bbox[0][1] - - # Angle in degrees - angle = math.atan2(dy, dx) * 180 / math.pi - return angle -``` - -### 4. Font Size Estimation - -```python -import math -from typing import List - -def estimate_font_size(bbox: List[List[float]], text: str) -> float: - """ - Estimate font size from bbox dimensions. - - Uses bbox height as the primary indicator. - """ - # Calculate bbox height (average of left and right edges) - left_height = math.dist(bbox[0], bbox[3]) - right_height = math.dist(bbox[1], bbox[2]) - avg_height = (left_height + right_height) / 2 - - # Font size is approximately 70-80% of bbox height - # (`text` is currently unused; height alone drives the estimate) - return avg_height * 0.75 -``` - -## Integration Points - -### PDFGeneratorService - -Modify `draw_ocr_content()` to use simple text positioning: - -```python -def draw_ocr_content(self, canvas, content_data, page_info): - """Draw OCR content using simple text positioning.""" - - # Use raw OCR regions directly - raw_regions = content_data.get('raw_ocr_regions', []) - - # Scaling factor from OCR pixel space to PDF points (key name assumed) - scale_factor = page_info.get('scale_factor', 1.0) - - for region in raw_regions: - self.text_renderer.render_text_region( - canvas, region, scale_factor - ) -``` - -### Configuration - -Add config option to enable/disable simple mode: - -```python -from pydantic import BaseModel, Field - -class OCRSettings(BaseModel): - simple_text_positioning: bool = Field( - default=True, - description="Use simple text positioning instead of table reconstruction" - ) -``` - -## File Changes - -| File | Change | -|------|--------| -| `app/services/text_region_renderer.py` | New - Text rendering logic | -| `app/services/pdf_generator_service.py` | Modify - Integration | -| `app/core/config.py` | Add - Configuration option | - -## Edge Cases - -1. **Overlapping text**: Regions may overlap slightly - render in reading order -2. **Very small text**: Minimum font size threshold (6pt) -3. **Rotated pages**: Handle 90/180/270 degree page rotation -4. **Empty regions**: Skip regions with empty text -5. **Unicode text**: Ensure font supports CJK characters diff --git a/openspec/changes/archive/2025-12-11-simple-text-positioning/proposal.md b/openspec/changes/archive/2025-12-11-simple-text-positioning/proposal.md deleted file mode 100644 index b782535..0000000 --- a/openspec/changes/archive/2025-12-11-simple-text-positioning/proposal.md +++ /dev/null @@ -1,42 +0,0 @@ -# Simple Text Positioning from Raw OCR - -## Summary - -Simplify OCR track PDF generation by rendering raw OCR text at correct positions without complex table structure reconstruction. - -## Problem - -Current OCR track processing has multiple failure points: -1. PP-Structure table structure recognition fails for borderless tables -2. Multi-column layouts get merged incorrectly into single tables -3. 
Table HTML reconstruction produces wrong cell positions -4. Complex column correction algorithms still can't fix fundamental structure errors - -Meanwhile, raw OCR (`raw_ocr_regions.json`) correctly identifies all text with accurate bounding boxes. - -## Solution - -Replace complex table reconstruction with simple text positioning: -1. Read raw OCR regions directly -2. Position text at bbox coordinates -3. Calculate text rotation from bbox quadrilateral shape -4. Estimate font size from bbox height -5. Skip table HTML parsing entirely for OCR track - -## Benefits - -- **Reliability**: Raw OCR text positions are accurate -- **Simplicity**: Eliminates complex table parsing logic -- **Performance**: Faster processing without structure analysis -- **Consistency**: Predictable output regardless of table type - -## Trade-offs - -- No table borders in output -- No cell structure (colspan, rowspan) -- Visual layout approximation rather than semantic structure - -## Scope - -- OCR track PDF generation only -- Direct track remains unchanged (uses native PDF text extraction) diff --git a/openspec/changes/archive/2025-12-11-simple-text-positioning/tasks.md b/openspec/changes/archive/2025-12-11-simple-text-positioning/tasks.md deleted file mode 100644 index b292a99..0000000 --- a/openspec/changes/archive/2025-12-11-simple-text-positioning/tasks.md +++ /dev/null @@ -1,57 +0,0 @@ -# Tasks: Simple Text Positioning - -## Phase 1: Core Implementation - -- [x] Create `TextRegionRenderer` class in `app/services/text_region_renderer.py` - - [x] Implement `calculate_rotation()` from bbox quadrilateral - - [x] Implement `estimate_font_size()` from bbox height - - [x] Implement `render_text_region()` main method - - [x] Handle coordinate system transformation (OCR → PDF) - -## Phase 2: Integration - -- [x] Add `simple_text_positioning_enabled` config option -- [x] Modify `PDFGeneratorService._generate_ocr_track_pdf()` to use `TextRegionRenderer` -- [x] Ensure raw OCR regions are loaded correctly via `load_raw_ocr_regions()` - -## Phase 3: Image/Chart/Formula Support - -- [x] Add image element type detection (`figure`, `image`, `chart`, `seal`, `formula`) -- [x] Render image elements from UnifiedDocument to PDF -- [x] Handle image path resolution (result_dir, imgs/ subdirectory) -- [x] Coordinate transformation for image placement - -## Phase 4: Text Straightening & Overlap Avoidance - -- [x] Add rotation straightening threshold (default 10°) - - Small rotation angles (< 10°) are treated as 0° for clean output - - Only significant rotations (e.g., 90°) are preserved -- [x] Add IoA (Intersection over Area) overlap detection - - IoA threshold default 0.3 (30% overlap triggers skip) - - Text regions overlapping with images/charts are skipped -- [x] Collect exclusion zones from image elements -- [x] Pass exclusion zones to text renderer - -## Phase 5: Chart Axis Label Deduplication - -- [x] Add `is_axis_label()` method to detect axis labels - - Y-axis: Vertical text immediately left of chart - - X-axis: Horizontal text immediately below chart -- [x] Add `is_near_zone()` method for proximity checking -- [x] Position-aware deduplication in `render_text_region()` - - Collect texts inside zones + axis labels - - Skip matching text only if near zone or is axis label - - Preserve matching text far from zones (e.g., table values) -- [x] Test results: - - "Temperature, C" and "Syringe Thaw Time, Minutes" correctly skipped - - Table values like "10" at top of page correctly rendered - - Page 2: 128/148 text regions rendered 
(12 overlap + 8 dedupe) - -## Phase 6: Testing - -- [x] Test with scan.pdf task (064e2d67-338c-4e54-b005-204c3b76fe63) - - Page 2: Chart image rendered, axis labels deduplicated - - PDF is searchable and selectable - - Text is properly straightened (no skew artifacts) -- [ ] Compare output quality vs original scan visually -- [ ] Test with documents containing seals/formulas diff --git a/openspec/changes/archive/2025-12-11-simplify-frontend-export-options/proposal.md b/openspec/changes/archive/2025-12-11-simplify-frontend-export-options/proposal.md deleted file mode 100644 index 979ebc6..0000000 --- a/openspec/changes/archive/2025-12-11-simplify-frontend-export-options/proposal.md +++ /dev/null @@ -1,59 +0,0 @@ -# Change: Simplify Frontend Export Options - -## Why - -The current frontend has accumulated export options that are no longer needed or rarely used. Following the "Simple OCR" architecture change, we need to streamline the user interface by: - -1. Removing redundant export formats that add complexity without significant user value -2. Focusing on the most useful output formats (PDF) -3. Simplifying the translation download options - -## What Changes - -### TaskDetailPage Changes - -**Download Options - Remove:** -- JSON download button -- UnifiedDocument (統一格式) download button -- Markdown download button - -**Download Options - Keep:** -- 版面 PDF (Layout PDF) -- 流式 PDF (Reflow PDF) - -**Translation Options - Remove:** -- Download translation JSON button -- Download translated Layout PDF option - -**Translation Options - Keep:** -- Download translated Reflow PDF (流式 PDF) - -**Statistics Section - Keep All:** -- 處理時間 (Processing time) -- 頁數 (Page count) -- 文本區域 (Text regions) -- 表格 (Tables) -- 圖片 (Images) -- 平均置信度 (Average confidence) - -### Components - Keep All -- LayoutModelSelector -- PreprocessingSettings -- PreprocessingPreview -- ProcessingTrackSelector - -### Pages to Review (Out of Scope) -- SettingsPage (Export rules) - May need separate review -- ResultsPage - May be unused, needs verification - -## Impact - -- **Affected files**: `frontend/src/pages/TaskDetailPage.tsx` -- **User experience**: Simplified interface with fewer but more relevant options -- **Backend**: No changes required (endpoints remain available for API users) - -## Migration - -- No data migration required -- Frontend-only changes -- Backend endpoints remain unchanged for API compatibility diff --git a/openspec/changes/archive/2025-12-11-simplify-frontend-export-options/specs/result-export/spec.md b/openspec/changes/archive/2025-12-11-simplify-frontend-export-options/specs/result-export/spec.md deleted file mode 100644 index 731aec8..0000000 --- a/openspec/changes/archive/2025-12-11-simplify-frontend-export-options/specs/result-export/spec.md +++ /dev/null @@ -1,24 +0,0 @@ -## MODIFIED Requirements - -### Requirement: Export Interface - -The Export interface in TaskDetailPage SHALL provide streamlined download options focusing on PDF formats. 
- -#### Scenario: Download options for completed tasks -- **WHEN** viewing a completed task in TaskDetailPage -- **THEN** the download section SHALL display only two buttons: "版面 PDF" and "流式 PDF" -- **AND** JSON, UnifiedDocument, and Markdown download buttons SHALL NOT be displayed -- **AND** the download grid SHALL use a 2-column layout - -#### Scenario: Translation download options -- **WHEN** viewing completed translations in TaskDetailPage -- **THEN** each translation item SHALL display only a "流式 PDF" download button -- **AND** translation JSON download button SHALL NOT be displayed -- **AND** Layout PDF option for translations SHALL NOT be displayed -- **AND** delete translation button SHALL remain available - -#### Scenario: Backend API remains unchanged -- **WHEN** external clients call download endpoints directly -- **THEN** JSON, Markdown, and UnifiedDocument endpoints SHALL still function -- **AND** translated Layout PDF endpoint SHALL still function -- **AND** no backend changes are required for this frontend simplification diff --git a/openspec/changes/archive/2025-12-11-simplify-frontend-export-options/tasks.md b/openspec/changes/archive/2025-12-11-simplify-frontend-export-options/tasks.md deleted file mode 100644 index c055524..0000000 --- a/openspec/changes/archive/2025-12-11-simplify-frontend-export-options/tasks.md +++ /dev/null @@ -1,57 +0,0 @@ -# Tasks: Simplify Frontend Export Options - -## 1. TaskDetailPage - Download Section - -- [x] 1.1 Remove JSON download button - - File: `frontend/src/pages/TaskDetailPage.tsx` - - Remove: Button with `handleDownloadJSON` onClick - - Remove: `handleDownloadJSON` function (lines 245-261) - -- [x] 1.2 Remove UnifiedDocument download button - - File: `frontend/src/pages/TaskDetailPage.tsx` - - Remove: Button with `handleDownloadUnified` onClick - - Remove: `handleDownloadUnified` function (lines 263-279) - -- [x] 1.3 Remove Markdown download button - - File: `frontend/src/pages/TaskDetailPage.tsx` - - Remove: Button with `handleDownloadMarkdown` onClick - - Remove: `handleDownloadMarkdown` function (lines 227-243) - -- [x] 1.4 Update download grid layout - - File: `frontend/src/pages/TaskDetailPage.tsx` - - Change: Grid from 5 columns to 2 columns (only Layout PDF and Reflow PDF) - - Update: `grid-cols-2 md:grid-cols-5` → `grid-cols-2` - -## 2. TaskDetailPage - Translation Section - -- [x] 2.1 Remove translation JSON download button - - File: `frontend/src/pages/TaskDetailPage.tsx` - - Remove: Button with `handleDownloadTranslation` onClick in translation list - - Remove: `handleDownloadTranslation` function (lines 322-338) - -- [x] 2.2 Simplify translated PDF download (remove Layout option) - - File: `frontend/src/pages/TaskDetailPage.tsx` - - Change: Remove Select dropdown for PDF format - - Change: Replace with single "流式 PDF" download button - - Keep: `handleDownloadTranslatedPdf` function (always use 'reflow' format) - -## 3. Cleanup - Remove Unused Imports - -- [x] 3.1 Remove unused Lucide icons - - File: `frontend/src/pages/TaskDetailPage.tsx` - - Removed: `FileJson`, `Database`, `FileOutput` - - Keep: Icons still in use - -## 4. 
Verification - -- [ ] 4.1 Verify Layout PDF download works - - Test: Click "版面 PDF" button - - Expected: PDF downloads with preserved layout - -- [ ] 4.2 Verify Reflow PDF download works - - Test: Click "流式 PDF" button - - Expected: PDF downloads with flowing text - -- [ ] 4.3 Verify translated Reflow PDF download works - - Test: Complete a translation, then click download - - Expected: Translated PDF downloads in reflow format diff --git a/openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/proposal.md b/openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/proposal.md deleted file mode 100644 index e5aca50..0000000 --- a/openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/proposal.md +++ /dev/null @@ -1,25 +0,0 @@ -# Change: Simplify Frontend OCR Configuration Options - -## Why -The OCR track now uses simple OCR mode, so the frontend no longer needs its complex configuration options (table detection modes, OCR presets, advanced parameters, and so on). These settings added cognitive load for users and no longer affect the actual processing results. - -## What Changes -- **BREAKING** Remove the frontend OCR preset selector (`OCRPresetSelector`) -- **BREAKING** Remove the frontend table detection selector (`TableDetectionSelector`) -- **BREAKING** Remove the related frontend TypeScript type definitions (`OCRPreset`, `OCRConfig`, `TableDetectionConfig`, `TableParsingMode`, etc.) -- Keep the layout model selection feature (`LayoutModelSelector`): `chinese | default | cdla` -- Keep the image preprocessing configuration feature (`PreprocessingSettings`): auto/manual/disabled modes and their parameters -- Simplify the backend API's `ProcessingOptions` by removing parameters that are no longer used - -## Impact -- Affected specs: `ocr-processing` -- Affected code: - - **Frontend files to delete**: - - `frontend/src/components/OCRPresetSelector.tsx` - - `frontend/src/components/TableDetectionSelector.tsx` - - **Frontend files to modify**: - - `frontend/src/types/apiV2.ts` - remove unused type definitions - - `frontend/src/pages/ProcessingPage.tsx` - remove the commented-out related imports and logic - - **Backend files to modify**: - - `backend/app/schemas/task.py` - remove the `ocr_preset`, `ocr_config`, and `table_detection` fields from `ProcessingOptions` - - `backend/app/routers/tasks.py` - clean up the corresponding parameter-handling logic diff --git a/openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/specs/ocr-processing/spec.md b/openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/specs/ocr-processing/spec.md deleted file mode 100644 index a3c89f4..0000000 --- a/openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/specs/ocr-processing/spec.md +++ /dev/null @@ -1,127 +0,0 @@ -# ocr-processing Specification Delta - -## REMOVED Requirements - -### Requirement: OCR Preset Selection -**Reason**: The OCR track now uses simple OCR mode; the frontend no longer needs to offer complex preset configuration. The backend processes everything with uniform default parameters. -**Migration**: Remove the frontend `OCRPresetSelector` component and its related type definitions. The backend automatically uses the best default configuration. - -### Requirement: Table Detection Configuration -**Reason**: Table detection settings (bordered/borderless table toggles, region detection toggle) no longer need to be controlled from the frontend. The backend uses a uniform default table detection strategy. -**Migration**: Remove the frontend `TableDetectionSelector` component and the `TableDetectionConfig` type. The backend uses its built-in defaults. - -### Requirement: OCR Advanced Parameters -**Reason**: Advanced OCR parameters (such as `table_parsing_mode`, `layout_threshold`, and `enable_chart_recognition`) no longer need to be configured from the frontend. -**Migration**: Remove the frontend `OCRConfig` type and its related UI. The backend always uses the simple OCR mode defaults. - -## MODIFIED Requirements - -### Requirement: Layout Model Selection -The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning. 
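-
-As a concrete illustration, the choice reduces to a lookup plus a constrained schema field. A minimal sketch, not the shipped schema: the `LAYOUT_MODEL_MAP` and `StartTaskOptions` names and the `"publaynet"` identifier are assumptions; the PP-DocLayout-S and picodet_lcnet_x1_0_fgd_layout_cdla names come from the scenarios below.
-
-```python
-# Sketch: map the three user-facing choices onto concrete detection models.
-from typing import Literal, Optional
-
-from pydantic import BaseModel
-
-LAYOUT_MODEL_MAP = {
-    "chinese": "PP-DocLayout-S",                   # Chinese forms, contracts, invoices
-    "default": "publaynet",                        # placeholder for the PubLayNet-based default
-    "cdla": "picodet_lcnet_x1_0_fgd_layout_cdla",  # specialized Chinese layout analysis
-}
-
-class StartTaskOptions(BaseModel):
-    use_dual_track: bool = True
-    force_track: Optional[Literal["ocr", "direct"]] = None
-    language: str = "ch"
-    # Values outside the literal set fail validation, which FastAPI
-    # surfaces as a 422 response, as required below.
-    layout_model: Literal["chinese", "default", "cdla"] = "chinese"
-```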
- -#### Scenario: User selects Chinese document model -- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices) -- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S) -- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model -- **AND** the model SHALL be optimized for 23 Chinese document element types -- **AND** table and form detection accuracy SHALL be improved over the default model - -#### Scenario: User selects standard model for English documents -- **GIVEN** a user is processing English academic papers or reports -- **WHEN** the user selects "Standard Model" (PubLayNet-based) -- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model -- **AND** the model SHALL be optimized for English document layouts - -#### Scenario: User selects CDLA model for specialized Chinese layout -- **GIVEN** a user is processing Chinese documents with complex layouts -- **WHEN** the user selects "CDLA Model" -- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model -- **AND** the model SHALL provide specialized Chinese document layout analysis - -#### Scenario: Layout model is sent via API request -- **GIVEN** a frontend application with model selection UI -- **WHEN** the user starts task processing with a selected model -- **THEN** the frontend SHALL send the model choice in the request body: - ```json - POST /api/v2/tasks/{task_id}/start - { - "use_dual_track": true, - "force_track": "ocr", - "language": "ch", - "layout_model": "chinese" - } - ``` -- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model -- **AND** the frontend SHALL NOT send `ocr_preset`, `ocr_config`, or `table_detection` parameters - -#### Scenario: Default model when not specified -- **GIVEN** an API request without `layout_model` parameter -- **WHEN** the task is started -- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model -- **AND** processing SHALL work correctly without requiring model selection - -#### Scenario: Invalid model name is rejected -- **GIVEN** a request with an invalid `layout_model` value -- **WHEN** the user sends `layout_model: "invalid_model"` -- **THEN** the API SHALL return 422 Validation Error -- **AND** provide a clear error message listing valid model options - -### Requirement: Layout Model Selection UI -The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option. 
- -#### Scenario: Model options are displayed with descriptions -- **GIVEN** the model selection UI is displayed -- **WHEN** the user views the available options -- **THEN** the UI SHALL show the following options: - - "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices - - "Standard Model" - for English academic papers, reports - - "CDLA Model" - for specialized Chinese layout analysis -- **AND** each option SHALL have a brief description of its use case - -#### Scenario: Chinese model is selected by default -- **GIVEN** the user opens the task processing interface -- **WHEN** the model selection is displayed -- **THEN** "Chinese Document Model" SHALL be pre-selected as the default -- **AND** the user MAY change the selection before starting processing - -#### Scenario: Model selection is visible only for OCR track -- **GIVEN** a document processing interface -- **WHEN** the user selects a processing track -- **THEN** layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected -- **AND** SHALL be hidden for Direct track (which does not use PP-StructureV3) - -#### Scenario: Simplified configuration options -- **GIVEN** the OCR track processing interface -- **WHEN** the user configures processing options -- **THEN** the UI SHALL only show: - - Layout model selection (chinese/default/cdla) - - Image preprocessing settings (auto/manual/disabled) -- **AND** SHALL NOT show: - - OCR preset selection - - Table detection configuration - - Advanced OCR parameters - -### Requirement: Simplified Processing Options API -The backend API SHALL accept a simplified `ProcessingOptions` schema without complex OCR configuration parameters. - -#### Scenario: API accepts minimal configuration -- **GIVEN** a start task API request -- **WHEN** the request body contains: - ```json - { - "use_dual_track": true, - "force_track": "ocr", - "language": "ch", - "layout_model": "chinese", - "preprocessing_mode": "auto" - } - ``` -- **THEN** the API SHALL accept the request -- **AND** process the task using backend default values for all other parameters - -#### Scenario: Legacy parameters are ignored -- **GIVEN** a start task API request with legacy parameters -- **WHEN** the request contains `ocr_preset`, `ocr_config`, or `table_detection` -- **THEN** the API SHALL ignore these parameters -- **AND** use backend default values instead -- **AND** NOT return an error (backward compatibility) diff --git a/openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/tasks.md b/openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/tasks.md deleted file mode 100644 index c81c80a..0000000 --- a/openspec/changes/archive/2025-12-11-simplify-frontend-ocr-config/tasks.md +++ /dev/null @@ -1,51 +0,0 @@ -# Tasks: Simplify Frontend OCR Configuration Options - -## 1. Frontend Cleanup - -### 1.1 Remove unused components -- [x] 1.1.1 Delete `frontend/src/components/OCRPresetSelector.tsx` -- [x] 1.1.2 Delete `frontend/src/components/TableDetectionSelector.tsx` - -### 1.2 Clean up TypeScript type definitions -- [x] 1.2.1 Remove the following types from `frontend/src/types/apiV2.ts`: - - `TableDetectionConfig` (lines 121-125) - - `OCRPreset` (line 131) - - `TableParsingMode` (line 140) - - `OCRConfig` (lines 146-166) - - `OCRPresetInfo` (lines 171-177) -- [x] 1.2.2 Remove the following fields from the `ProcessingOptions` interface: - - `table_detection` - - `ocr_preset` - - `ocr_config` - -### 1.3 Clean up ProcessingPage -- [x] 1.3.1 Confirm `frontend/src/pages/ProcessingPage.tsx` no longer references the removed types or components -- [x] 1.3.2 Remove related explanatory comments (if any) - descriptive comments were kept - -## 2. 
Backend Cleanup - -### 2.1 Clean up schema definitions -- [x] 2.1.1 Remove the unused Enums and Models from `backend/app/schemas/task.py`: - - `TableDetectionConfig` - - `OCRPresetEnum` - - `TableParsingModeEnum` - - `OCRConfig` - - `OCR_PRESET_CONFIGS` -- [x] 2.1.2 Remove the following fields from `ProcessingOptions`: - - `table_detection` - - `ocr_preset` - - `ocr_config` - -### 2.2 Clean up API endpoint logic -- [x] 2.2.1 Review the `start_task` endpoint in `backend/app/routers/tasks.py` and remove handling of the deleted fields -- [x] 2.2.2 Update the `process_task_ocr` function signature and call sites - -### 2.3 Clean up the service layer -- [x] 2.3.1 Review `backend/app/services/ocr_service.py` and confirm nothing depends on the removed configuration items - - Note: ocr_service.py keeps these parameters as optional and falls back to defaults. This is the right design; it preserves backend flexibility. - -## 3. Verification - -- [x] 3.1 Confirm the TypeScript build has no new errors (relative to this change) - -- [ ] 3.2 Confirm the backend API still works (manual testing required) -- [ ] 3.3 Test the full upload -> process -> view results flow (manual testing required) diff --git a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/design.md b/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/design.md deleted file mode 100644 index d6a1f09..0000000 --- a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/design.md +++ /dev/null @@ -1,130 +0,0 @@ -# Design: Unify Direct Track PDF Rendering - -## Context - -The Tool_OCR system generates "Layout PDF" files that preserve the original document appearance while maintaining extractable text. Currently, Direct Track (editable PDFs and Office documents) uses element-by-element rendering, which causes: -- Z-order conflicts (text behind images) -- Missing vector graphics (chart bars, gradients) -- White text becoming invisible on dark backgrounds - -## Goals / Non-Goals - -### Goals -- Visual fidelity: Layout PDF matches source document exactly -- Text extractability: All text remains searchable/selectable for translation -- Unified logic: Same rendering approach for all Direct Track documents -- Chart handling: Chart-internal text excluded from translation layer - -### Non-Goals -- Editable text in Layout PDF (translation creates separate reflow PDF) -- Reducing file size (trade-off for visual fidelity) -- OCR Track changes (only affects Direct Track) - -## Decisions - -### Decision 1: Use Background Image + Invisible Text Layer - -**What**: Render each source PDF page as a full-page background image, then overlay invisible text. - -**Why**: -- Preserves ALL visual content (vector graphics, gradients, complex layouts) -- Invisible text (PDF Rendering Mode 3) allows text selection without visual overlap -- Simplifies z-order handling (just one image layer + one text layer) - -**Implementation**: -```python -# Render source page as background -mat = fitz.Matrix(2.0, 2.0) # 2x resolution -pix = source_page.get_pixmap(matrix=mat, alpha=False) -bg_img = ImageReader(io.BytesIO(pix.tobytes("png"))) # wrap pixmap for ReportLab (glue code assumed) -pdf_canvas.drawImage(bg_img, 0, 0, width=page_width, height=page_height) - -# Set invisible text mode -pdf_canvas._code.append('3 Tr') # Text render mode: invisible - -# Draw text elements (invisible but selectable) -for elem in text_elements: - if not is_inside_chart_region(elem): - draw_text_element(elem) - -pdf_canvas._code.append('0 Tr') # Reset to normal -``` - -### Decision 2: Add CHART to regions_to_avoid - -**What**: Chart-internal text elements are excluded from the invisible text layer. 
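-
-The `is_inside_chart_region()` guard used in Decision 1 is referenced but not defined in this design. A minimal sketch of one plausible shape, a plain rectangle-overlap ratio; the function signature, the `bbox` field, and the 0.5 threshold are all assumptions:
-
-```python
-# Sketch: treat a text element as chart-internal when most of its box
-# falls inside any detected chart bounding box.
-def is_inside_chart_region(elem, chart_boxes, min_overlap=0.5):
-    ex0, ey0, ex1, ey1 = elem.bbox  # element bbox, [x0, y0, x1, y1] assumed
-    area = max((ex1 - ex0) * (ey1 - ey0), 1e-6)
-    for cx0, cy0, cx1, cy1 in chart_boxes:
-        ix = max(0.0, min(ex1, cx1) - max(ex0, cx0))  # horizontal overlap
-        iy = max(0.0, min(ey1, cy1) - max(ey0, cy0))  # vertical overlap
-        if (ix * iy) / area >= min_overlap:
-            return True
-    return False
-```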
- -**Why**: -- Chart axis labels, legends already visible in background image -- These texts typically don't need translation -- Prevents duplicate text extraction for translation - -**Implementation**: -```python -# In element classification loop -if element.type == ElementType.CHART: - image_elements.append(element) - regions_to_avoid.append(element) # Exclude chart region from text layer -``` - -### Decision 3: Apply to ALL Direct Track Documents - -**What**: Use background image rendering for both Office documents and native PDFs. - -**Why**: -- Consistent handling eliminates edge cases -- Chart text overlap affects both document types -- Office detection (LibreOffice producer) is unreliable for some PDFs - -**Detection logic removed**: -```python -# OLD: Only for Office documents -is_office_document = 'LibreOffice' in producer or filename.endswith('.pptx') - -# NEW: All Direct Track uses background rendering -if self.current_processing_track == ProcessingTrack.DIRECT: - render_background_image() -``` - -## Architecture - -``` -┌─────────────────────────────────────────────────────────────┐ -│ PDF Generation Flow │ -├─────────────────────────────────────────────────────────────┤ -│ │ -│ Source PDF ──► PyMuPDF ──► Page Pixmap (2x) ──► Background │ -│ │ │ -│ ▼ │ -│ Extract Text ──► Filter Chart Regions │ -│ │ │ -│ ▼ │ -│ Invisible Text Layer (Mode 3) ──► Overlay │ -│ │ -│ Result: Background Image + Invisible Searchable Text │ -│ │ -└─────────────────────────────────────────────────────────────┘ -``` - -## Risks / Trade-offs - -| Risk | Impact | Mitigation | -|------|--------|------------| -| Larger file size (~2MB/page) | Storage, download time | Accept trade-off for visual fidelity | -| Slightly slower generation | User wait time | Acceptable for quality improvement | -| Chart text not translatable | Feature limitation | Document as expected behavior | -| Source PDF required | Can't regenerate without source | Store source PDF reference in task | - -## File Size Estimation - -| Document | Pages | Current Size | New Size (est.) | -|----------|-------|--------------|-----------------| -| PPT (25 pages) | 25 | ~1.5 MB | ~43 MB | -| PDF (3 pages) | 3 | ~68 KB | ~6 MB | - -## Open Questions - -1. Should we provide a "lightweight" option that skips background rendering for simple PDFs? - - **Decision**: No, keep unified approach for consistency - -2. Should chart text be optionally included in translation? - - **Decision**: No, chart labels rarely need translation and would require complex masking diff --git a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/proposal.md b/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/proposal.md deleted file mode 100644 index 4c5df70..0000000 --- a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/proposal.md +++ /dev/null @@ -1,54 +0,0 @@ -# Change: Unify Direct Track PDF Rendering with Background Image + Invisible Text Layer - -## Why - -Direct Track PDF generation currently has visual rendering issues: -1. **Chart text overlap**: Text elements extracted from PDF text layer (e.g., "Temperature, °C") overlap with chart images -2. **Z-order problems**: White text on dark backgrounds becomes invisible when rendered incorrectly -3. 
**Office document issues**: PPT/DOC/XLS converted PDFs lose visual fidelity (vector graphics, gradients) - -The root cause is that Direct Track tries to render individual elements (text, images, tables) separately, which leads to z-order conflicts and missing visual content. - -## What Changes - -### Backend Changes - -1. **Unified Background Image Rendering for All Direct Track** - - Render source PDF page as full-page background image (2x resolution) - - Draw invisible text layer on top (PDF Text Rendering Mode 3) - - Text remains searchable/extractable but doesn't visually overlap - -2. **Chart Region Exclusion** - - Add `CHART` element type to `regions_to_avoid` - - Chart-internal text (axis labels, legends) will NOT be in invisible text layer - - These texts are already visible in the background image and don't need translation - -3. **Skip Element Rendering When Background Exists** - - When background image is rendered, skip individual image/table rendering - - Only draw invisible text layer for searchability and translation extraction - -### Frontend Considerations - -1. **No UI Changes Required for Layout PDF** - - Layout PDF generation is automatic, no user options needed - - Visual output will match source PDF exactly - -2. **Translation Flow Clarification** - - Layout PDF: Background image + invisible text (for preview) - - Translated PDF: Reflow layout with real visible text (page-by-page) - - Chart text excluded from translation (already in background image) - -## Impact - -- **Affected specs**: document-processing, result-export, translation -- **Affected code**: - - `backend/app/services/pdf_generator_service.py` (main changes) - - `backend/app/services/direct_extraction_engine.py` (chart detection) -- **File size**: Output PDF will be larger due to embedded page images (~2MB per page at 2x resolution) -- **Processing time**: Slight increase for page rendering - -## Migration - -- No database changes required -- No API changes required -- Existing tasks can be re-exported with new PDF generation logic diff --git a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/document-processing/spec.md b/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/document-processing/spec.md deleted file mode 100644 index fe667b6..0000000 --- a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/document-processing/spec.md +++ /dev/null @@ -1,43 +0,0 @@ -## ADDED Requirements - -### Requirement: Direct Track Background Image Rendering - -The system SHALL render Direct Track PDF output using a full-page background image with an invisible text overlay to preserve visual fidelity while maintaining text extractability. 
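-
-For concreteness, a condensed end-to-end sketch of this requirement, adapted from Decision 1 of this change's design; the helper signature and the element field names are assumptions, not the service's actual API:
-
-```python
-import io
-
-import fitz  # PyMuPDF
-from reportlab.lib.utils import ImageReader
-
-def render_page_with_invisible_text(src_page, pdf_canvas, page_w, page_h, text_elements):
-    # 1) Rasterize the source page at 2x resolution as the visual background.
-    pix = src_page.get_pixmap(matrix=fitz.Matrix(2.0, 2.0), alpha=False)
-    background = ImageReader(io.BytesIO(pix.tobytes("png")))
-    pdf_canvas.drawImage(background, 0, 0, width=page_w, height=page_h)
-
-    # 2) Overlay selectable but invisible text (PDF text render mode 3).
-    pdf_canvas._code.append('3 Tr')
-    for elem in text_elements:
-        x, y = elem['x'], page_h - elem['y']  # flip Y: top-left to bottom-left origin
-        pdf_canvas.drawString(x, y, elem['text'])
-    pdf_canvas._code.append('0 Tr')  # restore normal rendering
-```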
- -#### Scenario: Render Direct Track PDF with background image -- **WHEN** generating Layout PDF for a Direct Track document -- **THEN** the system SHALL render each source PDF page as a full-page background image at 2x resolution -- **AND** overlay invisible text elements using PDF Text Rendering Mode 3 -- **AND** the invisible text SHALL be positioned at original coordinates for accurate selection - -#### Scenario: Handle Office documents (PPT, DOC, XLS) -- **WHEN** processing an Office document converted to PDF -- **THEN** the system SHALL use the same background image + invisible text approach -- **AND** preserve all visual elements including vector graphics, gradients, and complex layouts -- **AND** the converted PDF in result directory SHALL be used as background source - -#### Scenario: Handle native editable PDFs -- **WHEN** processing a native PDF through Direct Track -- **THEN** the system SHALL use the source PDF for background rendering -- **AND** apply the same invisible text overlay approach -- **AND** chart regions SHALL be excluded from the text layer - -### Requirement: Chart Region Text Exclusion - -The system SHALL exclude text elements within chart regions from the invisible text layer to prevent duplicate content and unnecessary translation. - -#### Scenario: Detect chart regions -- **WHEN** classifying page elements for Direct Track -- **THEN** the system SHALL identify elements with type CHART -- **AND** add chart bounding boxes to regions_to_avoid list - -#### Scenario: Exclude chart-internal text from invisible layer -- **WHEN** rendering invisible text layer -- **THEN** the system SHALL skip text elements whose bounding boxes overlap with chart regions -- **AND** chart axis labels, legends, and data labels SHALL NOT be in the invisible text layer -- **AND** these texts remain visible in the background image - -#### Scenario: Chart text not available for translation -- **WHEN** extracting text for translation from a Direct Track document -- **THEN** chart-internal text SHALL NOT be included in translatable elements -- **AND** this is expected behavior as chart labels typically don't require translation diff --git a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/result-export/spec.md b/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/result-export/spec.md deleted file mode 100644 index 3999ba7..0000000 --- a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/result-export/spec.md +++ /dev/null @@ -1,36 +0,0 @@ -## MODIFIED Requirements - -### Requirement: Enhanced PDF Export with Layout Preservation - -The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. For Direct Track, a background image rendering approach SHALL be used for visual fidelity. 
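-
-The coordinate handling this requirement calls for reduces to one axis flip plus scaling; a minimal sketch (the function and parameter names are illustrative, and the transform matches the coordinate scenario below):
-
-```python
-def ocr_bbox_to_pdf(bbox, ocr_page_height, scale):
-    """bbox = [x0, y0, x1, y1] in OCR pixels (top-left origin); returns PDF points."""
-    x0, y0, x1, y1 = bbox
-    # ReportLab measures Y from the bottom edge, OCR from the top edge,
-    # so the element's bottom (y1) becomes its distance from the page bottom.
-    pdf_x = x0 * scale
-    pdf_y = (ocr_page_height - y1) * scale
-    return pdf_x, pdf_y, (x1 - x0) * scale, (y1 - y0) * scale
-```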
- -#### Scenario: Export PDF from direct extraction track -- **WHEN** exporting PDF from a direct-extraction processed document -- **THEN** the system SHALL render source PDF pages as full-page background images at 2x resolution -- **AND** overlay invisible text elements using PDF Text Rendering Mode 3 -- **AND** text SHALL remain selectable and searchable despite being invisible -- **AND** visual output SHALL match source document exactly - -#### Scenario: Export PDF from OCR track with full structure -- **WHEN** exporting PDF from OCR-processed document -- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types -- **AND** render tables with proper cell boundaries -- **AND** maintain reading order from parsing_res_list - -#### Scenario: Handle coordinate transformations correctly -- **WHEN** generating PDF from UnifiedDocument -- **THEN** system SHALL use explicit page dimensions from OCR results (not inferred from bounding boxes) -- **AND** correctly transform Y-axis coordinates from top-left (OCR) to bottom-left (PDF/ReportLab) origin -- **AND** prevent vertical flipping or position misalignment errors - -#### Scenario: Direct Track PDF file size increase -- **WHEN** generating Layout PDF for Direct Track documents -- **THEN** the system SHALL accept increased file size due to embedded page images -- **AND** approximately 1-2 MB per page at 2x resolution is expected -- **AND** this trade-off is accepted for improved visual fidelity - -#### Scenario: Chart elements excluded from text layer -- **WHEN** generating Layout PDF containing charts -- **THEN** the system SHALL NOT include chart-internal text in the invisible text layer -- **AND** chart visuals SHALL be preserved in the background image -- **AND** chart text SHALL NOT be available for text selection or translation diff --git a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/translation/spec.md b/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/translation/spec.md deleted file mode 100644 index 712bce9..0000000 --- a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/specs/translation/spec.md +++ /dev/null @@ -1,46 +0,0 @@ -## ADDED Requirements - -### Requirement: Translation Output as Reflow PDF - -The system SHALL generate translated documents as reflow-layout PDFs with real visible text, separate from the Layout PDF which uses background images. 
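-
-A minimal sketch of one way to realize this with ReportLab Platypus; the function name and the input shape are assumptions, the point being the PageBreak per source page that the scenarios below require:
-
-```python
-from reportlab.lib.pagesizes import A4
-from reportlab.lib.styles import getSampleStyleSheet
-from reportlab.platypus import PageBreak, Paragraph, SimpleDocTemplate, Spacer
-
-def build_translated_pdf(path, pages):
-    """pages: one list of translated text blocks per source page."""
-    styles = getSampleStyleSheet()
-    story = []
-    for i, blocks in enumerate(pages):
-        for text in blocks:
-            story.append(Paragraph(text, styles["Normal"]))  # real, visible text
-            story.append(Spacer(1, 6))
-        if i < len(pages) - 1:
-            story.append(PageBreak())  # keep the original page boundaries
-    SimpleDocTemplate(path, pagesize=A4).build(story)
-```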
- -#### Scenario: Generate translated PDF with reflow layout -- **WHEN** translation is completed for a document -- **THEN** the system SHALL generate a new PDF with translated text -- **AND** the translated PDF SHALL use reflow layout (not background image) -- **AND** text SHALL be real visible text, not invisible overlay -- **AND** page breaks SHALL correspond to original document pages - -#### Scenario: Maintain page correspondence in translated output -- **WHEN** generating translated PDF -- **THEN** content from original page 1 SHALL appear in translated page 1 -- **AND** content from original page 2 SHALL appear in translated page 2 -- **AND** each page may have different content length but maintains page boundaries - -#### Scenario: Chart text excluded from translation -- **WHEN** extracting text for translation from Direct Track documents -- **THEN** text elements within chart regions SHALL NOT be included -- **AND** chart labels, axis text, and legends SHALL remain untranslated -- **AND** this is expected behavior documented for users - -### Requirement: Dual PDF Output Concept - -The system SHALL maintain clear separation between Layout PDF (preview) and Translated PDF (output). - -#### Scenario: Layout PDF for preview -- **WHEN** user views a processed document before translation -- **THEN** the Layout PDF SHALL be displayed -- **AND** Layout PDF preserves exact visual appearance of source -- **AND** text is invisible overlay for extraction purposes only - -#### Scenario: Translated PDF for final output -- **WHEN** user requests translated document -- **THEN** the Translated PDF SHALL be generated -- **AND** Translated PDF uses reflow layout with visible translated text -- **AND** original visual styling is not preserved (text-focused output) - -#### Scenario: Both PDFs available after translation -- **WHEN** translation is completed -- **THEN** both Layout PDF and Translated PDF SHALL be available for download -- **AND** user can choose which version to download -- **AND** Layout PDF remains unchanged after translation diff --git a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/tasks.md b/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/tasks.md deleted file mode 100644 index 72a05e7..0000000 --- a/openspec/changes/archive/2025-12-11-unify-direct-track-pdf-rendering/tasks.md +++ /dev/null @@ -1,78 +0,0 @@ -# Tasks: Unify Direct Track PDF Rendering - -## 1. 
Backend - PDF Generator Service - -- [x] 1.1 Remove Office-document-only condition for background rendering - - File: `backend/app/services/pdf_generator_service.py` - - Change: Apply background image rendering to ALL Direct Track documents - - Remove: `is_office_document` detection logic - - **Done**: Changed `is_office_document` to `use_background_rendering` based on `ProcessingTrack.DIRECT` - -- [x] 1.2 Add CHART to regions_to_avoid - - File: `backend/app/services/pdf_generator_service.py` - - Change: Include `ElementType.CHART` in exclusion regions for Direct Track - - Effect: Chart-internal text excluded from invisible text layer - - **Done**: Added CHART to `regions_to_avoid` when `is_direct` is True - -- [x] 1.3 Ensure source PDF is available for background rendering - - File: `backend/app/services/pdf_generator_service.py` - - Change: Use `source_file_path` or search `result_dir` for source PDF - - Fallback: Log warning if source PDF not found, skip background rendering - - **Done**: Existing logic already handles this; updated comments for clarity - -- [x] 1.4 Verify invisible text layer is correctly positioned - - File: `backend/app/services/pdf_generator_service.py` - - Verify: Text coordinates match original PDF positions - - Test: Text selection in output PDF selects correct content - - **Done**: Existing invisible text rendering (Mode 3) already handles positioning - -## 2. Backend - Testing - -- [x] 2.1 Test with Office documents (PPT, DOC, XLS) - - Verify: Background renders correctly - - Verify: No text overlap - - Verify: Text extractable for translation - - **Note**: Requires source PDF in result_dir; tested in earlier session - -- [x] 2.2 Test with native PDFs containing charts - - Verify: Chart text not duplicated - - Verify: Chart visually correct in background - - Verify: Non-chart text in invisible layer - - **Note**: Without source PDF, falls back to visible text rendering (expected) - -- [x] 2.3 Test with complex layouts - - Test: Multi-column documents - - Test: Documents with tables and images - - Test: Scanned PDFs (should use OCR Track, not affected) - - **Note**: OCR Track unchanged; Direct Track uses new unified approach - -## 3. Frontend - Verification - -- [x] 3.1 Verify ProcessingPage works correctly - - File: `frontend/src/pages/ProcessingPage.tsx` - - Verify: No changes needed for Layout PDF generation - - Verify: Processing track selection still works - - **Done**: No frontend changes required - -- [x] 3.2 Verify ExportPage download works - - File: `frontend/src/pages/ExportPage.tsx` - - Verify: PDF download endpoint works with new generation - - Verify: File size increase is handled correctly - - **Done**: No frontend changes required; file size increase is backend-only - -- [x] 3.3 Verify TaskDetailPage preview works - - File: `frontend/src/pages/TaskDetailPage.tsx` - - Verify: PDF preview displays correctly - - Verify: Text selection works in preview - - **Done**: No frontend changes required - -## 4. 
Documentation - -- [x] 4.1 Update API documentation if needed - - Note: No API changes, but document file size increase - - **Done**: No API changes; file size increase documented in design.md - -- [x] 4.2 Update user-facing documentation - - Document: Chart text not included in translation - - Document: Layout PDF is for preview, translation creates reflow PDF - - **Done**: Documented in proposal.md and design.md diff --git a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/design.md b/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/design.md deleted file mode 100644 index 84ca1bd..0000000 --- a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/design.md +++ /dev/null @@ -1,234 +0,0 @@ -# Design: cell_boxes-First Table Rendering - -## Architecture Overview - -``` -┌─────────────────────────────────────────────────────────────────┐ -│                    Table Rendering Pipeline                      │ -├─────────────────────────────────────────────────────────────────┤ -│                                                                 │ -│  Input: table_element                                           │ -│    ├── cell_boxes: [[x0,y0,x1,y1], ...]    (from PP-StructureV3)│ -│    ├── html: "<table>...</table>
" (from PP-StructureV3)│ -│ └── bbox: [x0, y0, x1, y1] (table boundary) │ -│ │ -│ ┌────────────────────────────────────────────────────────────┐ │ -│ │ Step 1: Grid Inference from cell_boxes │ │ -│ │ │ │ -│ │ cell_boxes → cluster by Y → rows │ │ -│ │ → cluster by X → cols │ │ -│ │ → build grid[row][col] = cell_bbox │ │ -│ └────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌────────────────────────────────────────────────────────────┐ │ -│ │ Step 2: Content Extraction from HTML │ │ -│ │ │ │ -│ │ html → parse → extract text list in reading order │ │ -│ │ → flatten colspan/rowspan → [text1, text2, ...] │ │ -│ └────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌────────────────────────────────────────────────────────────┐ │ -│ │ Step 3: Content-to-Cell Mapping │ │ -│ │ │ │ -│ │ Option A: Sequential assignment (text[i] → cell[i]) │ │ -│ │ Option B: Coordinate matching (text_bbox ∩ cell_bbox) │ │ -│ │ Option C: Row-by-row assignment │ │ -│ └────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌────────────────────────────────────────────────────────────┐ │ -│ │ Step 4: PDF Rendering │ │ -│ │ │ │ -│ │ For each cell in grid: │ │ -│ │ 1. Draw cell border at cell_bbox coordinates │ │ -│ │ 2. Render text content inside cell │ │ -│ └────────────────────────────────────────────────────────────┘ │ -│ │ -│ Output: Table rendered in PDF with accurate cell boundaries │ -└─────────────────────────────────────────────────────────────────┘ -``` - -## Detailed Design - -### 1. Grid Inference Algorithm - -```python -def infer_grid_from_cellboxes(cell_boxes: List[List[float]], threshold: float = 15.0): - """ - Infer row/column grid structure from cell_boxes coordinates. - - Args: - cell_boxes: List of [x0, y0, x1, y1] coordinates - threshold: Clustering threshold for row/column grouping - - Returns: - grid: Dict[Tuple[int,int], Dict] mapping (row, col) to cell info - row_heights: List of row heights - col_widths: List of column widths - """ - # 1. Extract all Y-centers and X-centers - y_centers = [(cb[1] + cb[3]) / 2 for cb in cell_boxes] - x_centers = [(cb[0] + cb[2]) / 2 for cb in cell_boxes] - - # 2. Cluster Y-centers into rows - rows = cluster_values(y_centers, threshold) # Returns sorted list of row indices - - # 3. Cluster X-centers into columns - cols = cluster_values(x_centers, threshold) # Returns sorted list of col indices - - # 4. Assign each cell_box to (row, col) - grid = {} - for i, cb in enumerate(cell_boxes): - row = find_cluster(y_centers[i], rows) - col = find_cluster(x_centers[i], cols) - grid[(row, col)] = { - 'bbox': cb, - 'index': i - } - - # 5. Calculate actual widths/heights from boundaries - row_heights = [rows[i+1] - rows[i] for i in range(len(rows)-1)] - col_widths = [cols[i+1] - cols[i] for i in range(len(cols)-1)] - - return grid, row_heights, col_widths -``` - -### 2. Content Extraction - -The HTML content extraction should handle colspan/rowspan by flattening: - -```python -def extract_cell_contents(html: str) -> List[str]: - """ - Extract cell text contents from HTML in reading order. - Expands colspan/rowspan into repeated empty strings. 
- - Returns: - List of text strings, one per logical cell position - """ - parser = HTMLTableParser() - parser.feed(html) - - contents = [] - for row in parser.tables[0]['rows']: - for cell in row['cells']: - contents.append(cell['text']) - # For colspan > 1, add empty strings for merged cells - for _ in range(cell.get('colspan', 1) - 1): - contents.append('') - - return contents -``` - -### 3. Content-to-Cell Mapping Strategy - -**Recommended: Row-by-row Sequential Assignment** - -Since HTML content is in reading order (top-to-bottom, left-to-right), map content to grid cells in the same order: - -```python -def map_content_to_grid(grid, contents, num_rows, num_cols): - """ - Map extracted content to grid cells row by row. - """ - content_idx = 0 - for row in range(num_rows): - for col in range(num_cols): - if (row, col) in grid: - if content_idx < len(contents): - grid[(row, col)]['content'] = contents[content_idx] - content_idx += 1 - else: - grid[(row, col)]['content'] = '' - - return grid -``` - -### 4. PDF Rendering Integration - -Modify `pdf_generator_service.py` to use cell_boxes-first path: - -```python -def draw_table_region(self, ...): - cell_boxes = table_element.get('cell_boxes', []) - html_content = table_element.get('content', '') - - if cell_boxes and settings.table_rendering_prefer_cellboxes: - # Try cell_boxes-first approach - grid, row_heights, col_widths = infer_grid_from_cellboxes(cell_boxes) - - if grid: - # Extract content from HTML - contents = extract_cell_contents(html_content) - - # Map content to grid - grid = map_content_to_grid(grid, contents, len(row_heights), len(col_widths)) - - # Render using cell_boxes coordinates - success = self._render_table_from_grid( - pdf_canvas, grid, row_heights, col_widths, - page_height, scale_w, scale_h - ) - - if success: - return # Done - - # Fallback to existing HTML-based rendering - self._render_table_from_html(...) -``` - -## Configuration - -```python -# config.py -class Settings: - # Table rendering strategy - table_rendering_prefer_cellboxes: bool = Field( - default=True, - description="Use cell_boxes coordinates as primary table structure source" - ) - - table_cellboxes_row_threshold: float = Field( - default=15.0, - description="Y-coordinate threshold for row clustering" - ) - - table_cellboxes_col_threshold: float = Field( - default=15.0, - description="X-coordinate threshold for column clustering" - ) -``` - -## Edge Cases - -### 1. Empty cell_boxes -- **Condition**: `cell_boxes` is empty or None -- **Action**: Fall back to HTML-based rendering - -### 2. Content Count Mismatch -- **Condition**: HTML has more/fewer cells than cell_boxes grid -- **Action**: Fill available cells, leave extras empty, log warning - -### 3. Overlapping cell_boxes -- **Condition**: Multiple cell_boxes map to same grid position -- **Action**: Use first one, log warning - -### 4. Single-cell Tables -- **Condition**: Only 1 cell_box detected -- **Action**: Render as single-cell table (valid case) - -## Testing Plan - -1. **Unit Tests** - - `test_infer_grid_from_cellboxes`: Various cell_box configurations - - `test_content_mapping`: Content assignment scenarios - -2. **Integration Tests** - - `test_scan_pdf_table_7`: Verify the problematic table renders correctly - - `test_existing_tables`: No regression on previously working tables - -3. 
**Visual Verification** - - Compare PDF output before/after for `scan.pdf` - - Check table alignment and text placement diff --git a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/proposal.md b/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/proposal.md deleted file mode 100644 index d7fccd0..0000000 --- a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/proposal.md +++ /dev/null @@ -1,75 +0,0 @@ -# Proposal: Use cell_boxes as Primary Table Rendering Source - -## Summary - -Modify table PDF rendering to use `cell_boxes` coordinates as the primary source for table structure instead of relying on HTML table parsing. This resolves grid mismatch issues where PP-StructureV3's HTML structure (with colspan/rowspan) doesn't match the cell_boxes coordinate grid. - -## Problem Statement - -### Current Issue - -When processing `scan.pdf`, PP-StructureV3 detected tables with the following characteristics: - -**Table 7 (Element 7)**: -- `cell_boxes`: 27 cells forming an 11x10 grid (by coordinate clustering) -- HTML structure: 9 rows with irregular columns `[7, 7, 1, 3, 3, 3, 3, 3, 1]` due to colspan - -This **grid mismatch** causes: -1. `_compute_table_grid_from_cell_boxes()` returns `None, None` -2. PDF generator falls back to ReportLab Table with equal column distribution -3. Table renders with incorrect column widths, causing visual misalignment - -### Root Cause - -PP-StructureV3 sometimes merges multiple visual tables into one large table region: -- The cell_boxes accurately detect individual cell boundaries -- The HTML uses colspan to represent merged cells, but the grid doesn't match cell_boxes -- Current logic requires exact grid match, which fails for complex merged tables - -## Proposed Solution - -### Strategy: cell_boxes-First Rendering - -Instead of requiring HTML grid to match cell_boxes, **use cell_boxes directly** as the authoritative source for cell boundaries: - -1. **Grid Inference from cell_boxes** - - Cluster cell_boxes by Y-coordinate to determine rows - - Cluster cell_boxes by X-coordinate to determine columns - - Build a row×col grid map from cell_boxes positions - -2. **Content Assignment from HTML** - - Extract text content from HTML in reading order - - Map text content to cell_boxes positions using coordinate matching - - Handle cases where HTML has fewer/more cells than cell_boxes - -3. **Direct PDF Rendering** - - Render table borders using cell_boxes coordinates (already implemented) - - Place text content at calculated cell positions - - Skip ReportLab Table parsing when cell_boxes grid is valid - -### Key Changes - -| Component | Change | -|-----------|--------| -| `pdf_generator_service.py` | Add cell_boxes-first rendering path | -| `table_content_rebuilder.py` | Enhance to support grid-based content mapping | -| `config.py` | Add `table_rendering_prefer_cellboxes: bool` setting | - -## Benefits - -1. **Accurate Table Borders**: cell_boxes from ML detection are more precise than HTML parsing -2. **Handles Grid Mismatch**: Works even when HTML colspan/rowspan don't match cell count -3. **Consistent Output**: Same rendering logic regardless of HTML complexity -4. **Backward Compatible**: Existing HTML-based rendering remains as fallback - -## Non-Goals - -- Not modifying PP-StructureV3 detection logic -- Not implementing table splitting (separate proposal if needed) -- Not changing Direct track (PyMuPDF) table extraction - -## Success Criteria - -1. 
`scan.pdf` Table 7 renders with correct column widths based on cell_boxes -2. All existing table tests continue to pass -3. No regression for tables where HTML grid matches cell_boxes diff --git a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/specs/document-processing/spec.md b/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/specs/document-processing/spec.md deleted file mode 100644 index bd61117..0000000 --- a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/specs/document-processing/spec.md +++ /dev/null @@ -1,36 +0,0 @@ -# document-processing Specification Delta - -## MODIFIED Requirements - -### Requirement: Extract table structure (Modified) - -The system SHALL use cell_boxes coordinates as the primary source for table structure when rendering PDFs, with HTML parsing as fallback. - -#### Scenario: Render table using cell_boxes grid -- **WHEN** rendering a table element to PDF -- **AND** the table has valid cell_boxes coordinates -- **AND** `table_rendering_prefer_cellboxes` is enabled -- **THEN** the system SHALL infer row/column grid from cell_boxes coordinates -- **AND** extract text content from HTML in reading order -- **AND** map content to grid cells by position -- **AND** render table borders using cell_boxes coordinates -- **AND** place text content within calculated cell boundaries - -#### Scenario: Handle cell_boxes grid mismatch gracefully -- **WHEN** cell_boxes grid has different dimensions than HTML colspan/rowspan structure -- **THEN** the system SHALL use cell_boxes grid as authoritative structure -- **AND** map available HTML content to cells row-by-row -- **AND** leave unmapped cells empty -- **AND** log warning if content count differs significantly - -#### Scenario: Fallback to HTML-based rendering -- **WHEN** cell_boxes is empty or None -- **OR** `table_rendering_prefer_cellboxes` is disabled -- **OR** cell_boxes grid inference fails -- **THEN** the system SHALL fall back to existing HTML-based table rendering -- **AND** use ReportLab Table with parsed HTML structure - -#### Scenario: Maintain backward compatibility -- **WHEN** processing tables where cell_boxes grid matches HTML structure -- **THEN** the system SHALL produce identical output to previous behavior -- **AND** pass all existing table rendering tests diff --git a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/tasks.md b/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/tasks.md deleted file mode 100644 index f73c7d1..0000000 --- a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/tasks.md +++ /dev/null @@ -1,48 +0,0 @@ -## 1. Core Algorithm Implementation - -### 1.1 Grid Inference Module -- [x] 1.1.1 Create `CellBoxGridInferrer` class in `pdf_table_renderer.py` -- [x] 1.1.2 Implement `cluster_values()` for Y/X coordinate clustering -- [x] 1.1.3 Implement `infer_grid_from_cellboxes()` main method -- [x] 1.1.4 Add row_heights and col_widths calculation - -### 1.2 Content Mapping -- [x] 1.2.1 Implement `extract_cell_contents()` from HTML -- [x] 1.2.2 Implement `map_content_to_grid()` for row-by-row assignment -- [x] 1.2.3 Handle content count mismatch (more/fewer cells) - -## 2. 
PDF Generator Integration - -### 2.1 New Rendering Path -- [x] 2.1.1 Add `render_from_cellboxes_grid()` method to TableRenderer -- [x] 2.1.2 Integrate into `draw_table_region()` with cellboxes-first check -- [x] 2.1.3 Maintain fallback to existing HTML-based rendering - -### 2.2 Cell Rendering -- [x] 2.2.1 Draw cell borders using cell_boxes coordinates -- [x] 2.2.2 Render text content with proper alignment and padding -- [x] 2.2.3 Handle multi-line text within cells - -## 3. Configuration - -### 3.1 Settings -- [x] 3.1.1 Add `table_rendering_prefer_cellboxes: bool = True` -- [x] 3.1.2 Add `table_cellboxes_row_threshold: float = 15.0` -- [x] 3.1.3 Add `table_cellboxes_col_threshold: float = 15.0` - -## 4. Testing - -### 4.1 Unit Tests -- [x] 4.1.1 Test grid inference with various cell_box configurations -- [x] 4.1.2 Test content mapping edge cases -- [x] 4.1.3 Test coordinate clustering accuracy - -### 4.2 Integration Tests -- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case) -- [ ] 4.2.2 Verify no regression on existing table tests -- [ ] 4.2.3 Visual comparison of output PDFs - -## 5. Documentation - -- [x] 5.1 Update inline code comments -- [x] 5.2 Update spec with new table rendering requirement diff --git a/openspec/changes/archive/2025-12-12-add-batch-processing/proposal.md b/openspec/changes/archive/2025-12-12-add-batch-processing/proposal.md deleted file mode 100644 index 13acf31..0000000 --- a/openspec/changes/archive/2025-12-12-add-batch-processing/proposal.md +++ /dev/null @@ -1,43 +0,0 @@ -# Change: Add Batch Processing - -## Why - -The system currently supports uploading multiple files in one batch, but processing still requires the user to open each task and start it individually. For high-volume document scenarios this is very inconvenient. We need a batch processing feature so users can configure and start all uploaded tasks at once. - -## What Changes - -### 1. Batch state management -- Extend taskStore to track batch tasks -- Add batch progress state (total, completed, processing, failed) -- Store the shared batch settings - -### 2. Batch processing logic -- After upload completes, analyze all files to decide each processing track -- Route processing by track type: - - Direct Track: up to 5 in parallel (CPU-bound) - - OCR Track: single queue (GPU VRAM limit) -- The two kinds of tasks can run at the same time - -### 3. Batch settings UI -- Modify ProcessingPage to support multi-task mode -- Unified settings interface: - - Processing strategy (auto-detect / force OCR / force Direct) - - Layout Model (OCR only) - - Preprocessing mode (OCR only) -- Batch progress display (overall progress + per-task status) - -### 4. Processing strategies -- **Auto-detect** (recommended): the system analyzes each file and selects the best track automatically -- **All OCR**: force every file through the OCR track -- **All Direct**: force every PDF through the Direct track - -## Impact - -- Affected specs: frontend-ui (modified) -- Affected code: - - `frontend/src/store/taskStore.ts` - extend batch state - - `frontend/src/pages/ProcessingPage.tsx` - support multi-task processing - - `frontend/src/pages/UploadPage.tsx` - pass multiple task IDs - - `frontend/src/services/apiV2.ts` - add batch-processing helper functions - - `frontend/src/i18n/locales/*.json` - add translations -- No backend changes required (uses existing APIs) diff --git a/openspec/changes/archive/2025-12-12-add-batch-processing/specs/frontend-ui/spec.md b/openspec/changes/archive/2025-12-12-add-batch-processing/specs/frontend-ui/spec.md deleted file mode 100644 index 1fa95fe..0000000 --- a/openspec/changes/archive/2025-12-12-add-batch-processing/specs/frontend-ui/spec.md +++ /dev/null @@ -1,100 +0,0 @@ -# Frontend UI Specification - Batch Processing - -## ADDED Requirements - -### Requirement: Batch Processing Support - -The system SHALL support batch processing of multiple uploaded files with a single configuration. 
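-
-The track-dependent limits from the proposal above (Direct: up to 5 in parallel, OCR: strictly serial, both pools running together) amount to two semaphores. A sketch in Python for illustration only; per the proposal, the real orchestration lives in the frontend taskStore, and the names here are assumptions:
-
-```python
-import asyncio
-
-async def run_batch(tasks, start_fn):
-    direct_limit = asyncio.Semaphore(5)  # CPU-bound Direct track: up to 5 at once
-    ocr_limit = asyncio.Semaphore(1)     # GPU VRAM constraint: one OCR task at a time
-
-    async def run(task):
-        limit = ocr_limit if task["track"] == "ocr" else direct_limit
-        async with limit:
-            await start_fn(task["id"])
-
-    # Launch everything together; the semaphores enforce the per-track limits
-    # while letting Direct and OCR work overlap.
-    await asyncio.gather(*(run(t) for t in tasks))
-```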
-
-After uploading multiple files, the user SHALL be able to:
-- Configure processing settings once for all files
-- Start processing all files with one action
-- Monitor progress of all files in a unified view
-
-#### Scenario: Multiple files uploaded
-- **WHEN** user uploads multiple files
-- **AND** navigates to processing page
-- **THEN** the system displays batch processing mode
-- **AND** shows all pending tasks in a list
-
-#### Scenario: Batch configuration
-- **WHEN** user is in batch processing mode
-- **THEN** user can select a processing strategy (auto/OCR/Direct)
-- **AND** user can configure layout model for OCR tasks
-- **AND** user can configure preprocessing for OCR tasks
-- **AND** settings apply to all applicable tasks
-
----
-
-### Requirement: Batch Processing Strategy
-
-The system SHALL support three batch processing strategies:
-
-1. **Auto Detection** (default): System analyzes each file and selects the optimal track
-2. **Force OCR**: All files processed with OCR track
-3. **Force Direct**: All PDF files processed with Direct track
-
-#### Scenario: Auto detection strategy
-- **WHEN** user selects auto detection strategy
-- **THEN** the system analyzes each file before processing
-- **AND** assigns OCR or Direct track based on file characteristics
-
-#### Scenario: Force OCR strategy
-- **WHEN** user selects force OCR strategy
-- **THEN** all files are processed using OCR track
-- **AND** layout model and preprocessing settings are applied
-
-#### Scenario: Force Direct strategy
-- **WHEN** user selects force Direct strategy
-- **AND** file is a PDF
-- **THEN** the file is processed using Direct track
-
----
-
-### Requirement: Parallel Processing Limits
-
-The system SHALL enforce different parallelism limits based on processing track:
-
-- Direct Track: Maximum 5 concurrent tasks (CPU-based)
-- OCR Track: Maximum 1 concurrent task (GPU VRAM constraint)
-
-Direct and OCR tasks MAY run simultaneously as they use different resources.
-
-#### Scenario: Direct track parallelism
-- **WHEN** batch contains multiple Direct track tasks
-- **THEN** up to 5 tasks process concurrently
-- **AND** remaining tasks wait in queue
-
-#### Scenario: OCR track serialization
-- **WHEN** batch contains multiple OCR track tasks
-- **THEN** only 1 task processes at a time
-- **AND** remaining tasks wait in queue
-
-#### Scenario: Mixed track processing
-- **WHEN** batch contains both Direct and OCR tasks
-- **THEN** Direct tasks run in parallel pool (max 5)
-- **AND** OCR tasks run in serial queue (max 1)
-- **AND** both pools operate simultaneously
-
----
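The policy above is small enough to summarize in code. The real logic lives in the frontend (`frontend/src/store/taskStore.ts`); this asyncio sketch only illustrates the two-pool scheduling rule, with `process()` standing in for starting a task and polling it to completion:

```python
import asyncio

DIRECT_LIMIT = asyncio.Semaphore(5)  # Direct Track: CPU-bound, up to 5 at once
OCR_LIMIT = asyncio.Semaphore(1)     # OCR Track: GPU VRAM allows one at a time

async def process(task_id: str) -> None:
    """Stand-in for starting a task and polling it until done."""
    await asyncio.sleep(0.1)

async def run_task(task_id: str, track: str) -> None:
    limit = OCR_LIMIT if track == "ocr" else DIRECT_LIMIT
    async with limit:
        await process(task_id)

async def run_batch(tasks: list[tuple[str, str]]) -> None:
    # Both pools drain at the same time; only same-track tasks queue up.
    await asyncio.gather(*(run_task(tid, trk) for tid, trk in tasks))

# asyncio.run(run_batch([("t1", "direct"), ("t2", "ocr"), ("t3", "direct")]))
```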
-### Requirement: Batch Progress Display
-
-The system SHALL display unified progress for batch processing.
-
-Progress display SHALL include:
-- Overall progress (completed / total)
-- Count by status (processing, completed, failed)
-- Individual task status list
-- Estimated time remaining (optional)
-
-#### Scenario: Batch progress monitoring
-- **WHEN** batch processing is in progress
-- **THEN** user sees overall completion percentage
-- **AND** user sees count of tasks in each status
-- **AND** user sees status of each individual task
-
-#### Scenario: Batch completion
-- **WHEN** all tasks in batch are completed or failed
-- **THEN** user sees final summary
-- **AND** user can navigate to results page
diff --git a/openspec/changes/archive/2025-12-12-add-batch-processing/tasks.md b/openspec/changes/archive/2025-12-12-add-batch-processing/tasks.md
deleted file mode 100644
index 8f6af21..0000000
--- a/openspec/changes/archive/2025-12-12-add-batch-processing/tasks.md
+++ /dev/null
@@ -1,42 +0,0 @@
-# Tasks: Add Batch Processing
-
-## 1. Batch state management
-
-- [x] 1.1 Extend taskStore with a batch state interface (BatchState)
-- [x] 1.2 Implement batch task tracking (taskIds, taskStates)
-- [x] 1.3 Implement batch progress calculation (total, completed, processing, failed)
-- [x] 1.4 Implement batch settings storage (processingOptions)
-
-## 2. Batch processing logic
-
-- [x] 2.1 Add a batch analysis function (analyze all tasks to decide each track)
-- [x] 2.2 Implement Direct Track parallel processing (up to 5 concurrent)
-- [x] 2.3 Implement OCR Track queue processing (single queue)
-- [x] 2.4 Implement mixed-mode processing (Direct and OCR running together)
-- [x] 2.5 Implement task status polling and updates
-
-## 3. Upload page adjustments
-
-- [x] 3.1 Store all taskIds in UploadPage once uploads complete
-- [x] 3.2 Pass a batch-mode flag when navigating to ProcessingPage
-
-## 4. Processing page refactor
-
-- [x] 4.1 Modify ProcessingPage to support batch mode
-- [x] 4.2 Add a batch settings block (strategy selection, shared settings)
-- [x] 4.3 Add a batch progress display component
-- [x] 4.4 Add a task list view (per-task status)
-- [x] 4.5 Implement the batch start button
-
-## 5. i18n translations
-
-- [x] 5.1 Add Chinese translations for batch processing
-- [x] 5.2 Add English translations for batch processing
-
-## 6. Testing and verification
-
-- [x] 6.1 Test single-file processing (backward compatibility)
-- [x] 6.2 Test multi-file Direct Track parallelism
-- [x] 6.3 Test multi-file OCR Track queueing
-- [x] 6.4 Test mixed-mode processing
-- [x] 6.5 Verify the TypeScript build passes
diff --git a/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/proposal.md b/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/proposal.md
deleted file mode 100644
index 6f949f9..0000000
--- a/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/proposal.md
+++ /dev/null
@@ -1,51 +0,0 @@
-# Change: Fix OCR Track Reflow PDF
-
-## Why
-
-The OCR Track reflow PDF generation is missing most content because:
-
-1. PP-StructureV3 extracts tables as elements but stores `content: ""` (an empty string) instead of structured `content.cells` data
-2. The `generate_reflow_pdf` method expects `content.cells` for tables, so tables are skipped
-3. Table text exists in `raw_ocr_regions.json` (59 text blocks) but is not used by reflow PDF generation
-4. This causes significant content loss - only 6 text elements vs 59 raw OCR regions
-
-The Layout PDF works correctly because it uses `raw_ocr_regions.json` via Simple Text Positioning mode, bypassing the need for structured table data.
-
-## What Changes
-
-### Reflow PDF Generation for OCR Track
-
-Modify `generate_reflow_pdf` to use `raw_ocr_regions.json` as the primary text source for OCR Track documents:
-
-1. **Detect processing track** from JSON metadata
-2. **For OCR Track**: Load `raw_ocr_regions.json` and render all text blocks in reading order
-3. **For Direct Track**: Continue using `content.cells` for tables (already works)
-4. **Images/Charts**: Continue using `content.saved_path` from elements (works for both tracks)
-
-### Data Flow
-
-**OCR Track Reflow PDF (NEW):**
-```
-raw_ocr_regions.json (59 text blocks)
-  + scan_result.json (images/charts only)
-  → Sort by Y coordinate (reading order)
-  → Render text paragraphs + images
-```
-
-**Direct Track Reflow PDF (UNCHANGED):**
-```
-*_result.json (elements with content.cells)
-  → Render tables, text, images in order
-```
-
-## Impact
-
-- **Affected file**: `backend/app/services/pdf_generator_service.py`
-- **User experience**: OCR Track reflow PDF will contain all text content (matching the Layout PDF)
-- **Translation**: The translated reflow PDF will also work correctly for OCR Track
-
-## Migration
-
-- No data migration required
-- Existing `raw_ocr_regions.json` files contain all necessary data
-- No API changes
diff --git a/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/specs/result-export/spec.md b/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/specs/result-export/spec.md
deleted file mode 100644
index 5d50832..0000000
--- a/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/specs/result-export/spec.md
+++ /dev/null
@@ -1,23 +0,0 @@
-## MODIFIED Requirements
-
-### Requirement: Enhanced PDF Export with Layout Preservation
-
-The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. For Direct Track, a background image rendering approach SHALL be used for visual fidelity.
-
-#### Scenario: OCR Track reflow PDF uses raw OCR regions
-- **WHEN** generating reflow PDF for an OCR Track document
-- **THEN** the system SHALL load text content from `raw_ocr_regions.json` files
-- **AND** text blocks SHALL be sorted by Y coordinate for reading order
-- **AND** all text content SHALL match the Layout PDF output
-- **AND** images and charts SHALL be embedded from element `saved_path`
-
-#### Scenario: Direct Track reflow PDF uses structured content
-- **WHEN** generating reflow PDF for a Direct Track document
-- **THEN** the system SHALL use `content.cells` for table rendering
-- **AND** text elements SHALL use the `content` string directly
-- **AND** images and charts SHALL be embedded from element `saved_path`
-
-#### Scenario: Reflow PDF content consistency
-- **WHEN** comparing Layout PDF and Reflow PDF for the same document
-- **THEN** both PDFs SHALL contain the same text content
-- **AND** only the presentation format SHALL differ (positioned vs flowing)
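A minimal sketch of the two OCR-Track steps above: locate the per-page raw-OCR file using the glob pattern from the tasks list, then sort regions into reading order. `bbox` is assumed to be `[x0, y0, x1, y1]` in page coordinates; the service's actual helpers may differ:

```python
import json
from pathlib import Path

def load_raw_ocr_regions(result_dir: Path, task_id: str, page_num: int) -> list[dict]:
    """Load the per-page raw OCR regions, returning [] when the file is missing."""
    matches = sorted(result_dir.glob(
        f"{task_id}_*_page_{page_num}_raw_ocr_regions.json"))
    if not matches:
        return []  # caller falls back to element-based rendering
    return json.loads(matches[0].read_text(encoding="utf-8"))

def reading_order(regions: list[dict]) -> list[dict]:
    """Sort text blocks top-to-bottom, then left-to-right, by bbox."""
    return sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))
```

Returning an empty list on a missing file keeps the fallback behavior from task 2.1: pages without a raw-OCR file render from elements instead.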
diff --git a/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/tasks.md b/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/tasks.md
deleted file mode 100644
index de52831..0000000
--- a/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/tasks.md
+++ /dev/null
@@ -1,51 +0,0 @@
-# Tasks: Fix OCR Track Reflow PDF
-
-## 1. Modify generate_reflow_pdf Method
-
-- [x] 1.1 Add processing track detection
-  - File: `backend/app/services/pdf_generator_service.py`
-  - Location: `generate_reflow_pdf` method (line ~4704)
-  - Read `metadata.processing_track` from JSON data
-  - Branch logic based on track type
-
-- [x] 1.2 Add helper function to load raw OCR regions
-  - File: `backend/app/services/pdf_generator_service.py`
-  - Using existing: `load_raw_ocr_regions` from `text_region_renderer.py`
-  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
-  - Return: List of text regions with bbox and content
-
-- [x] 1.3 Implement OCR Track reflow rendering
-  - File: `backend/app/services/pdf_generator_service.py`
-  - For OCR Track: Load raw OCR regions per page
-  - Sort text blocks by Y coordinate (top-to-bottom reading order)
-  - Render text blocks as paragraphs
-  - Still render images/charts from elements
-
-- [x] 1.4 Keep Direct Track logic unchanged
-  - File: `backend/app/services/pdf_generator_service.py`
-  - Direct Track continues using `content.cells` for tables
-  - Extracted to `_render_reflow_elements` helper method
-  - No changes to existing Direct Track flow
-
-## 2. Handle Multi-page Documents
-
-- [x] 2.1 Support per-page raw OCR files
-  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
-  - Iterate through pages and load the corresponding raw OCR file
-  - Handle missing files gracefully (fall back to elements)
-
-## 3. Testing
-
-- [x] 3.1 Test OCR Track reflow PDF
-  - Test with: `a9259180-fc49-4890-8184-2e6d5f4edad3` (scan document)
-  - Verify: All 59 text blocks appear in the reflow PDF
-  - Verify: Images are embedded correctly
-
-- [x] 3.2 Test Direct Track reflow PDF
-  - Test with: `1b32428d-0609-4cfd-bc52-56be6956ac2e` (editable PDF)
-  - Verify: Tables render with cells
-  - Verify: No regression from changes
-
-- [x] 3.3 Test translated reflow PDF
-  - Test: Complete a translation, then download the reflow PDF
-  - Verify: Translated text appears correctly
diff --git a/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/proposal.md b/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/proposal.md
deleted file mode 100644
index f4fae0a..0000000
--- a/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/proposal.md
+++ /dev/null
@@ -1,70 +0,0 @@
-# Change: Fix OCR Track Translation
-
-## Why
-
-OCR Track translation is missing most content because:
-
-1. The translation service (`extract_translatable_elements`) only processes elements from `scan_result.json`
-2. OCR Track tables have `content: ""` (an empty string) - no `content.cells` data
-3. All table text exists in `raw_ocr_regions.json` (59 text blocks), but the translation service ignores it
-4. Result: Only 6 text elements translated vs 59 raw OCR regions available
-
-**Current Data Flow (OCR Track):**
-```
-scan_result.json (10 elements, 6 text, 2 empty tables)
-  → Translation extracts 6 text items
-  → 53 text blocks in tables are NOT translated
-```
-
-**Expected Data Flow (OCR Track):**
-```
-raw_ocr_regions.json (59 text blocks)
-  → Translation extracts ALL 59 text items
-  → Complete translation coverage
-```
-
-## What Changes
-
-### 1. Translation Service Enhancement
-
-Modify `translate_document` in `translation_service.py` to:
-
-1. **Detect processing track** from result JSON metadata
-2. **For OCR Track**: Load and translate `raw_ocr_regions.json` instead of elements
-3. **For Direct Track**: Continue using elements with `content.cells` (already works)
-
-### 2.
Translation Result Format for OCR Track - -Add new field `raw_ocr_translations` to translation JSON for OCR Track: - -```json -{ - "translations": { ... }, // element-based (for Direct Track) - "raw_ocr_translations": [ // NEW: for OCR Track - { - "index": 0, - "original": "华天科技(宝鸡)有限公司", - "translated": "Huatian Technology (Baoji) Co., Ltd." - }, - ... - ] -} -``` - -### 3. Translated PDF Generation - -Modify `generate_translated_pdf` to use `raw_ocr_translations` when available for OCR Track documents. - -## Impact - -- **Affected files**: - - `backend/app/services/translation_service.py` - extraction and translation logic - - `backend/app/services/pdf_generator_service.py` - translated PDF rendering -- **User experience**: OCR Track translations will include ALL text content -- **API**: Translation JSON format extended (backward compatible) - -## Migration - -- No data migration required -- Existing translations continue to work (Direct Track unaffected) -- Re-translation needed for OCR Track documents to get full coverage diff --git a/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/specs/translation/spec.md b/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/specs/translation/spec.md deleted file mode 100644 index 08e6192..0000000 --- a/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/specs/translation/spec.md +++ /dev/null @@ -1,56 +0,0 @@ -# translation Specification Delta - -## MODIFIED Requirements - -### Requirement: Translation Content Extraction - -The translation service SHALL extract content based on processing track type. - -#### Scenario: OCR Track translation extraction -- **GIVEN** a document processed with OCR Track -- **AND** the result JSON has `metadata.processing_track = "ocr"` -- **WHEN** translation service extracts translatable content -- **THEN** it SHALL load `raw_ocr_regions.json` for each page -- **AND** it SHALL extract all text blocks from raw OCR regions -- **AND** it SHALL NOT rely on `content.cells` from table elements - -#### Scenario: Direct Track translation extraction (unchanged) -- **GIVEN** a document processed with Direct Track -- **AND** the result JSON has `metadata.processing_track = "direct"` or no track specified -- **WHEN** translation service extracts translatable content -- **THEN** it SHALL extract from `pages[].elements[]` in result JSON -- **AND** it SHALL extract table cell content from `content.cells` - -### Requirement: Translation Result Format - -The translation result JSON SHALL support both element-based and raw OCR translations. - -#### Scenario: OCR Track translation result format -- **GIVEN** an OCR Track document has been translated -- **WHEN** translation result is saved -- **THEN** the JSON SHALL include `raw_ocr_translations` array -- **AND** each item SHALL have `index`, `original`, and `translated` fields -- **AND** the `translations` object MAY be empty or contain header text translations - -#### Scenario: Direct Track translation result format (unchanged) -- **GIVEN** a Direct Track document has been translated -- **WHEN** translation result is saved -- **THEN** the JSON SHALL use `translations` object mapping element_id to translated text -- **AND** `raw_ocr_translations` field SHALL NOT be present - -### Requirement: Translated PDF Generation - -The translated PDF generation SHALL use appropriate translation source based on processing track. 
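The `raw_ocr_translations` format above maps cleanly onto a lookup keyed by (page, index). A minimal sketch, assuming regions are already in reading order and translation items carry the `page`, `index`, and `translated` fields; the helper name is illustrative, not the service's actual API:

```python
def apply_raw_ocr_translations(regions_by_page: dict[int, list[dict]],
                               raw_ocr_translations: list[dict]) -> dict[int, list[str]]:
    """Resolve each region's text, preferring the translation over the original.

    `regions_by_page` maps page number to that page's regions in reading
    order; items missing a `page` field default to page 1.
    """
    lookup = {(t.get("page", 1), t["index"]): t["translated"]
              for t in raw_ocr_translations}
    resolved: dict[int, list[str]] = {}
    for page, regions in regions_by_page.items():
        resolved[page] = [
            lookup.get((page, i), region["content"])  # fall back to original text
            for i, region in enumerate(regions)
        ]
    return resolved
```

Falling back to the original text when an index is missing matches the "render translated text with original fallback" behavior described in the tasks below.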
- -#### Scenario: OCR Track translated PDF generation -- **GIVEN** an OCR Track document with translations -- **AND** the translation JSON contains `raw_ocr_translations` -- **WHEN** generating translated reflow PDF -- **THEN** it SHALL apply translations from `raw_ocr_translations` by index -- **AND** it SHALL render all translated text blocks in reading order - -#### Scenario: Direct Track translated PDF generation (unchanged) -- **GIVEN** a Direct Track document with translations -- **WHEN** generating translated reflow PDF -- **THEN** it SHALL apply translations from `translations` object by element_id -- **AND** existing behavior SHALL be unchanged diff --git a/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/tasks.md b/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/tasks.md deleted file mode 100644 index 98207c0..0000000 --- a/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/tasks.md +++ /dev/null @@ -1,76 +0,0 @@ -# Tasks: Fix OCR Track Translation - -## 1. Modify Translation Service - -- [x] 1.1 Add processing track detection - - File: `backend/app/services/translation_service.py` - - Location: `translate_document` method - - Read `metadata.processing_track` from result JSON - - Pass track type to extraction method - -- [x] 1.2 Create helper to load raw OCR regions - - File: `backend/app/services/translation_service.py` - - Function: `_load_raw_ocr_regions(result_dir, task_id, page_num)` - - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json` - - Return: List of text regions with index and content - -- [x] 1.3 Modify extract_translatable_elements for OCR Track - - File: `backend/app/services/translation_service.py` - - Added: `extract_translatable_elements_ocr_track` method - - Added parameters: `result_dir: Path`, `task_id: str` - - For OCR Track: Extract from raw_ocr_regions.json - - For Direct Track: Keep existing element-based extraction - -- [x] 1.4 Update translation result format - - File: `backend/app/services/translation_service.py` - - Location: `build_translation_result` method - - Added `processing_track` parameter - - For OCR Track: Output `raw_ocr_translations` field - - Structure: `[{"page": 1, "index": 0, "original": "...", "translated": "..."}]` - -## 2. Modify PDF Generation - -- [x] 2.1 Update generate_translated_pdf for OCR Track - - File: `backend/app/services/pdf_generator_service.py` - - Detect `processing_track` and `raw_ocr_translations` from translation JSON - - For OCR Track: Call `_generate_translated_pdf_ocr_track` - - For Direct Track: Continue using `apply_translations` (element-based) - -- [x] 2.2 Create helper to apply raw OCR translations - - File: `backend/app/services/pdf_generator_service.py` - - Function: `_generate_translated_pdf_ocr_track` - - Build translation lookup: `{(page, index): translated_text}` - - Load raw OCR regions, sort by Y coordinate - - Render translated text with original fallback - -## 3. 
Additional Fixes
-
-- [x] 3.1 Add page_number to TranslatedItem
-  - File: `backend/app/schemas/translation.py`
-  - Added `page_number: int = 1` to the TranslatedItem dataclass
-  - Updated `translate_batch` and `translate_item` to pass page_number
-
-- [x] 3.2 Update API endpoint validation
-  - File: `backend/app/routers/translate.py`
-  - Check for both `translations` (Direct Track) and `raw_ocr_translations` (OCR Track)
-
-- [x] 3.3 Filter text overlapping with images
-  - File: `backend/app/services/pdf_generator_service.py`
-  - Added `_collect_exclusion_zones`, `_is_region_overlapping_exclusion`, `_filter_regions_by_exclusion`
-  - Applied filtering in `generate_reflow_pdf` and `_generate_translated_pdf_ocr_track`
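A minimal sketch of the overlap test behind `_is_region_overlapping_exclusion`, assuming `(x0, y0, x1, y1)` boxes and an illustrative 50% coverage threshold (the real threshold is whatever the service configures):

```python
def _is_region_overlapping_exclusion(bbox, zones, min_overlap: float = 0.5) -> bool:
    """True when enough of `bbox` falls inside any exclusion zone.

    `min_overlap` is the fraction of the region's own area that must be
    covered by a zone before the region is dropped.
    """
    x0, y0, x1, y1 = bbox
    area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    if area == 0:
        return False
    for zx0, zy0, zx1, zy1 in zones:
        ix = max(0.0, min(x1, zx1) - max(x0, zx0))  # intersection width
        iy = max(0.0, min(y1, zy1) - max(y0, zy0))  # intersection height
        if ix * iy / area >= min_overlap:
            return True
    return False

def _filter_regions_by_exclusion(regions, zones):
    return [r for r in regions
            if not _is_region_overlapping_exclusion(r["bbox"], zones)]
```

Measuring coverage against the text region's own area (rather than the zone's) is what lets small labels inside a large image be filtered while text that merely touches an image edge survives.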
-
-## 4. Testing
-
-- [x] 4.1 Test OCR Track translation
-  - Test with: `f8265449-6cb7-425d-a213-5d2e1af73955`
-  - Verify: All 59 text blocks are sent for translation
-  - Verify: The translation JSON contains `raw_ocr_translations`
-
-- [x] 4.2 Test OCR Track translated PDF
-  - Generate the translated reflow PDF
-  - Verify: All translated text blocks appear correctly
-  - Verify: Text inside images (like EWsenel) is filtered out
-
-- [x] 4.3 Test Direct Track unchanged
-  - Verify: Translation still uses the element-based approach
-  - Verify: No regression in the Direct Track flow
diff --git a/openspec/changes/archive/2025-12-12-optimize-task-files-and-visualization/proposal.md b/openspec/changes/archive/2025-12-12-optimize-task-files-and-visualization/proposal.md
deleted file mode 100644
index 1121e8f..0000000
--- a/openspec/changes/archive/2025-12-12-optimize-task-files-and-visualization/proposal.md
+++ /dev/null
@@ -1,201 +0,0 @@
-# Proposal: Optimize Task File Generation and Visualization Download
-
-## Summary
-
-Optimize the file generation strategy used while processing OCR/Direct Track tasks, stop generating unnecessary files, and provide a download for visualization images.
-
-## File change overview
-
-### OCR Track file changes
-
-| File | Current | After change | Impact |
-|-----|---------|-------|------|
-| `*_result.json` | Generated | **Keep** | Core data; API/frontend depend on it |
-| `*_output.md` | Generated | **Stop generating** | Remove the frontend download button |
-| `*_layout.pdf` / `*_reflow.pdf` | Generated | **Keep** | Primary output formats |
-| `*_raw_ocr_regions.json` | Generated | **Keep** | Translation service depends on it |
-| `*_scan_page_N.png` | Generated | **Keep** | Needed for OCR processing and PDF generation |
-| `visualization/*.png` | Generated | **Keep** | New download feature |
-| `standalone_img_*.png` | Generated | **Keep** | Referenced by result.json; needed for PDF generation |
-| `img_in_table_*.png` | Generated | **Keep** | Referenced by result.json; needed for PDF generation |
-| `pp3_*.png` | Generated | **Keep** | Referenced by result.json; needed for PDF generation |
-| `*_pp_structure_raw.json` | Generated | **Stop generating** | Debug only; off by default |
-| `*_debug_summary.json` | Generated | **Stop generating** | Debug only; off by default |
-| `*_pp_structure_viz.png` | Generated | **Stop generating** | Debug only; off by default |
-
-### Direct Track file changes
-
-| File | Current | After change | Impact |
-|-----|---------|-------|------|
-| `*_result.json` | Generated | **Keep** | Core data; API/frontend depend on it |
-| `*_output.md` | Generated | **Stop generating** | Remove the frontend download button |
-| `*_layout.pdf` / `*_reflow.pdf` | Generated | **Keep** | Primary output formats |
-| `f66673cc_p*_img*.png` | Generated | **Keep** | Referenced by result.json; needed for PDF generation |
-| `f66673cc_p*_chart*.png` | Generated | **Keep** | Referenced by result.json; needed for PDF generation |
-
-### Change summary
-
-| Track | Files no longer generated | Estimated space saved |
-|-------|--------------|-------------|
-| OCR Track | `*_output.md`, `*_pp_structure_raw.json`, `*_debug_summary.json`, `*_pp_structure_viz.png` | ~300-1500 KB/page |
-| Direct Track | `*_output.md` | ~1-3 KB/file |
-
-## Backend changes
-
-### 1. config.py - change the defaults
-
-```python
-# Before
-pp_structure_debug_enabled: bool = Field(default=True)
-pp_structure_debug_visualization: bool = Field(default=True)
-
-# After
-pp_structure_debug_enabled: bool = Field(default=False)
-pp_structure_debug_visualization: bool = Field(default=False)
-```
-
-**Impact**: OCR Track no longer generates debug files (`*_pp_structure_raw.json`, `*_debug_summary.json`, `*_pp_structure_viz.png`)
-
-### 2. unified_document_exporter.py - stop generating Markdown
-
-Modify the `export_all()` method so it no longer generates `*_output.md` files.
-
-**Impact**: Neither track generates Markdown files anymore
-
-### 3. ocr_service.py - update save_results()
-
-Modify the `save_results()` method so it no longer generates the Markdown file, and adjust its return value.
-
-### 4. tasks.py (router) - remove the Markdown download endpoint
-
-Remove, or mark as deprecated, the `GET /api/v2/tasks/{task_id}/download/markdown` endpoint.
-
-### 5. tasks.py (router) - add a visualization download endpoint
-
-```python
-@router.get("/{task_id}/visualization-download")
-async def download_visualization_zip(task_id: str, ...):
-    """
-    Download visualization images as ZIP file.
-    Only available for OCR Track tasks with visualization folder.
-    """
-    # Check that the visualization folder exists
-    # (io, zipfile, HTTPException, StreamingResponse imported at module top;
-    #  get_result_dir is an illustrative path helper)
-    viz_dir = get_result_dir(task_id) / "visualization"
-    if not viz_dir.is_dir():
-        raise HTTPException(status_code=404, detail="No visualization available")
-    # Pack every PNG in the folder into an in-memory ZIP
-    buffer = io.BytesIO()
-    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
-        for png in sorted(viz_dir.glob("*.png")):
-            zf.write(png, arcname=png.name)
-    buffer.seek(0)
-    # Return a StreamingResponse (application/zip)
-    return StreamingResponse(buffer, media_type="application/zip")
-```
-
-### 6. Task model/schema - update fields
-
-- Stop writing the `result_markdown_path` field (keep the column, but no longer populate it)
-- Add `has_visualization: bool` to the TaskDetail response
-
-## Frontend changes
-
-### 1. TaskHistoryPage.tsx - remove the Markdown download button
-
-```tsx
-// Remove this block
-{task.result_markdown_path && (
-
-)}
-```
-
-### 2. ResultsPage.tsx - remove the Markdown download button
-
-```tsx
-// Remove this block
-
-```
-
-### 3. apiV2.ts - remove/add API methods
-
-```typescript
-// Remove
-async downloadMarkdown(taskId: string): Promise<Blob>
-
-// Add
-async downloadVisualization(taskId: string): Promise<Blob>
-```
-
-### 4. types/apiV2.ts - update the TaskDetail type
-
-```typescript
-export interface TaskDetail {
-  // ... existing fields
-  has_visualization?: boolean // new
-}
-```
-
-### 5. TaskDetailPage.tsx - add a visualization download button
-
-```tsx
-// Shown for OCR Track tasks that have a visualization folder
-{task.has_visualization && (
-
-)}
-```
-
-## Dependency check
-
-### Files that must be kept, and why
-
-| File | Depended on by | Purpose |
-|-----|---------|------|
-| `*_result.json` | API, frontend, translation service | Core structured data |
-| `*_raw_ocr_regions.json` | `translation_service.py` | Read during OCR Track translation |
-| `*_scan_page_N.png` | `pdf_generator_service.py` | Reflow PDF generation |
-| `visualization/*.png` | User downloads | Visualization of OCR recognition results |
-| All extracted images | `saved_path` entries in `*_result.json` | Embedded into generated PDFs |
-
-### Files that can be removed, and why
-
-| File | Reason |
-|-----|------|
-| `*_output.md` | No remaining use case once the frontend download button is removed |
-| `*_pp_structure_raw.json` | Debug only; not needed in production |
-| `*_debug_summary.json` | Debug only; not needed in production |
-| `*_pp_structure_viz.png` | Debug only; not needed in production |
-
-## Configuration notes
-
-### Backend settings (.env.local)
-
-```bash
-# Debug file generation (off by default)
-PP_STRUCTURE_DEBUG_ENABLED=false
-PP_STRUCTURE_DEBUG_VISUALIZATION=false
-
-# To re-enable debug file generation
-PP_STRUCTURE_DEBUG_ENABLED=true
-PP_STRUCTURE_DEBUG_VISUALIZATION=true
-```
-
-### Frontend settings
-
-No extra settings needed; removing the download buttons takes effect on its own.
-
-## Backward compatibility
-
-1. **API endpoint** - `GET /download/markdown` may be kept but should return 404 or a deprecation message
-2. **Database column** - the `result_markdown_path` column stays, but new tasks no longer write it
-3. **Old tasks** - existing Markdown files are unaffected and can still be downloaded
-
-## Implementation Plan
-
-1. Backend: change the config.py defaults (disable debug)
-2. Backend: stop Markdown generation in unified_document_exporter.py
-3. Backend: stop Markdown generation in ocr_service.py save_results()
-4. Backend: add the visualization download endpoint
-5. Backend: add has_visualization to the TaskDetail response
-6. Frontend: remove the TaskHistoryPage Markdown download button
-7. Frontend: remove the ResultsPage Markdown download button
-8. Frontend: remove the downloadMarkdown method from apiV2.ts
-9. Frontend: add the visualization download feature
-10. Test and verify
diff --git a/openspec/changes/archive/2025-12-12-optimize-task-files-and-visualization/tasks.md b/openspec/changes/archive/2025-12-12-optimize-task-files-and-visualization/tasks.md
deleted file mode 100644
index c4a934f..0000000
--- a/openspec/changes/archive/2025-12-12-optimize-task-files-and-visualization/tasks.md
+++ /dev/null
@@ -1,68 +0,0 @@
-# Tasks: Optimize Task File Generation and Visualization Download
-
-## 1. Backend configuration
-
-- [x] 1.1 Change the debug defaults in `config.py`
-  - `pp_structure_debug_enabled`: `True` → `False`
-  - `pp_structure_debug_visualization`: `True` → `False`
-
-## 2. Backend visualization download API
-
-- [x] 2.1 Add a visualization download endpoint in `tasks.py`
-  - `GET /api/v2/tasks/{task_id}/download/visualization`
-  - Check that the visualization folder exists
-  - Pack every PNG in the folder into a ZIP
-  - Return a StreamingResponse (application/zip)
-
-- [x] 2.2 Add a `has_visualization` field to the TaskDetail response
-  - Check whether the task result directory contains a visualization folder
-  - Return a boolean
-
-## 3. Frontend visualization download
-
-- [x] 3.1 Update the TaskDetail type in `types/apiV2.ts`
-  - Add `has_visualization?: boolean`
-
-- [x] 3.2 Add a download method in `apiV2.ts`
-  - `downloadVisualization(taskId: string): Promise<Blob>`
-
-- [x] 3.3 Add a download button in `TaskDetailPage.tsx`
-  - Shown only when `has_visualization = true`
-  - Clicking it downloads the ZIP file
-
-## 4. Stop generating Markdown files
-
-- [x] 4.1 Update the `save_results()` method in `ocr_service.py`
-  - Remove Markdown file generation
-  - `markdown_path` in the return value is always `None`
-
-- [x] 4.2 Update `unified_document_exporter.py`
-  - `export_all()`: remove the Markdown export
-  - `export_formats()`: remove Markdown support
-
-- [x] 4.3 Remove the JSON/MD download buttons from TaskHistoryPage.tsx
-  - Replace them with two download buttons: layout PDF and reflow PDF
-
-## 5. Ensure raw_ocr_regions.json is still generated
-
-- [x] 5.1 Split `raw_ocr_regions.json` generation out of the debug block
-  - Independent of the `pp_structure_debug_enabled` setting
-  - The file is required by PDF generation and the translation service
-
-- [x] 5.2 Add a `save_debug_results()` method in `pp_structure_debug.py`
-  - Saves only the pure debug files (`_pp_structure_raw.json`, `_debug_summary.json`)
-  - No longer saves `_raw_ocr_regions.json` a second time
-
-## 6. Bug fixes
-
-- [x] 6.1 Fix the Processing page not switching to a newly created task
-  - Reset `isNotFound` in `useTaskValidation.ts` whenever the taskId changes
-
-## 7. Testing and verification
-
-- [x] 7.1 Verify the TypeScript build passes
-- [ ] 7.2 Verify `*_raw_ocr_regions.json` is still generated
-- [ ] 7.3 Verify the visualization folder is still generated
-- [ ] 7.4 Test the visualization download feature
-- [ ] 7.5 Verify PDF content displays correctly
-- [ ] 7.6 Verify the Processing page switches correctly after a new upload
diff --git a/openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/proposal.md b/openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/proposal.md
deleted file mode 100644
index f030b8b..0000000
--- a/openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/proposal.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# Change: Simplify Frontend UX and Add English i18n
-
-## Why
-
-The current login page design is overly flashy (gradient animations, floating orbs, pulse effects) and inconsistent with the clean, professional style of the internal pages. The page also carries unsubstantiated marketing copy (claims such as "99% accuracy" and "enterprise-grade encryption"). In addition, the system currently supports only Traditional Chinese and lacks multi-language support.
-
-## What Changes
-
-### 1. LoginPage simplification
-- Remove the flashy animation effects (floating orbs, grid pattern, pulse animations)
-- Replace the gradient background with a clean solid-color background
-- Remove the unsubstantiated marketing block ("Why choose us", statistics cards)
-- Align the login page with the visual style of the internal pages
-
-### 2. Copy fixes
-- Remove exaggerated claims ("99% accuracy", "lightning fast", "enterprise-grade encryption")
-- Use factual feature descriptions instead
-
-### 3. i18n expansion
-- Add an English (en-US) translation file
-- Add a language switcher component
-- Persist the user's language preference to localStorage
-- Integrate the language switcher into the Layout top bar
-
-### 4. Overall style consistency
-- Ensure all pages use a consistent design language
-- Follow the professional minimal style guidelines
-
-## Impact
-
-- Affected specs: frontend-ui (added)
-- Affected code:
-  - `frontend/src/pages/LoginPage.tsx` - redesigned
-  - `frontend/src/components/Layout.tsx` - add the language switcher
-  - `frontend/src/i18n/index.ts` - extend the multi-language setup
-  - `frontend/src/i18n/locales/en-US.json` - new English translations
-  - `frontend/src/i18n/locales/zh-TW.json` - fill in missing translation keys
-  - `frontend/src/components/LanguageSwitcher.tsx` - new component
diff --git a/openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/specs/frontend-ui/spec.md b/openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/specs/frontend-ui/spec.md
deleted file mode 100644
index c3b6bfb..0000000
--- a/openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/specs/frontend-ui/spec.md
+++ /dev/null
@@ -1,89 +0,0 @@
-# Frontend UI Specification
-
-## ADDED Requirements
-
-### Requirement: Minimal Login Page Design
-
-The login page SHALL use a professional minimal design style that is consistent with the rest of the application.
-
-The login page SHALL NOT include:
-- Animated gradient backgrounds
-- Floating decorative elements (orbs, particles)
-- Pulsing or floating animations
-- Marketing claims or statistics
-- Feature promotion sections
-
-The login page SHALL include:
-- Centered login form with clean white card
-- Application logo and name
-- Username and password input fields
-- Login button with loading state
-- Error message display area
-- Simple solid color background
-
-#### Scenario: Login page renders with minimal design
-- **WHEN** user navigates to the login page
-- **THEN** the page displays a centered login form
-- **AND** no animated decorative elements are visible
-- **AND** no marketing content is displayed
-
-#### Scenario: Login form visual consistency
-- **WHEN** comparing the login page to internal pages
-- **THEN** the visual style (colors, typography, spacing) is consistent
-
----
-
-### Requirement: Multi-language Support
-
-The application SHALL support multiple languages with user-selectable language preference.
-
-Supported languages:
-- Traditional Chinese (zh-TW) - Default
-- English (en-US)
-
-The language selection SHALL be persisted in localStorage and restored on page reload.
-
-#### Scenario: Language switcher available
-- **WHEN** user is logged in and viewing any page
-- **THEN** a language switcher component is visible in the top navigation bar
-
-#### Scenario: Switch to English
-- **WHEN** user selects English from the language switcher
-- **THEN** all UI text immediately changes to English
-- **AND** the preference is saved to localStorage
-
-#### Scenario: Switch to Traditional Chinese
-- **WHEN** user selects Traditional Chinese from the language switcher
-- **THEN** all UI text immediately changes to Traditional Chinese
-- **AND** the preference is saved to localStorage
-
-#### Scenario: Language preference persistence
-- **WHEN** user has previously selected a language preference
-- **AND** user reloads the page or returns later
-- **THEN** the application displays in the previously selected language
-
----
-
-### Requirement: Accurate Product Description
-
-All user-facing text SHALL accurately describe the product capabilities without exaggeration.
-
-The application SHALL NOT display:
-- Unverified accuracy percentages (e.g., "99% accuracy")
-- Superlative marketing claims (e.g., "lightning fast", "enterprise-grade")
-- Unsubstantiated statistics
-- Comparative claims without evidence
-
-The application MAY display:
-- Factual feature descriptions
-- Supported file formats
-- Authentication method information
-
-#### Scenario: Login page displays factual information
-- **WHEN** user views the login page
-- **THEN** only factual product information is displayed
-- **AND** no unverified claims are present
-
-#### Scenario: Feature descriptions are accurate
-- **WHEN** any page describes product features
-- **THEN** the descriptions are factual and verifiable
diff --git a/openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/tasks.md b/openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/tasks.md
deleted file mode 100644
index 1336020..0000000
--- a/openspec/changes/archive/2025-12-12-refactor-frontend-ux-i18n/tasks.md
+++ /dev/null
@@ -1,29 +0,0 @@
-# Tasks: Simplify Frontend UX and Add English i18n
-
-## 1. LoginPage simplification
-
-- [x] 1.1 Remove animated background elements (floating orbs, grid pattern, pulse effects)
-- [x] 1.2 Replace the gradient background with a clean solid-color background
-- [x] 1.3 Remove the left-hand marketing block ("Why choose us", feature highlights, statistics)
-- [x] 1.4 Redesign the login form with a centered, minimal layout
-- [x] 1.5 Remove unnecessary animation classes (animate-float, animate-slide-in-left, etc.)
-
-## 2. English i18n support
-
-- [x] 2.1 Create the English translation file `frontend/src/i18n/locales/en-US.json`
-- [x] 2.2 Update `frontend/src/i18n/index.ts` to support language switching
-- [x] 2.3 Fill in missing translation keys in `zh-TW.json` (login page)
-
-## 3. Language switching
-
-- [x] 3.1 Create the `frontend/src/components/LanguageSwitcher.tsx` component
-- [x] 3.2 Integrate the language switcher into the `Layout.tsx` top bar
-- [x] 3.3 Persist the language preference to localStorage
-- [x] 3.4 Ensure switching takes effect immediately (no page reload needed)
-
-## 4. Testing and verification
-
-- [x] 4.1 Verify the LoginPage renders well at different screen sizes
-- [x] 4.2 Verify switching between Chinese and English works correctly
-- [x] 4.3 Verify the language preference survives a page reload
-- [x] 4.4 Check translation completeness across all pages
diff --git a/openspec/changes/archive/2025-12-14-add-storage-cleanup/proposal.md b/openspec/changes/archive/2025-12-14-add-storage-cleanup/proposal.md
deleted file mode 100644
index 1cd48ea..0000000
--- a/openspec/changes/archive/2025-12-14-add-storage-cleanup/proposal.md
+++ /dev/null
@@ -1,60 +0,0 @@
-# Change: Add Storage Cleanup Mechanism
-
-## Why
-The system currently lacks a complete disk-space management mechanism:
-- `delete_task` only deletes the database record, not the actual files
-- `auto_cleanup_expired_tasks` exists but is never called
-- Uploaded files (uploads/) and result files (storage/results/) accumulate without bound
-
-Users need:
-1. Periodic cleanup of expired files to reclaim disk space
-2. Database records preserved so administrators can view cumulative statistics (tokens, cost, usage)
-3. A soft-delete mechanism so users can "delete" tasks without affecting the statistics
-
-## What Changes
-
-### Backend Changes
-1. **Task model extension**
-   - Add a `deleted_at` column to implement soft deletion
-   - Keep the existing `file_deleted` column to track file cleanup status
-
-2. **Task service updates**
-   - `delete_task()` becomes a soft delete (sets `deleted_at`; files are left alone)
-   - User queries automatically filter out records where `deleted_at IS NOT NULL`
-   - Add a `cleanup_expired_files()` method to clean up expired files
-
-3. **New cleanup service**
-   - Periodic scheduled job (configurable interval; daily recommended)
-   - Cleanup policy: keep the files of each user's newest N tasks (default 50)
-   - Deletes files only, never database records (statistics are preserved)
-
-4. **Admin endpoint extensions**
-   - New `/api/v2/admin/tasks` endpoint: view all tasks (including deleted ones)
-   - Supported filters: `include_deleted=true/false`, `include_files_deleted=true/false`
-
-### Frontend Changes
-5. **Task History page**
-   - Users only see their own tasks (user_id isolation already exists)
-   - Soft-deleted tasks are hidden from the list
-
-6. **Admin dashboard**
-   - New task management view
-   - Shows all tasks with status badges (deleted, files cleaned)
-   - Cumulative statistics remain viewable regardless of deletions
-
-### Configuration
-7. **Config: new settings**
-   - `cleanup_interval_hours`: cleanup interval (default 24)
-   - `max_files_per_user`: number of newest task files kept per user (default 50)
-   - `cleanup_enabled`: whether automatic cleanup is enabled (default true)
-
-## Impact
-- Affected specs: `task-management`
-- Affected code:
-  - `backend/app/models/task.py` - add the deleted_at column
-  - `backend/app/services/task_service.py` - soft delete and query logic
-  - `backend/app/services/cleanup_service.py` - new file
-  - `backend/app/routers/admin.py` - new endpoints
-  - `backend/app/core/config.py` - new settings
-  - `frontend/src/pages/AdminDashboardPage.tsx` - task management view
-- Database migration required: add the `deleted_at` column
diff --git a/openspec/changes/archive/2025-12-14-add-storage-cleanup/specs/task-management/spec.md b/openspec/changes/archive/2025-12-14-add-storage-cleanup/specs/task-management/spec.md
deleted file mode 100644
index e4ac06f..0000000
--- a/openspec/changes/archive/2025-12-14-add-storage-cleanup/specs/task-management/spec.md
+++ /dev/null
@@ -1,116 +0,0 @@
-# task-management Spec Delta
-
-## ADDED Requirements
-
-### Requirement: Soft Delete Tasks
-The system SHALL support soft deletion of tasks, marking them as deleted without removing database records, to preserve usage statistics.
-
-#### Scenario: User soft deletes a task
-- **WHEN** user calls DELETE on `/api/v2/tasks/{task_id}`
-- **THEN** system SHALL set `deleted_at` timestamp on the task record
-- **AND** system SHALL NOT delete the actual files
-- **AND** system SHALL NOT remove the database record
-- **AND** subsequent user queries SHALL NOT return this task
-
-#### Scenario: Preserve statistics after soft delete
-- **WHEN** a task is soft deleted
-- **THEN** admin statistics endpoints SHALL continue to include this task's metrics
-- **AND** translation token counts SHALL remain in cumulative totals
-- **AND** processing time statistics SHALL remain accurate
-
-### Requirement: File Cleanup Scheduler
-The system SHALL automatically clean up old files while preserving database records for statistics tracking.
-
-#### Scenario: Scheduled file cleanup
-- **WHEN** cleanup scheduler runs (configurable interval, default daily)
-- **THEN** system SHALL identify tasks whose files can be deleted
-- **AND** system SHALL retain the newest N files per user (configurable, default 50)
-- **AND** system SHALL delete actual files from disk for older tasks
-- **AND** system SHALL set `file_deleted=True` on cleaned tasks
-- **AND** system SHALL NOT delete any database records
-
-#### Scenario: File retention per user
-- **WHEN** user has more than `max_files_per_user` tasks with files
-- **THEN** cleanup SHALL delete files for the oldest tasks exceeding the limit
-- **AND** cleanup SHALL preserve the newest `max_files_per_user` task files
-- **AND** task ordering SHALL be by `created_at` descending
-
-#### Scenario: Manual cleanup trigger
-- **WHEN** admin calls POST `/api/v2/admin/cleanup/trigger`
-- **THEN** system SHALL immediately run the cleanup process
-- **AND** return a summary of files deleted and space freed
-
-### Requirement: Admin Task Visibility
-Admin users SHALL have full visibility into all tasks, including soft-deleted and file-cleaned tasks.
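These visibility rules reduce to a single predicate. A sketch under the spec's assumptions, where `task` is any record exposing `deleted_at`, `file_deleted`, and `user_id` attributes; the actual service would apply the same logic as SQL filters rather than in Python:

```python
from typing import Optional

def task_visible(task, *, admin: bool, include_deleted: bool = True,
                 include_files_deleted: bool = True,
                 user_id: Optional[int] = None) -> bool:
    """Mirror the visibility rules above for one task record."""
    if user_id is not None and task.user_id != user_id:
        return False  # scoped to a specific user's tasks
    if not admin:
        # Regular users never see soft-deleted tasks, but file-cleaned
        # tasks remain visible (shown with a "file unavailable" status).
        return task.deleted_at is None
    if not include_deleted and task.deleted_at is not None:
        return False
    if not include_files_deleted and task.file_deleted:
        return False
    return True
```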
- -#### Scenario: Admin lists all tasks -- **WHEN** admin calls GET `/api/v2/admin/tasks` -- **THEN** response SHALL include all tasks from all users -- **AND** response SHALL include soft-deleted tasks -- **AND** response SHALL include tasks with deleted files -- **AND** each task SHALL indicate its deletion status - -#### Scenario: Filter admin task list -- **WHEN** admin calls GET `/api/v2/admin/tasks` with filters -- **THEN** `include_deleted=false` SHALL exclude soft-deleted tasks -- **AND** `include_files_deleted=false` SHALL exclude file-cleaned tasks -- **AND** `user_id={id}` SHALL filter to specific user's tasks - -#### Scenario: View storage usage statistics -- **WHEN** admin calls GET `/api/v2/admin/storage/stats` -- **THEN** response SHALL include total storage used -- **AND** response SHALL include per-user storage breakdown -- **AND** response SHALL include count of tasks with/without files - -### Requirement: User Task Isolation -Regular users SHALL only see their own tasks and soft-deleted tasks SHALL be hidden from their view. - -#### Scenario: User lists own tasks -- **WHEN** authenticated user calls GET `/api/v2/tasks` -- **THEN** response SHALL only include tasks owned by that user -- **AND** response SHALL NOT include soft-deleted tasks -- **AND** response SHALL include tasks with deleted files (showing file unavailable status) - -#### Scenario: User cannot access other user's tasks -- **WHEN** user attempts to access task owned by another user -- **THEN** system SHALL return 404 Not Found -- **AND** system SHALL NOT reveal that the task exists - -## MODIFIED Requirements - -### Requirement: Task Detail View -The frontend SHALL provide a dedicated page for viewing individual task details with processing track information, enhanced preview capabilities, and file availability status. 
-
-#### Scenario: Navigate to task detail page
-- **WHEN** user clicks "View Details" button on task in Task History page
-- **THEN** browser SHALL navigate to `/tasks/{task_id}`
-- **AND** TaskDetailPage component SHALL render
-
-#### Scenario: Display task information
-- **WHEN** TaskDetailPage loads for a valid task ID
-- **THEN** page SHALL display task metadata (filename, status, processing time, confidence)
-- **AND** page SHALL show markdown preview of OCR results
-- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats
-
-#### Scenario: Download from task detail page
-- **WHEN** user clicks download button for a specific format
-- **THEN** browser SHALL download the file using the `/api/v2/tasks/{task_id}/download/{format}` endpoint
-- **AND** downloaded file SHALL contain the task's OCR results in the requested format
-
-#### Scenario: Display processing track information
-- **WHEN** viewing task processed through dual-track system
-- **THEN** page SHALL display processing track used (OCR or Direct)
-- **AND** show track-specific metrics (OCR confidence or extraction quality)
-- **AND** provide option to reprocess with alternate track if applicable
-
-#### Scenario: Preview document structure
-- **WHEN** user enables structure view
-- **THEN** page SHALL display document element hierarchy
-- **AND** show bounding boxes overlay on preview
-- **AND** highlight different element types (headers, tables, lists) with distinct colors
-
-#### Scenario: Display file unavailable status
-- **WHEN** task has `file_deleted=True`
-- **THEN** page SHALL show file unavailable indicator
-- **AND** download buttons SHALL be disabled or hidden
-- **AND** page SHALL display explanation that files were cleaned up
diff --git a/openspec/changes/archive/2025-12-14-add-storage-cleanup/tasks.md b/openspec/changes/archive/2025-12-14-add-storage-cleanup/tasks.md
deleted file mode 100644
index 7e9d009..0000000
--- a/openspec/changes/archive/2025-12-14-add-storage-cleanup/tasks.md
+++ /dev/null
@@ -1,49 +0,0 @@
-# Tasks: Add Storage Cleanup Mechanism
-
-## 1. Database Schema
-- [x] 1.1 Add `deleted_at` column to Task model
-- [x] 1.2 Create database migration for deleted_at column
-- [x] 1.3 Run migration and verify column exists
-
-## 2. Task Service Updates
-- [x] 2.1 Update `delete_task()` to set `deleted_at` instead of deleting the record
-- [x] 2.2 Update `get_tasks()` to filter out soft-deleted tasks for regular users
-- [x] 2.3 Update `get_task_by_id()` to respect soft delete for regular users
-- [x] 2.4 Add `get_all_tasks()` method for admin (includes deleted)
-
-## 3. Cleanup Service
-- [x] 3.1 Create `cleanup_service.py` with file cleanup logic
-- [x] 3.2 Implement per-user file retention (keep newest N files)
-- [x] 3.3 Add method to calculate storage usage per user
-- [x] 3.4 Set `file_deleted=True` after cleaning files
-
-## 4. Scheduled Cleanup Task
-- [x] 4.1 Add cleanup configuration to `config.py`
-- [x] 4.2 Create scheduler for periodic cleanup
-- [x] 4.3 Add startup hook to register cleanup task
-- [x] 4.4 Add manual cleanup trigger endpoint for admin
-
-## 5. Admin API Endpoints
-- [x] 5.1 Add `GET /api/v2/admin/tasks` endpoint
-- [x] 5.2 Support filters: `include_deleted`, `include_files_deleted`, `user_id`
-- [x] 5.3 Add pagination support
-- [x] 5.4 Add storage usage statistics endpoint
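The retention rule behind tasks 3.2 and 3.4 is a small pure function. A sketch under the proposal's assumptions (tasks carry `created_at` and `file_deleted`; the default of 50 comes from `max_files_per_user`):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TaskRow:
    """Minimal stand-in for the ORM Task model used by cleanup."""
    task_id: str
    created_at: datetime
    file_deleted: bool = False

def files_to_clean(user_tasks: list[TaskRow],
                   max_files_per_user: int = 50) -> list[TaskRow]:
    """Keep the newest N tasks' files for one user; everything older
    that still has files on disk is returned for deletion."""
    newest_first = sorted(user_tasks, key=lambda t: t.created_at, reverse=True)
    return [t for t in newest_first[max_files_per_user:] if not t.file_deleted]
```

The manual trigger endpoint from task 4.4 would run this per user, delete the selected files, set `file_deleted=True`, and report the number of files deleted and the space freed.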
-
-## 6. Frontend Updates
-- [x] 6.1 Verify TaskHistoryPage correctly filters by user (existing user_id isolation)
-- [x] 6.2 Add admin task management view to AdminDashboardPage
-- [x] 6.3 Display soft-deleted and files-cleaned status badges (i18n ready)
-- [x] 6.4 Add i18n keys for new UI elements
-
-## 7. Testing
-- [x] 7.1 Test soft delete preserves database record (code verified)
-- [x] 7.2 Test user isolation (users see only own tasks - existing)
-- [x] 7.3 Test admin sees all tasks including deleted (API verified)
-- [x] 7.4 Test file cleanup retains newest N files (code verified)
-- [x] 7.5 Test storage statistics calculation (API verified)
-
-## Notes
-- All tasks completed, including the automatic scheduler
-- Cleanup runs automatically at the configured interval (default: 24 hours)
-- A manual cleanup trigger is also available via the admin endpoint
-- Scheduler status can be checked via `GET /api/v2/admin/cleanup/status`
diff --git a/openspec/changes/archive/2025-12-14-enable-audit-logging/proposal.md b/openspec/changes/archive/2025-12-14-enable-audit-logging/proposal.md
deleted file mode 100644
index 3768694..0000000
--- a/openspec/changes/archive/2025-12-14-enable-audit-logging/proposal.md
+++ /dev/null
@@ -1,52 +0,0 @@
-# Enable Audit Logging
-
-## Summary
-Activate the existing audit logging infrastructure by adding `audit_service.log_event()` calls to key system operations. The audit log table and service already exist but are not being used.
-
-## Motivation
-- The audit logs page exists but shows no data because events are never recorded
-- Security compliance requires tracking of authentication and administrative actions
-- Administrators need visibility into system usage and potential security issues
-
-## Current State
-- `AuditLog` model exists in `backend/app/models/audit_log.py`
-- `AuditService` with `log_event()` method exists in `backend/app/services/audit_service.py`
-- `AuditLogsPage` frontend exists at `/admin/audit-logs`
-- Admin API endpoint `GET /api/v2/admin/audit-logs` exists
-- **Problem**: No code calls `audit_service.log_event()` - the logs are always empty
-
-## Proposed Changes
-
-### Events to Log
-
-| Event Type | Category | Location | Description |
-|------------|----------|----------|-------------|
-| `auth_login` | authentication | auth.py | User login (success/failure) |
-| `auth_logout` | authentication | auth.py | User logout |
-| `auth_token_refresh` | authentication | auth.py | Token refresh |
-| `task_create` | task | tasks.py | Task created |
-| `task_process` | task | tasks.py | Task processing started |
-| `task_complete` | task | tasks.py | Task completed |
-| `task_delete` | task | tasks.py | Task deleted |
-| `admin_cleanup` | admin | admin.py | Manual cleanup triggered |
-| `admin_view_users` | admin | admin.py | Admin viewed user list |
-| `file_upload` | file | main.py | File uploaded |
-
-### Implementation Approach
-1. Add a helper function to extract client info (IP, user agent) from the Request (see the sketch below)
-2. Add `audit_service.log_event()` calls at each operation point
-3. Ensure all events capture: user_id, IP address, user agent, resource info
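A minimal sketch of the step-1 helper. Trusting `X-Forwarded-For` assumes the app sits behind a trusted reverse proxy; otherwise the header can be spoofed:

```python
from fastapi import Request

def client_info(request: Request) -> dict:
    """Collect the client fields every audit event should carry."""
    forwarded = request.headers.get("x-forwarded-for")
    # First hop of X-Forwarded-For, else the direct peer address.
    ip = (forwarded.split(",")[0].strip() if forwarded
          else request.client.host if request.client else None)
    return {"ip_address": ip, "user_agent": request.headers.get("user-agent")}
```

A call site might then read `audit_service.log_event(event_type="auth_login", category="authentication", user_id=user.id, **client_info(request))`, although the exact keyword arguments are whatever `AuditService.log_event()` actually accepts.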
-
-## Non-Goals
-- Creating a new audit log model (already exists)
-- Changing the audit log API endpoints (they already work)
-- Modifying the frontend audit logs page (already complete)
-
-## Affected Specs
-- None (infrastructure already in place)
-
-## Testing
-- Verify audit logs appear after login/logout
-- Verify task operations are logged
-- Verify admin actions are logged
-- Check that the audit logs page displays new entries
diff --git a/openspec/changes/archive/2025-12-14-enable-audit-logging/tasks.md b/openspec/changes/archive/2025-12-14-enable-audit-logging/tasks.md
deleted file mode 100644
index 95dc7c6..0000000
--- a/openspec/changes/archive/2025-12-14-enable-audit-logging/tasks.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# Tasks: Enable Audit Logging
-
-## 1. Helper Utilities
-- [x] 1.1 Create helper function to extract client info (IP, user agent) from FastAPI Request
-
-## 2. Authentication Events
-- [x] 2.1 Log `auth_login` on successful/failed login in auth.py
-- [x] 2.2 Log `auth_logout` on logout in auth.py
-- [ ] 2.3 Log `auth_token_refresh` on token refresh (deferred - low priority)
-
-## 3. Task Events
-- [ ] 3.1 Log `task_create` when task is created (deferred - covered by file_upload)
-- [ ] 3.2 Log `task_process` when task processing starts (deferred - background task)
-- [ ] 3.3 Log `task_complete` when task completes (deferred - background task)
-- [x] 3.4 Log `task_delete` when task is deleted
-
-## 4. Admin Events
-- [x] 4.1 Log `admin_cleanup` when manual cleanup is triggered
-- [ ] 4.2 Log `admin_view_users` when admin views user list (deferred - low priority)
-
-## 5. File Events
-- [x] 5.1 Log `file_upload` when file is uploaded
-
-## 6. Testing
-- [ ] 6.1 Verify login creates audit log entry
-- [ ] 6.2 Verify task operations create audit log entries
-- [ ] 6.3 Verify audit logs page shows entries
-- [x] 6.4 Test backend module imports
-
-## Notes
-- Core audit events implemented: login, logout, task delete, file upload, admin cleanup
-- Background task events (task_process, task_complete) deferred - they would require significant refactoring
-- Low-priority admin events deferred for future implementation
diff --git a/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/proposal.md b/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/proposal.md
deleted file mode 100644
index f6a903b..0000000
--- a/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/proposal.md
+++ /dev/null
@@ -1,62 +0,0 @@
-# Change: Simplify Frontend Pages and Add Translation Billing
-
-## Why
-
-The frontend currently has several redundant pages and features that should be trimmed to improve usability and maintainability:
-1. The JSON/MD downloads on the Tasks page are no longer needed (only the PDF downloads remain)
-2. The Export page overlaps with the Tasks page, and its complexity exceeds actual needs
-3. The Settings page only manages export rules, and the export feature is about to be removed
-
-At the same time, the system has integrated the Dify translation service, so the admin dashboard needs translation billing tracking to monitor API token usage and cost.
-
-## What Changes
-
-### 1. Remove the JSON/MD download buttons from the Tasks page (frontend)
-- Already removed from TaskDetailPage; also remove the related functionality from ExportPage
-- Keep the API methods in apiV2.ts (preserving backend compatibility)
-
-### 2. Remove the Export page (frontend)
-- Remove `frontend/src/pages/ExportPage.tsx`
-- Remove the `/export` route from the App.tsx route configuration
-- Remove the Export link from the Layout.tsx navigation menu
-- Remove the export-related i18n translations (optional; no functional impact)
-
-### 3. Remove the Settings page (frontend)
-- Remove `frontend/src/pages/SettingsPage.tsx`
-- Remove the `/settings` route from the App.tsx route configuration
-- Remove the Settings link from the Layout.tsx navigation menu
-- Keep the backend Export Rules API (does not affect existing data)
-
-### 4. Add translation billing (frontend + backend)
-
-#### Backend additions:
-- Add a `get_translation_statistics()` method to `AdminService`
-- Add the API endpoint `GET /api/v2/admin/translation-stats`
-- Response structure:
-  - Total number of translation jobs
-  - Total token usage (input_tokens, output_tokens)
-  - Per-language translation statistics
-  - Estimated cost (based on the configured token pricing)
-
-#### Frontend additions:
-- Add a "Translation statistics" card to AdminDashboardPage
-- Show total token usage, translation count, and estimated cost
-- Show the distribution of translations across target languages
-
-## Impact
-
-- Affected specs: frontend-ui (modified), backend-api (modified)
-- Affected code:
-  - **Frontend removals**:
-    - `frontend/src/pages/ExportPage.tsx`
-    - `frontend/src/pages/SettingsPage.tsx`
-    - `frontend/src/App.tsx` (routes)
-    - `frontend/src/components/Layout.tsx` (navigation)
-  - **Backend additions**:
-    - `backend/app/services/admin_service.py` (translation statistics method)
-    - `backend/app/routers/admin.py` (new API endpoint)
-    - `backend/app/schemas/admin.py` (response schema)
-  - **Frontend additions**:
-    - `frontend/src/pages/AdminDashboardPage.tsx` (translation statistics component)
-    - `frontend/src/services/apiV2.ts` (new API call)
-    - `frontend/src/types/apiV2.ts` (new types)
diff --git a/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/specs/backend-api/spec.md b/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/specs/backend-api/spec.md
deleted file mode 100644
index b57c16d..0000000
--- a/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/specs/backend-api/spec.md
+++ /dev/null
@@ -1,22 +0,0 @@
-# Spec Delta: backend-api
-
-## ADDED Requirements
-
-### Requirement: Translation Statistics Endpoint
-The system SHALL provide a new admin API endpoint for translation usage statistics across all users.
-
-#### Scenario: Admin requests translation statistics
-- GIVEN the admin is authenticated
-- WHEN GET /api/v2/admin/translation-stats is called
-- THEN the response contains:
-  - total_translations: number of translation jobs
-  - total_input_tokens: sum of input tokens used
-  - total_output_tokens: sum of output tokens used
-  - estimated_cost: calculated cost based on token pricing
-  - by_language: breakdown of translations by target language
-  - recent_translations: list of recent translation activities
-
-#### Scenario: Non-admin user requests translation statistics
-- GIVEN a regular user is authenticated
-- WHEN GET /api/v2/admin/translation-stats is called
-- THEN the response is 403 Forbidden
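How the response above might be assembled from `TranslationLog` rows; a pure-Python sketch with placeholder pricing constants (the real rates come from configuration, and the real query would aggregate in SQL):

```python
from collections import Counter

# Placeholder rates per 1K tokens; actual values live in configuration.
COST_PER_1K_INPUT = 0.001
COST_PER_1K_OUTPUT = 0.002

def summarize(logs: list[dict]) -> dict:
    """Fold TranslationLog-like rows into the response shape above."""
    total_in = sum(row.get("input_tokens", 0) for row in logs)
    total_out = sum(row.get("output_tokens", 0) for row in logs)
    return {
        "total_translations": len(logs),
        "total_input_tokens": total_in,
        "total_output_tokens": total_out,
        "estimated_cost": (total_in / 1000) * COST_PER_1K_INPUT
                          + (total_out / 1000) * COST_PER_1K_OUTPUT,
        "by_language": dict(Counter(row.get("target_language", "unknown")
                                    for row in logs)),
    }
```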
diff --git a/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/specs/frontend-ui/spec.md b/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/specs/frontend-ui/spec.md
deleted file mode 100644
index db95276..0000000
--- a/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/specs/frontend-ui/spec.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# Spec Delta: frontend-ui
-
-## REMOVED Requirements
-
-- REQ-FE-EXPORT: Export Page - The export page for batch exporting task results is removed.
-- REQ-FE-SETTINGS: Settings Page - The settings page for managing export rules is removed.
-
-## ADDED Requirements
-
-### Requirement: Translation Statistics in Admin Dashboard
-The admin dashboard SHALL display translation usage statistics and estimated costs.
-
-#### Scenario: Admin views translation statistics
-- GIVEN the user is logged in as admin
-- WHEN the user views the admin dashboard
-- THEN the page displays a translation statistics card showing:
-  - Total translation count
-  - Total token usage (input + output tokens)
-  - Estimated cost based on token pricing
-  - Breakdown by target language
-
-## MODIFIED Requirements
-
-### Requirement: Navigation Menu Updated
-The navigation menu SHALL be updated to remove the Export and Settings links.
-
-#### Scenario: User views navigation menu
-- GIVEN the user is logged in
-- WHEN the user views the sidebar navigation
-- THEN the menu shows: Upload, Processing, Results, Task History, Admin (if admin)
-- AND the menu does NOT show: Export, Settings
diff --git a/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/tasks.md b/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/tasks.md
deleted file mode 100644
index d1b2e00..0000000
--- a/openspec/changes/archive/2025-12-14-simplify-frontend-add-billing/tasks.md
+++ /dev/null
@@ -1,39 +0,0 @@
-# Tasks: Simplify Frontend Pages and Add Translation Billing
-
-## 1. Remove the Export page
-
-- [x] 1.1 Remove the `/export` route from App.tsx
-- [x] 1.2 Remove the Export link from the Layout.tsx navigation menu
-- [x] 1.3 Delete `frontend/src/pages/ExportPage.tsx`
-
-## 2. Remove the Settings page
-
-- [x] 2.1 Remove the `/settings` route from App.tsx
-- [x] 2.2 Remove the Settings link from the Layout.tsx navigation menu
-- [x] 2.3 Delete `frontend/src/pages/SettingsPage.tsx`
-
-## 3. Backend translation statistics API
-
-- [x] 3.1 Add the `TranslationLog` model and migration
-- [x] 3.2 Add a `get_translation_statistics()` method to `admin_service.py`
-- [x] 3.3 Add the `GET /admin/translation-stats` endpoint to the `admin.py` router
-- [x] 3.4 Write statistics to the database when a translation completes
-
-## 4. Frontend translation statistics display
-
-- [x] 4.1 Add a `getTranslationStats()` API call to `apiV2.ts`
-- [x] 4.2 Add translation statistics type definitions to `types/apiV2.ts`
-- [x] 4.3 Add a translation statistics card to `AdminDashboardPage.tsx`
-
-## 5. i18n translations
-
-- [ ] 5.1 Add Chinese translations for the translation statistics (hard-coded for now)
-- [ ] 5.2 Add English translations for the translation statistics (hard-coded for now)
-
-## 6. Testing and verification
-
-- [x] 6.1 Verify the Export/Settings page routes are removed
-- [x] 6.2 Verify the navigation menu is updated
-- [x] 6.3 Verify the TypeScript build passes
-- [ ] 6.4 Test that the translation statistics API returns correct data (requires a real translation run)
-- [ ] 6.5 Test that the admin dashboard displays translation statistics (requires manual testing)
diff --git a/openspec/project.md b/openspec/project.md
deleted file mode 100644
index 480b5cb..0000000
--- a/openspec/project.md
+++ /dev/null
@@ -1,341 +0,0 @@
-# Project Context
-
-## Purpose
-Tool_OCR is a web-based application for batch image-to-text conversion with multi-language support and rule-based output formatting. The tool uses a modern frontend-backend separation architecture, designed to process multiple images/PDFs simultaneously, extract text using OCR, and export results in various formats according to user-defined rules.
-
-**Key Goals:**
-- Batch processing of images and PDF files for text extraction via web interface
-- Multi-language OCR support (Chinese, English, and other languages)
-- Rule-based output formatting and organization
-- User-friendly web interface accessible via browser
-- Export flexibility (TXT, JSON, Excel, etc.)
-- RESTful API for OCR processing - -## Tech Stack - -### Development Environment -- **OS Platform**: WSL2 Ubuntu 24.04 -- **Python Version**: 3.12 -- **Environment Manager**: Python venv -- **Virtual Environment Path**: `./venv` -- **Node.js**: 24.x LTS (via nvm) -- **IDE Recommended**: VS Code with Python + React extensions - -### Backend Technologies -- **Language**: Python 3.10+ -- **Web Framework**: FastAPI (modern, async, auto API docs) -- **OCR Engine**: PaddleOCR 3.0+ with PaddleOCR-VL (deep learning-based, excellent multi-language support) -- **Deep Learning Framework**: PaddlePaddle 3.2.1+ (GPU/CPU support, CUDA 11.8/12.3/12.6+) -- **Structure Analysis**: PP-StructureV3 (layout analysis, table recognition, formula extraction, chart recognition) -- **PDF Processing**: PyPDF2 / pdf2image -- **Image Processing**: Pillow (PIL), OpenCV -- **Data Export**: pandas (Excel), json (JSON) -- **Database**: MySQL (configuration storage, task history) -- **Cache**: Redis (optional, for task queue) -- **Authentication**: JWT - -### Frontend Technologies -- **Framework**: React 18+ -- **Build Tool**: Vite -- **UI Library**: Tailwind CSS + shadcn/ui -- **State Management**: React Query (for API calls) + Zustand (for global state) -- **HTTP Client**: Axios -- **File Upload**: react-dropzone - -### Development Tools -- **Package Manager**: Conda + pip (backend), npm/pnpm (frontend) -- **Deployment**: 1Panel (web-based server management) -- **Process Manager**: systemd / PM2 / Supervisor -- **Web Server**: Nginx (reverse proxy) -- **Testing**: pytest (backend), Vitest (frontend) -- **Code Style**: Black + pylint (Python), ESLint + Prettier (JavaScript/TypeScript) -- **Version Control**: Git - -### Key Libraries (Backend) -- fastapi: Web framework -- uvicorn: ASGI server -- paddleocr: OCR processing -- paddlepaddle: Deep learning framework (GPU/CPU) -- paddlex[ocr]: PP-StructureV3 for layout analysis and chart recognition -- pdf2image: PDF to image conversion -- pillow: Image manipulation -- opencv-python: Advanced image processing -- pandas: Data export to Excel -- pyyaml: Configuration management -- python-jose: JWT authentication -- sqlalchemy: Database ORM -- pydantic: Data validation - -### Key Libraries (Frontend) -- react: UI framework -- vite: Build tool -- tailwindcss: CSS framework -- shadcn/ui: UI components -- axios: HTTP client -- react-query: Server state management -- zustand: Client state management -- react-dropzone: File upload - -## Project Conventions - -### Environment Setup (Backend) -```bash -# Run automated setup script (recommended) -./setup_dev_env.sh - -# Or manually: -# Create Python virtual environment -python3 -m venv venv - -# Activate environment -source venv/bin/activate - -# Install dependencies -pip install -r requirements.txt -``` - -### Environment Setup (Frontend) -```bash -# Navigate to frontend directory -cd frontend - -# Install dependencies -npm install - -# Run dev server -npm run dev -``` - -### Code Style - -#### Backend (Python) -- **Formatter**: Black with line length 100 -- **Naming Conventions**: - - Classes: PascalCase (e.g., `OcrProcessor`, `ImageService`) - - Functions/Methods: snake_case (e.g., `process_image`, `export_results`) - - Constants: UPPER_SNAKE_CASE (e.g., `MAX_BATCH_SIZE`, `DEFAULT_LANG`) - - Private members: prefix with underscore (e.g., `_internal_method`) -- **Docstrings**: Google style for all public functions and classes -- **Type Hints**: Use type hints for function signatures (FastAPI requirement) -- **Imports**: Organized by 
standard library, third-party, local (separated by blank lines) -- **Encoding**: UTF-8 for all Python files - -#### Frontend (JavaScript/TypeScript) -- **Formatter**: Prettier -- **Naming Conventions**: - - Components: PascalCase (e.g., `ImageUpload`, `ResultsTable`) - - Functions/Variables: camelCase (e.g., `processImage`, `ocrResults`) - - Constants: UPPER_SNAKE_CASE (e.g., `MAX_FILE_SIZE`, `API_BASE_URL`) - - CSS Classes: kebab-case (Tailwind convention) -- **File Structure**: One component per file -- **Imports**: Group by external, internal, types - -### Architecture Patterns - -#### Backend Architecture -- **Layered Architecture**: - - Router Layer (FastAPI routes) - - Service Layer (business logic) - - Data Access Layer (database/file operations) - - Model Layer (Pydantic models) -- **Async/Await**: Use async operations for I/O bound tasks -- **Dependency Injection**: FastAPI's dependency injection for services -- **Error Handling**: Custom exception handlers with proper HTTP status codes -- **Logging**: Structured logging with log levels -- **Background Tasks**: FastAPI BackgroundTasks for long-running OCR jobs - -#### Frontend Architecture -- **Component-Based**: Reusable React components -- **Atomic Design**: atoms → molecules → organisms → templates → pages -- **API Layer**: Centralized API client with React Query -- **State Management**: Server state (React Query) + Client state (Zustand) -- **Routing**: React Router for SPA navigation -- **Error Boundaries**: Graceful error handling in UI - -#### API Design -- **RESTful**: Follow REST conventions -- **Versioning**: API versioned as `/api/v1/...` -- **Documentation**: Auto-generated via FastAPI (Swagger/OpenAPI) -- **Response Format**: Consistent JSON structure - ```json - { - "success": true, - "data": {}, - "message": "Success", - "timestamp": "2025-01-01T00:00:00Z" - } - ``` - -### Testing Strategy - -#### Backend Testing -- **Unit Tests**: Test services, utilities, data models -- **Integration Tests**: Test API endpoints end-to-end -- **Test Framework**: pytest with pytest-asyncio -- **Coverage Target**: Minimum 70% code coverage -- **Test Command**: `pytest tests/ -v --cov=app` - -#### Frontend Testing -- **Component Tests**: Test React components with Vitest + React Testing Library -- **Integration Tests**: Test user workflows -- **E2E Tests**: Optional with Playwright -- **Test Command**: `npm run test` - -### Git Workflow -- **Branching**: Feature branches from main (e.g., `feature/add-pdf-support`) -- **Commits**: Conventional Commits format (e.g., `feat:`, `fix:`, `docs:`) -- **PRs**: Require passing tests before merge -- **Versioning**: Semantic versioning (MAJOR.MINOR.PATCH) - -## Domain Context - -### OCR Concepts -- **Recognition Accuracy**: Depends on image quality, language, and font type -- **Preprocessing**: Image enhancement (contrast, denoising) can improve OCR accuracy -- **Multi-Language**: PaddleOCR supports Chinese, English, Japanese, Korean, and many others -- **Bounding Boxes**: OCR engines detect text regions before recognition -- **Confidence Scores**: Each recognized text has a confidence score (0-1) - -### Document Structure Analysis (PP-StructureV3) -- **Layout Analysis**: Automatic detection of document regions (text, images, tables, charts, formulas) -- **Table Recognition**: Extract table structure and content with support for nested formulas and images -- **Formula Recognition**: Convert mathematical formulas to LaTeX format -- **Chart Recognition** (✅ Enabled with PaddlePaddle 3.2.1+): - - 
**Chart Type Detection**: Identify bar charts, line charts, pie charts, scatter plots, etc. - **Data Extraction**: Extract numerical data points from chart visualizations - **Axis & Legend Parsing**: Recognize axis labels, tick values, and legend information - **Structured Output**: Convert chart content to JSON or tabular format - **Performance**: GPU acceleration recommended for best results (2-10 seconds per chart) - **Accuracy**: >85% for simple charts, >70% for complex multi-axis charts -- **Image Extraction**: Preserve and save embedded images from documents - -### Use Cases -- Digitizing scanned documents and images via web upload -- Extracting text from screenshots for archival -- Processing receipts and invoices for data entry -- Converting image-based PDFs to searchable text -- Batch processing multiple files via drag-and-drop interface - -### Output Rules -- Users can define custom rules for organizing extracted text -- Examples: group by file name pattern, filter by confidence threshold, format as structured data -- Export formats: plain text files, JSON with metadata, Excel spreadsheets - -## Important Constraints - -### Technical Constraints -- **Platform**: Windows 10/11 and WSL2 (development), 1Panel-managed Linux server (deployment) -- **Web Application**: Browser-based interface (Chrome, Firefox, Edge) -- **Local Processing**: All OCR processing happens on backend server (no cloud dependencies) -- **Resource Intensive**: OCR is CPU/GPU intensive; consider task queue for batch processing -- **File Size Limits**: Set max upload size (e.g., 20MB per file, 100MB per batch) -- **Language Models**: PaddleOCR models must be downloaded (~100MB+ per language) -- **Conda Environment**: Backend development must be done within Conda virtual environment -- **Port Range**: Web services must use ports 12010-12019 - -### User Experience Constraints -- **Target Users**: Non-technical users who need simple batch OCR via web -- **Browser Compatibility**: Modern browsers (Chrome 90+, Firefox 88+, Edge 90+) -- **Performance**: UI must show progress feedback during OCR processing -- **Error Messages**: Clear, actionable error messages in Traditional Chinese -- **Responsive Design**: UI should work on desktop and tablet (mobile optional) - -### Business Constraints -- **Open Source**: Use only open-source libraries (no paid API dependencies) -- **Deployment**: 1Panel-based deployment (no Docker required) -- **Offline Capable**: Must work without internet after initial setup (except model downloads) -- **Authentication**: JWT-based auth (optional LDAP integration for enterprise) - -### Security Constraints -- **File Upload**: Validate file types, scan for malware (optional) -- **Authentication**: JWT tokens with expiration -- **CORS**: Configure CORS for frontend-backend communication -- **Input Validation**: Strict validation on all API inputs - -## External Dependencies - -### Database Configuration -- **MySQL Host**: mysql.theaken.com -- **MySQL Port**: 33306 -- **MySQL User**: A060 -- **MySQL Password**: WLeSCi0yhtc7 -- **MySQL Database**: db_A060 -- **MySQL Charset**: utf8mb4 - -### SMTP Configuration (Optional) -- **SMTP Server**: mail.panjit.com.tw -- **SMTP Port**: 25 -- **SMTP TLS**: false -- **SMTP Auth**: false -- **Sender Email**: tool-ocr-system@panjit.com.tw - -### LDAP Configuration (Optional) -- **LDAP Server**: panjit.com.tw -- **LDAP Port**: 389 - -### Conda Environment -- **Environment Name**: `tool_ocr` -- **Python Version**: 3.10 -- **Base Path**: `C:\Users\lin46\.conda\envs\tool_ocr` -
**Activation**: Always activate environment before backend development - -### OCR Models -- **PaddleOCR Models**: Downloaded automatically on first run or manually installed -- **Model Storage**: Local cache directory or Docker volume -- **Supported Languages**: Chinese (simplified/traditional), English, Japanese, Korean, etc. -- **Model Size**: ~100-200MB per language pack - -### System Requirements -- **Python**: 3.10+ (managed by Conda or venv) -- **Node.js**: 18+ (for frontend development and build) -- **RAM**: Minimum 4GB (8GB recommended for batch processing, 16GB+ for GPU usage) -- **Disk Space**: ~2GB for application + models + dependencies -- **OS**: Windows 10/11 (development), WSL2 Ubuntu 24.04 (development), Linux (1Panel deployment server) -- **GPU** (Optional but recommended): - - NVIDIA GPU with CUDA 11.8, 12.3, or 12.6+ support - - GPU Memory: Minimum 4GB (8GB+ recommended for chart recognition) - - WSL2 GPU: NVIDIA CUDA drivers installed for WSL - - Performance: 3-10x speedup for OCR and chart recognition -- **Web Server**: Nginx (for static files and reverse proxy) -- **Process Manager**: Supervisor / PM2 / systemd (for backend service) - -### Port Configuration -- **Backend API**: 12010 (FastAPI via uvicorn) -- **Frontend Dev Server**: 12011 (Vite, development only) -- **Nginx**: 80/443 (production, managed by 1Panel) -- **MySQL**: 33306 (external) -- **Redis**: 6379 (optional, local) - -### Deployment Architecture (1Panel) -- **Development**: Windows with Conda + local Node.js -- **Production**: Linux server managed by 1Panel -- **Backend Deployment**: - - Conda environment on production server - - uvicorn runs FastAPI on port 12010 - - Managed by Supervisor/PM2/systemd for auto-restart -- **Frontend Deployment**: - - Build static files with `npm run build` - - Served by Nginx (configured via 1Panel) - - Nginx reverse proxies `/api` to backend (12010) -- **1Panel Features**: - - Website management (Nginx configuration) - - Process management (backend service) - - SSL certificate management (Let's Encrypt) - - File management and deployment - -### Configuration Files -- **Backend**: - - `environment.yml`: Conda environment specification - - `requirements.txt`: Pip dependencies - - `.env`: Environment variables (database, JWT secret, etc.) - - `config.yaml`: Application configuration - - `start.sh`: Backend startup script -- **Frontend**: - - `package.json`: npm dependencies - - `.env.production`: Production environment variables (API URL) - - `vite.config.js`: Vite configuration - - `build.sh`: Frontend build script -- **Deployment**: - - `nginx.conf`: Nginx reverse proxy configuration - - `supervisor.conf` or `pm2.config.js`: Process manager configuration - - `deploy.sh`: Deployment automation script diff --git a/openspec/specs/document-processing/spec.md b/openspec/specs/document-processing/spec.md deleted file mode 100644 index 771a876..0000000 --- a/openspec/specs/document-processing/spec.md +++ /dev/null @@ -1,183 +0,0 @@ -# document-processing Specification - -## Purpose -TBD - created by archiving change dual-track-document-processing. Update Purpose after archive. -## Requirements -### Requirement: Dual-track Processing -The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs. 
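The scenarios that follow pin this routing down. As a rough sketch of the decision itself, assuming a text-coverage score computed during type detection (the function name, the 0.9 threshold, and the override parameter are illustrative, drawn from the detection requirement below, not the shipped implementation):

```python
from enum import Enum


class Track(str, Enum):
    OCR = "ocr"
    DIRECT = "direct"


def select_track(is_pdf: bool, text_coverage: float,
                 force: Track | None = None) -> Track:
    """Choose a processing track.

    text_coverage is the fraction of the document with extractable text;
    force mirrors the API's explicit track override.
    """
    if force is not None:
        return force
    if is_pdf and text_coverage > 0.9:  # editable PDF: skip OCR entirely
        return Track.DIRECT
    return Track.OCR  # scanned PDFs and images go through the OCR pipeline


# A scanned PDF routes to OCR; a text-rich PDF routes to direct extraction.
assert select_track(is_pdf=True, text_coverage=0.0) is Track.OCR
assert select_track(is_pdf=True, text_coverage=0.95) is Track.DIRECT
```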
- -#### Scenario: Process scanned PDF through OCR track -- **WHEN** a scanned PDF is uploaded -- **THEN** the system SHALL detect it requires OCR -- **AND** route it through PaddleOCR PP-StructureV3 pipeline -- **AND** return results in UnifiedDocument format - -#### Scenario: Process editable PDF through direct extraction -- **WHEN** an editable PDF with extractable text is uploaded -- **THEN** the system SHALL detect it can be directly extracted -- **AND** route it through PyMuPDF extraction pipeline -- **AND** return results in UnifiedDocument format without OCR - -#### Scenario: Auto-detect processing track -- **WHEN** a document is uploaded without explicit track specification -- **THEN** the system SHALL analyze the document type and content -- **AND** automatically select the optimal processing track -- **AND** include the selected track in processing metadata - -### Requirement: Document Type Detection -The system SHALL provide intelligent document type detection to determine the optimal processing track. - -#### Scenario: Detect editable PDF -- **WHEN** analyzing a PDF document -- **THEN** the system SHALL check for extractable text content -- **AND** return confidence score for editability -- **AND** recommend "direct" track if text coverage > 90% - -#### Scenario: Detect scanned document -- **WHEN** analyzing an image or scanned PDF -- **THEN** the system SHALL identify lack of extractable text -- **AND** recommend "ocr" track for processing -- **AND** configure appropriate OCR models - -#### Scenario: Detect Office documents -- **WHEN** analyzing .docx, .xlsx, .pptx files -- **THEN** the system SHALL identify Office format -- **AND** route to OCR track for initial implementation -- **AND** preserve option for future direct Office extraction - -### Requirement: Unified Document Model -The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks. - -#### Scenario: Generate UnifiedDocument from OCR -- **WHEN** OCR processing completes -- **THEN** the system SHALL convert PP-StructureV3 results to UnifiedDocument -- **AND** preserve all element types, coordinates, and confidence scores -- **AND** maintain reading order and hierarchical structure - -#### Scenario: Generate UnifiedDocument from direct extraction -- **WHEN** direct extraction completes -- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument -- **AND** preserve text styling, fonts, and exact positioning -- **AND** extract tables with cell boundaries and content - -#### Scenario: Consistent output regardless of track -- **WHEN** processing completes through either track -- **THEN** the output SHALL conform to UnifiedDocument schema -- **AND** include processing_track metadata field -- **AND** support identical downstream operations (PDF generation, translation) - -### Requirement: Enhanced OCR with Full PP-StructureV3 -The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates. 
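A minimal sketch of flattening `parsing_res_list` into ordered elements, ahead of the scenarios below. The `block_label` / `block_bbox` / `block_content` key names are an assumption about the per-page PP-StructureV3 JSON, not a guaranteed schema:

```python
def extract_elements(page_result: dict) -> list[dict]:
    """Flatten PP-StructureV3 parsing_res_list, preserving reading order."""
    elements = []
    for order, block in enumerate(page_result.get("parsing_res_list", [])):
        elements.append({
            "type": block.get("block_label", "text"),  # text, table, figure, ...
            "bbox": block.get("block_bbox"),           # layout_bbox coordinates
            "content": block.get("block_content", ""),
            "reading_order": order,                    # sequential index
        })
    return elements
```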
- -#### Scenario: Extract comprehensive document structure -- **WHEN** processing through OCR track -- **THEN** the system SHALL use page_result.json['parsing_res_list'] -- **AND** extract all element types including headers, lists, tables, figures -- **AND** preserve layout_bbox coordinates for each element - -#### Scenario: Maintain reading order -- **WHEN** extracting elements from PP-StructureV3 -- **THEN** the system SHALL preserve the reading order from parsing_res_list -- **AND** assign sequential indices to elements -- **AND** support reordering for complex layouts - -#### Scenario: Extract table structure -- **WHEN** PP-StructureV3 identifies a table -- **THEN** the system SHALL extract cell content and boundaries -- **AND** validate cell_boxes coordinates against page boundaries -- **AND** apply fallback detection for invalid coordinates -- **AND** preserve table HTML for structure -- **AND** extract plain text for translation - -#### Scenario: Extract visual elements with paths -- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM) -- **THEN** the system SHALL preserve saved_path for each element -- **AND** include image dimensions and format -- **AND** enable image embedding in output PDF - -### Requirement: Structure-Preserving Translation Foundation -The system SHALL maintain document structure and layout information to support future translation features. - -#### Scenario: Preserve coordinates for translation -- **WHEN** processing any document -- **THEN** the system SHALL retain bbox coordinates for all text elements -- **AND** calculate space requirements for text expansion/contraction -- **AND** maintain element relationships and groupings - -#### Scenario: Extract translatable content -- **WHEN** processing tables and lists -- **THEN** the system SHALL extract plain text content -- **AND** maintain mapping to original structure -- **AND** preserve formatting markers for reconstruction - -#### Scenario: Support layout adjustment -- **WHEN** preparing for translation -- **THEN** the system SHALL identify flexible vs fixed layout regions -- **AND** calculate maximum text expansion ratios -- **AND** preserve non-translatable elements (logos, signatures) - -### Requirement: Generate UnifiedDocument from direct extraction -The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging. 
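The scenarios below lean on PyMuPDF's `find_tables()` API. A minimal sketch of merge-aware extraction, assuming PyMuPDF 1.23+ where `extract()` yields `None` for placeholder cells covered by a rowspan/colspan:

```python
import fitz  # PyMuPDF


def extract_page_tables(pdf_path: str, page_no: int = 0) -> list[list[list[str]]]:
    """Return tables on one page as row-major grids of cell text."""
    with fitz.open(pdf_path) as doc:
        page = doc[page_no]
        grids = []
        for table in page.find_tables().tables:
            rows = table.extract()  # None marks cells swallowed by a merge
            grids.append([["" if cell is None else cell for cell in row]
                          for row in rows])
        return grids
```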
- -#### Scenario: Extract tables with cell merging -- **WHEN** direct extraction encounters a table -- **THEN** the system SHALL use PyMuPDF find_tables() API -- **AND** extract cell content with correct rowspan/colspan -- **AND** preserve merged cell boundaries -- **AND** skip placeholder cells covered by merges - -#### Scenario: Filter decoration images -- **WHEN** extracting images from PDF -- **THEN** the system SHALL filter images smaller than minimum area threshold -- **AND** exclude covering/redaction images -- **AND** preserve meaningful content images - -#### Scenario: Preserve text styling with image handling -- **WHEN** direct extraction completes -- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument -- **AND** preserve text styling, fonts, and exact positioning -- **AND** extract tables with cell boundaries, content, and merge info -- **AND** include only meaningful images in output - -### Requirement: Direct Track Background Image Rendering - -The system SHALL render Direct Track PDF output using a full-page background image with an invisible text overlay to preserve visual fidelity while maintaining text extractability. - -#### Scenario: Render Direct Track PDF with background image -- **WHEN** generating Layout PDF for a Direct Track document -- **THEN** the system SHALL render each source PDF page as a full-page background image at 2x resolution -- **AND** overlay invisible text elements using PDF Text Rendering Mode 3 -- **AND** the invisible text SHALL be positioned at original coordinates for accurate selection - -#### Scenario: Handle Office documents (PPT, DOC, XLS) -- **WHEN** processing an Office document converted to PDF -- **THEN** the system SHALL use the same background image + invisible text approach -- **AND** preserve all visual elements including vector graphics, gradients, and complex layouts -- **AND** the converted PDF in result directory SHALL be used as background source - -#### Scenario: Handle native editable PDFs -- **WHEN** processing a native PDF through Direct Track -- **THEN** the system SHALL use the source PDF for background rendering -- **AND** apply the same invisible text overlay approach -- **AND** chart regions SHALL be excluded from the text layer - -### Requirement: Chart Region Text Exclusion - -The system SHALL exclude text elements within chart regions from the invisible text layer to prevent duplicate content and unnecessary translation. 
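A minimal sketch of that exclusion, assuming axis-aligned boxes as `(x0, y0, x1, y1)` tuples; the `regions_to_avoid` name follows the scenarios below, everything else is illustrative:

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlaps(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def invisible_layer_texts(text_elements: list[dict],
                          regions_to_avoid: list[Box]) -> list[dict]:
    """Drop text whose bbox falls inside any chart region; that text stays
    visible in the background image and is never sent for translation."""
    return [e for e in text_elements
            if not any(overlaps(e["bbox"], chart) for chart in regions_to_avoid)]
```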
- -#### Scenario: Detect chart regions -- **WHEN** classifying page elements for Direct Track -- **THEN** the system SHALL identify elements with type CHART -- **AND** add chart bounding boxes to regions_to_avoid list - -#### Scenario: Exclude chart-internal text from invisible layer -- **WHEN** rendering invisible text layer -- **THEN** the system SHALL skip text elements whose bounding boxes overlap with chart regions -- **AND** chart axis labels, legends, and data labels SHALL NOT be in the invisible text layer -- **AND** these texts remain visible in the background image - -#### Scenario: Chart text not available for translation -- **WHEN** extracting text for translation from a Direct Track document -- **THEN** chart-internal text SHALL NOT be included in translatable elements -- **AND** this is expected behavior as chart labels typically don't require translation - diff --git a/openspec/specs/frontend-ui/spec.md b/openspec/specs/frontend-ui/spec.md deleted file mode 100644 index 980442b..0000000 --- a/openspec/specs/frontend-ui/spec.md +++ /dev/null @@ -1,188 +0,0 @@ -# frontend-ui Specification - -## Purpose -TBD - created by archiving change refactor-frontend-ux-i18n. Update Purpose after archive. -## Requirements -### Requirement: Minimal Login Page Design - -The login page SHALL use a professional minimal design style that is consistent with the rest of the application. - -The login page SHALL NOT include: -- Animated gradient backgrounds -- Floating decorative elements (orbs, particles) -- Pulsing or floating animations -- Marketing claims or statistics -- Feature promotion sections - -The login page SHALL include: -- Centered login form with clean white card -- Application logo and name -- Username and password input fields -- Login button with loading state -- Error message display area -- Simple solid color background - -#### Scenario: Login page renders with minimal design -- **WHEN** user navigates to the login page -- **THEN** the page displays a centered login form -- **AND** no animated decorative elements are visible -- **AND** no marketing content is displayed - -#### Scenario: Login form visual consistency -- **WHEN** comparing login page to internal pages -- **THEN** the visual style (colors, typography, spacing) is consistent - ---- - -### Requirement: Multi-language Support - -The application SHALL support multiple languages with user-selectable language preference. - -Supported languages: -- Traditional Chinese (zh-TW) - Default -- English (en-US) - -The language selection SHALL be persisted in localStorage and restored on page reload. 
- -#### Scenario: Language switcher available -- **WHEN** user is logged in and viewing any page -- **THEN** a language switcher component is visible in the top navigation bar - -#### Scenario: Switch to English -- **WHEN** user selects English from the language switcher -- **THEN** all UI text immediately changes to English -- **AND** the preference is saved to localStorage - -#### Scenario: Switch to Traditional Chinese -- **WHEN** user selects Traditional Chinese from the language switcher -- **THEN** all UI text immediately changes to Traditional Chinese -- **AND** the preference is saved to localStorage - -#### Scenario: Language preference persistence -- **WHEN** user has previously selected a language preference -- **AND** user reloads the page or returns later -- **THEN** the application displays in the previously selected language - ---- - -### Requirement: Accurate Product Description - -All user-facing text SHALL accurately describe the product capabilities without exaggeration. - -The application SHALL NOT display: -- Unverified accuracy percentages (e.g., "99% accuracy") -- Superlative marketing claims (e.g., "lightning fast", "enterprise-grade") -- Unsubstantiated statistics -- Comparative claims without evidence - -The application MAY display: -- Factual feature descriptions -- Supported file formats -- Authentication method information - -#### Scenario: Login page displays factual information -- **WHEN** user views the login page -- **THEN** only factual product information is displayed -- **AND** no unverified claims are present - -#### Scenario: Feature descriptions are accurate -- **WHEN** any page describes product features -- **THEN** the descriptions are factual and verifiable - -### Requirement: Batch Processing Support - -The system SHALL support batch processing of multiple uploaded files with a single configuration. - -After uploading multiple files, the user SHALL be able to: -- Configure processing settings once for all files -- Start processing all files with one action -- Monitor progress of all files in a unified view - -#### Scenario: Multiple files uploaded -- **WHEN** user uploads multiple files -- **AND** navigates to processing page -- **THEN** the system displays batch processing mode -- **AND** shows all pending tasks in a list - -#### Scenario: Batch configuration -- **WHEN** user is in batch processing mode -- **THEN** user can select a processing strategy (auto/OCR/Direct) -- **AND** user can configure layout model for OCR tasks -- **AND** user can configure preprocessing for OCR tasks -- **AND** settings apply to all applicable tasks - ---- - -### Requirement: Batch Processing Strategy - -The system SHALL support three batch processing strategies: - -1. **Auto Detection** (default): System analyzes each file and selects optimal track -2. **Force OCR**: All files processed with OCR track -3. 
**Force Direct**: All PDF files processed with Direct track - -#### Scenario: Auto detection strategy -- **WHEN** user selects auto detection strategy -- **THEN** the system analyzes each file before processing -- **AND** assigns OCR or Direct track based on file characteristics - -#### Scenario: Force OCR strategy -- **WHEN** user selects force OCR strategy -- **THEN** all files are processed using OCR track -- **AND** layout model and preprocessing settings are applied - -#### Scenario: Force Direct strategy -- **WHEN** user selects force Direct strategy -- **AND** file is a PDF -- **THEN** the file is processed using Direct track - ---- - -### Requirement: Parallel Processing Limits - -The system SHALL enforce different parallelism limits based on processing track: - -- Direct Track: Maximum 5 concurrent tasks (CPU-based) -- OCR Track: Maximum 1 concurrent task (GPU VRAM constraint) - -Direct and OCR tasks MAY run simultaneously as they use different resources. - -#### Scenario: Direct track parallelism -- **WHEN** batch contains multiple Direct track tasks -- **THEN** up to 5 tasks process concurrently -- **AND** remaining tasks wait in queue - -#### Scenario: OCR track serialization -- **WHEN** batch contains multiple OCR track tasks -- **THEN** only 1 task processes at a time -- **AND** remaining tasks wait in queue - -#### Scenario: Mixed track processing -- **WHEN** batch contains both Direct and OCR tasks -- **THEN** Direct tasks run in parallel pool (max 5) -- **AND** OCR tasks run in serial queue (max 1) -- **AND** both pools operate simultaneously - ---- - -### Requirement: Batch Progress Display - -The system SHALL display unified progress for batch processing. - -Progress display SHALL include: -- Overall progress (completed / total) -- Count by status (processing, completed, failed) -- Individual task status list -- Estimated time remaining (optional) - -#### Scenario: Batch progress monitoring -- **WHEN** batch processing is in progress -- **THEN** user sees overall completion percentage -- **AND** user sees count of tasks in each status -- **AND** user sees status of each individual task - -#### Scenario: Batch completion -- **WHEN** all tasks in batch are completed or failed -- **THEN** user sees final summary -- **AND** user can navigate to results page - diff --git a/openspec/specs/ocr-processing/spec.md b/openspec/specs/ocr-processing/spec.md deleted file mode 100644 index 9317d42..0000000 --- a/openspec/specs/ocr-processing/spec.md +++ /dev/null @@ -1,311 +0,0 @@ -# ocr-processing Specification - -## Purpose -TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive. -## Requirements -### Requirement: OCR Track Gap Filling with Raw OCR Regions - -The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected. 
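The scenarios below make the behavior precise. As a compact sketch of the IoA computation and the coverage test that triggers gap filling (defaults mirror the thresholds described below; names are illustrative):

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def ioa(ocr_box: Box, elem_box: Box) -> float:
    """Intersection over Area of the OCR box (not IoU), so a small OCR box
    fully inside a large layout element scores 1.0."""
    x0, y0 = max(ocr_box[0], elem_box[0]), max(ocr_box[1], elem_box[1])
    x1, y1 = min(ocr_box[2], elem_box[2]), min(ocr_box[3], elem_box[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    return inter / area if area > 0 else 0.0


def needs_gap_filling(ocr_boxes: list[Box], elem_boxes: list[Box],
                      ioa_threshold: float = 0.6,
                      coverage_threshold: float = 0.7) -> bool:
    """Activate gap filling when too few raw OCR regions are covered."""
    if not ocr_boxes:
        return False
    covered = sum(1 for ob in ocr_boxes
                  if any(ioa(ob, eb) >= ioa_threshold for eb in elem_boxes))
    return covered / len(ocr_boxes) < coverage_threshold
```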
- -#### Scenario: Gap filling activates when coverage is low -- **GIVEN** an OCR track processing task -- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions -- **THEN** the system SHALL activate gap filling -- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element -- **AND** supplement these regions as TEXT elements in the output - -#### Scenario: Coverage is determined by IoA (Intersection over Area) -- **GIVEN** a Raw OCR text region with bounding box -- **WHEN** checking if the region is covered by PP-StructureV3 -- **THEN** the region SHALL be considered covered if IoA (intersection area / OCR box area) exceeds the type-specific threshold -- **AND** IoA SHALL be used instead of IoU because it correctly measures "small box contained in large box" relationship -- **AND** regions not meeting the IoA criterion SHALL be marked as uncovered - -#### Scenario: Element-type-specific IoA thresholds are applied -- **GIVEN** a Raw OCR region being evaluated for coverage -- **WHEN** comparing against PP-StructureV3 elements of different types -- **THEN** the system SHALL apply different IoA thresholds: - - TEXT, TITLE, HEADER, FOOTER: IoA > 0.6 (tolerates boundary errors) - - TABLE: IoA > 0.1 (strict filtering to preserve table structure) - - FIGURE, IMAGE: IoA > 0.8 (preserves text within figures like axis labels) -- **AND** a region is considered covered if it meets the threshold for ANY overlapping element - -#### Scenario: Only TEXT elements are supplemented -- **GIVEN** uncovered Raw OCR regions identified for supplementation -- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements -- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements -- **AND** only supplement regions as TEXT type to preserve structural integrity - -#### Scenario: Supplemented regions meet confidence threshold -- **GIVEN** Raw OCR regions to be supplemented -- **WHEN** a region has confidence score below 0.3 -- **THEN** the system SHALL skip that region -- **AND** only supplement regions with confidence >= 0.3 - -#### Scenario: Deduplication uses IoA instead of IoU -- **GIVEN** a Raw OCR region being considered for supplementation -- **WHEN** the region has IoA > 0.5 with any existing PP-StructureV3 TEXT element -- **THEN** the system SHALL skip that region to prevent duplicate text -- **AND** the original PP-StructureV3 element SHALL be preserved - -#### Scenario: Reading order is recalculated after gap filling -- **GIVEN** supplemented elements have been added to the page -- **WHEN** assembling the final element list -- **THEN** the system SHALL recalculate reading order for the entire page -- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right) -- **AND** ensure logical document flow is maintained - -#### Scenario: Coordinate alignment with ocr_dimensions -- **GIVEN** Raw OCR processing may involve image resizing -- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox -- **THEN** the system SHALL use ocr_dimensions to normalize coordinates -- **AND** ensure both sources reference the same coordinate space -- **AND** prevent coverage misdetection due to scale differences - -#### Scenario: Supplemented elements have complete metadata -- **GIVEN** a Raw OCR region being added as supplemented element -- **WHEN** creating the DocumentElement -- **THEN** the element SHALL include page_number -- **AND** include confidence score from Raw OCR -- **AND** include 
original bbox coordinates -- **AND** optionally include source indicator for debugging - -### Requirement: Gap Filling Track Isolation - -The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs. - -#### Scenario: Gap filling only activates for OCR track -- **GIVEN** a document processing task -- **WHEN** the processing track is OCR -- **THEN** the system SHALL evaluate and apply gap filling as needed -- **AND** produce enhanced output with supplemented content - -#### Scenario: Direct track is unaffected -- **GIVEN** a document processing task with Direct track -- **WHEN** the task is processed -- **THEN** the system SHALL NOT invoke any gap filling logic -- **AND** produce output identical to current Direct track behavior - -#### Scenario: Hybrid track is unaffected -- **GIVEN** a document processing task with Hybrid track -- **WHEN** the task is processed -- **THEN** the system SHALL NOT invoke gap filling logic -- **AND** use existing Hybrid track processing pipeline - -### Requirement: Gap Filling Configuration - -The system SHALL provide configurable parameters for gap filling behavior. - -#### Scenario: Gap filling can be disabled via configuration -- **GIVEN** gap_filling_enabled is set to false in configuration -- **WHEN** OCR track processing runs -- **THEN** the system SHALL skip all gap filling logic -- **AND** output only PP-StructureV3 results as before - -#### Scenario: Coverage threshold is configurable -- **GIVEN** gap_filling_coverage_threshold is set to 0.8 -- **WHEN** PP-StructureV3 coverage is 75% -- **THEN** the system SHALL activate gap filling -- **AND** supplement uncovered regions - -#### Scenario: IoA thresholds are configurable per element type -- **GIVEN** custom IoA thresholds configured: - - gap_filling_ioa_threshold_text: 0.6 - - gap_filling_ioa_threshold_table: 0.1 - - gap_filling_ioa_threshold_figure: 0.8 - - gap_filling_dedup_ioa_threshold: 0.5 -- **WHEN** evaluating coverage and deduplication -- **THEN** the system SHALL use the configured values -- **AND** apply them consistently throughout gap filling process - -#### Scenario: Confidence threshold is configurable -- **GIVEN** gap_filling_confidence_threshold is set to 0.5 -- **WHEN** supplementing Raw OCR regions -- **THEN** the system SHALL only include regions with confidence >= 0.5 -- **AND** filter out lower confidence regions - -#### Scenario: Boundary shrinking reduces edge duplicates -- **GIVEN** gap_filling_shrink_pixels is set to 1 -- **WHEN** evaluating coverage with IoA -- **THEN** the system SHALL shrink OCR bounding boxes inward by 1 pixel on each side -- **AND** this reduces false "uncovered" detection at region boundaries - -### Requirement: Layout Model Selection -The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning. 
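The scenarios below cover the API contract. As a sketch of mapping the request's `layout_model` value to a concrete detector: only the CDLA identifier is spelled out in this spec, so the other two detector ids are assumptions:

```python
from enum import Enum


class LayoutModel(str, Enum):
    CHINESE = "chinese"    # PP-DocLayout-S, the default
    STANDARD = "standard"  # PubLayNet-based model
    CDLA = "cdla"          # picodet_lcnet_x1_0_fgd_layout_cdla


DETECTOR_IDS = {
    LayoutModel.CHINESE: "PP-DocLayout-S",
    LayoutModel.STANDARD: "publaynet",  # placeholder id, assumed
    LayoutModel.CDLA: "picodet_lcnet_x1_0_fgd_layout_cdla",
}


def resolve_layout_model(value: str | None) -> str:
    """Default to the Chinese model; an unknown value raises ValueError,
    which the router can surface as a 422 validation error."""
    choice = LayoutModel(value) if value else LayoutModel.CHINESE
    return DETECTOR_IDS[choice]
```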
- -#### Scenario: User selects Chinese document model -- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices) -- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S) -- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model -- **AND** the model SHALL be optimized for 23 Chinese document element types -- **AND** table and form detection accuracy SHALL be improved over the default model - -#### Scenario: User selects standard model for English documents -- **GIVEN** a user is processing English academic papers or reports -- **WHEN** the user selects "Standard Model" (PubLayNet-based) -- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model -- **AND** the model SHALL be optimized for English document layouts - -#### Scenario: User selects CDLA model for specialized Chinese layout -- **GIVEN** a user is processing Chinese documents with complex layouts -- **WHEN** the user selects "CDLA Model" -- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model -- **AND** the model SHALL provide specialized Chinese document layout analysis - -#### Scenario: Layout model is sent via API request -- **GIVEN** a frontend application with model selection UI -- **WHEN** the user starts task processing with a selected model -- **THEN** the frontend SHALL send the model choice in the request body: - ```json - POST /api/v2/tasks/{task_id}/start - { - "use_dual_track": true, - "force_track": "ocr", - "language": "ch", - "layout_model": "chinese" - } - ``` -- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model - -#### Scenario: Default model when not specified -- **GIVEN** an API request without `layout_model` parameter -- **WHEN** the task is started -- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model -- **AND** processing SHALL work correctly without requiring model selection - -#### Scenario: Invalid model name is rejected -- **GIVEN** a request with an invalid `layout_model` value -- **WHEN** the user sends `layout_model: "invalid_model"` -- **THEN** the API SHALL return 422 Validation Error -- **AND** provide a clear error message listing valid model options - -### Requirement: Layout Model Selection UI -The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option. 
- -#### Scenario: Model options are displayed with descriptions -- **GIVEN** the model selection UI is displayed -- **WHEN** the user views the available options -- **THEN** the UI SHALL show the following options: - - "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices - - "Standard Model" - for English academic papers, reports - - "CDLA Model" - for specialized Chinese layout analysis -- **AND** each option SHALL have a brief description of its use case - -#### Scenario: Chinese model is selected by default -- **GIVEN** the user opens the task processing interface -- **WHEN** the model selection is displayed -- **THEN** "Chinese Document Model" SHALL be pre-selected as the default -- **AND** the user MAY change the selection before starting processing - -#### Scenario: Model selection is visible only for OCR track -- **GIVEN** a document processing interface -- **WHEN** the user selects processing track -- **THEN** layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected -- **AND** SHALL be hidden for Direct track (which does not use PP-StructureV3) - -### Requirement: Model Cache Cleanup - -The system SHALL provide documentation for cleaning up unused model caches to optimize storage space. - -#### Scenario: User wants to free disk space after model upgrade -- **WHEN** the user has upgraded from older models (PP-DocLayout-S, SLANet) to newer models -- **THEN** the documentation SHALL explain how to delete unused cached models from `~/.paddlex/official_models/` -- **AND** list which model directories can be safely removed - -### Requirement: Cell Over-Detection Filtering - -The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells. - -#### Scenario: Cell density exceeds threshold -- **GIVEN** a table detected by PP-StructureV3 with cell_boxes -- **WHEN** cell density exceeds 3.0 cells per 10,000 px² -- **THEN** the system SHALL flag the table as over-detected -- **AND** reclassify the table as a TEXT element - -#### Scenario: Average cell area below threshold -- **GIVEN** a table detected by PP-StructureV3 -- **WHEN** average cell area is less than 3,000 px² -- **THEN** the system SHALL flag the table as over-detected -- **AND** reclassify the table as a TEXT element - -#### Scenario: Cell height too small -- **GIVEN** a table with height H and N cells -- **WHEN** (H / N) is less than 10 pixels -- **THEN** the system SHALL flag the table as over-detected -- **AND** reclassify the table as a TEXT element - -#### Scenario: Valid tables are preserved -- **GIVEN** a table with normal metrics (density < 3.0, avg area > 3000, height/N > 10) -- **WHEN** validation is applied -- **THEN** the table SHALL be preserved unchanged -- **AND** all cell_boxes SHALL be retained - -### Requirement: Table-to-Text Reclassification - -The system SHALL convert over-detected tables to TEXT elements while preserving content. 
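Reclassification consumes the flags produced by the validation requirement above. A minimal sketch of that metric check using the default thresholds (names are illustrative, not the shipped code):

```python
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class CellThresholds:
    max_cell_density: float = 3.0      # cells per 10,000 px^2
    min_avg_cell_area: float = 3000.0  # px^2
    min_cell_height: float = 10.0      # px, table height / cell count


def is_over_detected(table_box: Box, cell_boxes: list[Box],
                     t: CellThresholds = CellThresholds()) -> bool:
    """True when a detected table should be reclassified as TEXT."""
    width = table_box[2] - table_box[0]
    height = table_box[3] - table_box[1]
    if not cell_boxes or width <= 0 or height <= 0:
        return False
    density = len(cell_boxes) / (width * height / 10_000)
    avg_area = sum((c[2] - c[0]) * (c[3] - c[1]) for c in cell_boxes) / len(cell_boxes)
    return (density > t.max_cell_density
            or avg_area < t.min_avg_cell_area
            or height / len(cell_boxes) < t.min_cell_height)
```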
- -#### Scenario: Table content is preserved -- **GIVEN** a table flagged for reclassification -- **WHEN** converting to TEXT element -- **THEN** the system SHALL extract text content from table HTML -- **AND** preserve the original bounding box -- **AND** set element type to TEXT - -#### Scenario: Reading order is recalculated -- **GIVEN** tables have been reclassified as TEXT -- **WHEN** assembling the final page structure -- **THEN** the system SHALL recalculate reading order -- **AND** sort elements by y0 then x0 coordinates - -### Requirement: Validation Configuration - -The system SHALL provide configurable thresholds for cell validation. - -#### Scenario: Default thresholds are applied -- **GIVEN** no custom configuration is provided -- **WHEN** validating tables -- **THEN** the system SHALL use default thresholds: - - max_cell_density: 3.0 cells/10000px² - - min_avg_cell_area: 3000 px² - - min_cell_height: 10 px - -#### Scenario: Custom thresholds can be configured -- **GIVEN** custom validation thresholds in configuration -- **WHEN** validating tables -- **THEN** the system SHALL use the custom values -- **AND** apply them consistently to all pages - -### Requirement: Use PP-StructureV3 Internal OCR Results - -The system SHALL preferentially use PP-StructureV3's internal OCR results (`overall_ocr_res`) instead of running a separate Raw OCR inference. - -#### Scenario: Extract overall_ocr_res from PP-StructureV3 -- **GIVEN** PP-StructureV3 processing completes -- **WHEN** the result contains `json['res']['overall_ocr_res']` -- **THEN** the system SHALL extract OCR regions from: - - `dt_polys`: detection box polygons - - `rec_texts`: recognized text strings - - `rec_scores`: confidence scores -- **AND** convert these to the standard TextRegion format for gap filling - -#### Scenario: Skip separate Raw OCR when overall_ocr_res is available -- **GIVEN** gap_filling_use_overall_ocr is true (default) -- **WHEN** PP-StructureV3 result contains overall_ocr_res -- **THEN** the system SHALL NOT execute separate PaddleOCR inference -- **AND** use the extracted overall_ocr_res as the OCR source -- **AND** this reduces total inference time by approximately 50% - -#### Scenario: Fallback to separate Raw OCR when needed -- **GIVEN** gap_filling_use_overall_ocr is false OR overall_ocr_res is missing -- **WHEN** gap filling is activated -- **THEN** the system SHALL execute separate PaddleOCR inference as before -- **AND** use the separate OCR results for gap filling -- **AND** this maintains backward compatibility - -#### Scenario: Coordinate consistency is guaranteed -- **GIVEN** overall_ocr_res is extracted from PP-StructureV3 -- **WHEN** comparing with PP-StructureV3 layout elements -- **THEN** both SHALL use the same coordinate system -- **AND** no additional coordinate alignment is needed -- **AND** this prevents scale mismatch issues - diff --git a/openspec/specs/result-export/spec.md b/openspec/specs/result-export/spec.md deleted file mode 100644 index 5712e0d..0000000 --- a/openspec/specs/result-export/spec.md +++ /dev/null @@ -1,207 +0,0 @@ -# result-export Specification - -## Purpose -TBD - created by archiving change fix-v2-api-ui-issues. Update Purpose after archive. -## Requirements -### Requirement: Export Interface - -The Export interface in TaskDetailPage SHALL provide streamlined download options focusing on PDF formats. 
- -#### Scenario: Download options for completed tasks -- **WHEN** viewing a completed task in TaskDetailPage -- **THEN** the download section SHALL display only two buttons: "版面 PDF" and "流式 PDF" -- **AND** JSON, UnifiedDocument, and Markdown download buttons SHALL NOT be displayed -- **AND** the download grid SHALL use a 2-column layout - -#### Scenario: Translation download options -- **WHEN** viewing completed translations in TaskDetailPage -- **THEN** each translation item SHALL display only a "流式 PDF" download button -- **AND** translation JSON download button SHALL NOT be displayed -- **AND** Layout PDF option for translations SHALL NOT be displayed -- **AND** delete translation button SHALL remain available - -#### Scenario: Backend API remains unchanged -- **WHEN** external clients call download endpoints directly -- **THEN** JSON, Markdown, and UnifiedDocument endpoints SHALL still function -- **AND** translated Layout PDF endpoint SHALL still function -- **AND** no backend changes are required for this frontend simplification - -### Requirement: Multi-Task Export Selection -The Export page SHALL allow users to select and export multiple tasks. - -#### Scenario: Select multiple tasks for export -- **WHEN** Export page loads -- **THEN** page SHALL display list of user's completed tasks -- **AND** page SHALL provide checkboxes to select multiple tasks -- **AND** page SHALL NOT require batch ID from upload store (legacy V1 behavior) - -#### Scenario: Export selected tasks -- **WHEN** user selects multiple tasks and clicks export -- **THEN** system SHALL download each selected task's results in chosen format -- **AND** downloaded files SHALL be named distinctly (e.g., `{task_id}_result.{ext}`) -- **AND** system MAY provide option to download as ZIP archive for multiple files - -### Requirement: Export Configuration Persistence -Export settings (format, thresholds, templates) SHALL apply consistently to V2 task downloads. - -#### Scenario: Apply confidence threshold to export -- **WHEN** user sets confidence threshold to 0.7 and exports -- **THEN** downloaded results SHALL only include OCR text with confidence >= 0.7 -- **AND** threshold SHALL apply via V2 download endpoint query parameters - -#### Scenario: Apply CSS template to PDF export -- **WHEN** user selects CSS template for PDF format -- **THEN** downloaded PDF SHALL use selected styling -- **AND** template SHALL be passed to V2 `/tasks/{id}/download/pdf` endpoint - -### Requirement: Enhanced PDF Export with Layout Preservation - -The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. For Direct Track, a background image rendering approach SHALL be used for visual fidelity. 
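A minimal sketch of the Direct Track approach named above, assuming PyMuPDF: each page is rasterized at 2x as a full-page background, then text (already filtered of chart-region content) is written with render mode 3 so it stays selectable but invisible:

```python
import fitz  # PyMuPDF


def build_layout_pdf(src_path: str, out_path: str,
                     texts_per_page: list[list[tuple[str, fitz.Point]]]) -> None:
    """Background image + invisible text overlay for Direct Track output."""
    src = fitz.open(src_path)
    out = fitz.open()
    for page, texts in zip(src, texts_per_page):
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))   # 2x resolution
        new_page = out.new_page(width=page.rect.width, height=page.rect.height)
        new_page.insert_image(new_page.rect, pixmap=pix)  # full-page background
        for content, origin in texts:                     # original coordinates
            new_page.insert_text(origin, content, render_mode=3)  # invisible
    out.save(out_path)
    out.close()
    src.close()
```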
- -#### Scenario: OCR Track reflow PDF uses raw OCR regions -- **WHEN** generating reflow PDF for an OCR Track document -- **THEN** the system SHALL load text content from `raw_ocr_regions.json` files -- **AND** text blocks SHALL be sorted by Y coordinate for reading order -- **AND** all text content SHALL match the Layout PDF output -- **AND** images and charts SHALL be embedded from element `saved_path` - -#### Scenario: Direct Track reflow PDF uses structured content -- **WHEN** generating reflow PDF for a Direct Track document -- **THEN** the system SHALL use `content.cells` for table rendering -- **AND** text elements SHALL use `content` string directly -- **AND** images and charts SHALL be embedded from element `saved_path` - -#### Scenario: Reflow PDF content consistency -- **WHEN** comparing Layout PDF and Reflow PDF for the same document -- **THEN** both PDFs SHALL contain the same text content -- **AND** only the presentation format SHALL differ (positioned vs flowing) - -### Requirement: Structure Data Export -The system SHALL provide export formats that preserve document structure for downstream processing. - -#### Scenario: Export structured JSON with hierarchy -- **WHEN** user selects structured JSON format -- **THEN** export SHALL include element hierarchy and relationships -- **AND** preserve parent-child relationships (sections, lists) -- **AND** include style and formatting information - -#### Scenario: Export for translation preparation -- **WHEN** user exports with translation_ready=true parameter -- **THEN** export SHALL include translatable text segments -- **AND** maintain coordinate mappings for each segment -- **AND** mark non-translatable regions - -#### Scenario: Export with layout analysis -- **WHEN** user requests layout analysis export -- **THEN** system SHALL include reading order indices -- **AND** identify layout regions (header, body, footer, sidebar) -- **AND** provide confidence scores for layout detection - -### Requirement: Translation Result JSON Export - -The system SHALL support exporting translation results as independent JSON files following a defined schema. - -#### Scenario: Export translation result JSON -- **WHEN** translation completes for a document -- **THEN** system SHALL save translation to `{filename}_translated_{lang}.json` -- **AND** file SHALL be stored alongside original `{filename}_result.json` -- **AND** original result file SHALL remain unchanged - -#### Scenario: Translation JSON schema compliance -- **WHEN** translation result is saved -- **THEN** JSON SHALL include schema_version field ("1.0.0") -- **AND** SHALL include source_document reference -- **AND** SHALL include source_lang and target_lang -- **AND** SHALL include provider identifier (e.g., "dify") -- **AND** SHALL include translated_at timestamp -- **AND** SHALL include translations dict mapping element_id to translated content - -#### Scenario: Translation statistics in export -- **WHEN** translation result is saved -- **THEN** JSON SHALL include statistics object with: - - total_elements: count of all elements in document - - translated_elements: count of successfully translated elements - - skipped_elements: count of non-translatable elements (images, charts, etc.) 
- - total_characters: character count of translated text - - processing_time_seconds: translation duration - -#### Scenario: Table cell translation in export -- **WHEN** document contains tables -- **THEN** translation JSON SHALL represent table translations as: - ```json - { - "table_1_0": { - "cells": [ - {"row": 0, "col": 0, "content": "Translated cell text"}, - {"row": 0, "col": 1, "content": "Another cell"} - ] - } - } - ``` -- **AND** row/col positions SHALL match original table structure - -#### Scenario: Download translation result via API -- **WHEN** GET request to `/api/v2/translate/{task_id}/result?lang={lang}` -- **THEN** system SHALL return translation JSON content -- **AND** Content-Type SHALL be application/json -- **AND** response SHALL include appropriate cache headers - -#### Scenario: List available translations -- **WHEN** GET request to `/api/v2/tasks/{task_id}/translations` -- **THEN** system SHALL return list of available translation languages -- **AND** include translation metadata (translated_at, provider, statistics) - -### Requirement: Translated PDF Export API - -The system SHALL expose an API endpoint for downloading translated documents as PDF files. - -#### Scenario: Download translated PDF via API -- **GIVEN** a task with completed translation to English -- **WHEN** POST request to `/api/v2/translate/{task_id}/pdf?lang=en` -- **THEN** system returns PDF file with translated content -- **AND** Content-Type is `application/pdf` -- **AND** Content-Disposition suggests filename like `{task_id}_translated_en.pdf` - -#### Scenario: Download translated PDF with layout preservation -- **WHEN** user downloads translated PDF -- **THEN** the PDF maintains original document layout -- **AND** text positions match original document coordinates -- **AND** images and tables appear at original positions - -#### Scenario: Invalid language parameter -- **GIVEN** a task with translation only to English -- **WHEN** user requests PDF with `lang=ja` (Japanese) -- **THEN** system returns 404 Not Found -- **AND** response includes available languages in error message - -#### Scenario: Task not found -- **GIVEN** non-existent task_id -- **WHEN** user requests translated PDF -- **THEN** system returns 404 Not Found - ---- - -### Requirement: Frontend Translated PDF Download - -The frontend SHALL provide UI controls for downloading translated PDFs. 
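The UI scenarios below sit on top of the export API above; outside the browser, the same endpoint can be exercised directly. A minimal sketch with `requests`, assuming JWT bearer auth and the backend port listed in the project context:

```python
import requests

BASE_URL = "http://localhost:12010"  # assumed local deployment


def download_translated_pdf(task_id: str, lang: str, token: str) -> bytes:
    """POST the translated-PDF endpoint; a 404 means no translation for lang."""
    resp = requests.post(
        f"{BASE_URL}/api/v2/translate/{task_id}/pdf",
        params={"lang": lang},
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    assert resp.headers["Content-Type"].startswith("application/pdf")
    return resp.content
```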
- -#### Scenario: Show download button when translation complete -- **GIVEN** a task with translation status "completed" -- **WHEN** user views TaskDetailPage -- **THEN** page displays "Download Translated PDF" button -- **AND** button shows target language (e.g., "Download Translated PDF (English)") - -#### Scenario: Hide download button when no translation -- **GIVEN** a task without any completed translations -- **WHEN** user views TaskDetailPage -- **THEN** "Download Translated PDF" button is not shown - -#### Scenario: Download progress indication -- **GIVEN** user clicks "Download Translated PDF" button -- **WHEN** PDF generation is in progress -- **THEN** button shows loading state -- **AND** prevents double-click -- **WHEN** download completes -- **THEN** browser downloads PDF file -- **AND** button returns to normal state - diff --git a/openspec/specs/task-management/spec.md b/openspec/specs/task-management/spec.md deleted file mode 100644 index 44f949d..0000000 --- a/openspec/specs/task-management/spec.md +++ /dev/null @@ -1,199 +0,0 @@ -# task-management Specification - -## Purpose -TBD - created by archiving change fix-v2-api-ui-issues. Update Purpose after archive. -## Requirements -### Requirement: Task Result Generation -The OCR service SHALL generate both JSON and Markdown result files for completed tasks with actual content, including processing track information and enhanced structure data. - -#### Scenario: Markdown file contains OCR results -- **WHEN** a task completes OCR processing successfully -- **THEN** the generated `.md` file SHALL contain the extracted text in markdown format -- **AND** the file size SHALL be greater than 0 bytes -- **AND** the markdown SHALL include headings, paragraphs, and formatting based on OCR layout detection - -#### Scenario: Result files stored in task directory -- **WHEN** OCR processing completes for task ID `88c6c2d2-37e1-48fd-a50f-406142987bdf` -- **THEN** result files SHALL be stored in `storage/results/88c6c2d2-37e1-48fd-a50f-406142987bdf/` -- **AND** both `_result.json` and `_result.md` SHALL exist -- **AND** both files SHALL contain valid OCR output data - -#### Scenario: Include processing track in results -- **WHEN** a task completes through dual-track processing -- **THEN** the JSON result SHALL include "processing_track" field -- **AND** SHALL indicate whether "ocr" or "direct" track was used -- **AND** SHALL include track-specific metadata (confidence for OCR, extraction quality for direct) - -#### Scenario: Store UnifiedDocument format -- **WHEN** processing completes through either track -- **THEN** system SHALL save results in UnifiedDocument format -- **AND** maintain backward-compatible JSON structure -- **AND** include enhanced structure from PP-StructureV3 or PyMuPDF - -### Requirement: Task Detail View -The frontend SHALL provide a dedicated page for viewing individual task details with processing track information, enhanced preview capabilities, and file availability status. 
- -#### Scenario: Navigate to task detail page -- **WHEN** user clicks "View Details" button on task in Task History page -- **THEN** browser SHALL navigate to `/tasks/{task_id}` -- **AND** TaskDetailPage component SHALL render - -#### Scenario: Display task information -- **WHEN** TaskDetailPage loads for a valid task ID -- **THEN** page SHALL display task metadata (filename, status, processing time, confidence) -- **AND** page SHALL show markdown preview of OCR results -- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats - -#### Scenario: Download from task detail page -- **WHEN** user clicks download button for a specific format -- **THEN** browser SHALL download the file using `/api/v2/tasks/{task_id}/download/{format}` endpoint -- **AND** downloaded file SHALL contain the task's OCR results in requested format - -#### Scenario: Display processing track information -- **WHEN** viewing task processed through dual-track system -- **THEN** page SHALL display processing track used (OCR or Direct) -- **AND** show track-specific metrics (OCR confidence or extraction quality) -- **AND** provide option to reprocess with alternate track if applicable - -#### Scenario: Preview document structure -- **WHEN** user enables structure view -- **THEN** page SHALL display document element hierarchy -- **AND** show bounding boxes overlay on preview -- **AND** highlight different element types (headers, tables, lists) with distinct colors - -#### Scenario: Display file unavailable status -- **WHEN** task has `file_deleted=True` -- **THEN** page SHALL show file unavailable indicator -- **AND** download buttons SHALL be disabled or hidden -- **AND** page SHALL display explanation that files were cleaned up - -### Requirement: Results Page V2 Migration -The Results page SHALL use V2 task-based APIs instead of V1 batch APIs. - -#### Scenario: Load task results instead of batch -- **WHEN** Results page loads with a task ID in upload store -- **THEN** page SHALL call `apiClientV2.getTask(taskId)` to fetch task details -- **AND** page SHALL NOT call any V1 batch status endpoints -- **AND** task information SHALL display correctly - -#### Scenario: Handle missing task gracefully -- **WHEN** Results page loads without a task ID -- **THEN** page SHALL display helpful message directing user to upload page -- **AND** page SHALL provide button to navigate to `/upload` - -### Requirement: Processing Track Management -The task management system SHALL track and display processing track information for all tasks. - -#### Scenario: Track processing route selection -- **WHEN** a task begins processing -- **THEN** system SHALL record the selected processing track -- **AND** log the reason for track selection -- **AND** store auto-detection confidence score - -#### Scenario: Allow track override -- **WHEN** user views a completed task -- **THEN** system SHALL offer option to reprocess with different track -- **AND** maintain both results for comparison -- **AND** track which result user prefers - -#### Scenario: Display processing metrics -- **WHEN** task completes processing -- **THEN** system SHALL record track-specific metrics -- **AND** OCR track SHALL show confidence scores and character count -- **AND** Direct track SHALL show extraction coverage and structure quality - -### Requirement: Task Processing History -The system SHALL maintain detailed processing history for tasks including track changes and reprocessing. 
- -#### Scenario: Record reprocessing attempts -- **WHEN** a task is reprocessed with different track -- **THEN** system SHALL maintain processing history -- **AND** store results from each attempt -- **AND** allow comparison between different processing attempts - -#### Scenario: Track quality improvements -- **WHEN** viewing task history -- **THEN** system SHALL show quality metrics over time -- **AND** indicate if reprocessing improved results -- **AND** suggest optimal track based on document characteristics - -#### Scenario: Export processing analytics -- **WHEN** exporting task data -- **THEN** system SHALL include processing history -- **AND** provide track selection statistics -- **AND** include performance metrics for each processing attempt - -### Requirement: Soft Delete Tasks -The system SHALL support soft deletion of tasks, marking them as deleted without removing database records to preserve usage statistics. - -#### Scenario: User soft deletes a task -- **WHEN** user calls DELETE on `/api/v2/tasks/{task_id}` -- **THEN** system SHALL set `deleted_at` timestamp on the task record -- **AND** system SHALL NOT delete the actual files -- **AND** system SHALL NOT remove the database record -- **AND** subsequent user queries SHALL NOT return this task - -#### Scenario: Preserve statistics after soft delete -- **WHEN** a task is soft deleted -- **THEN** admin statistics endpoints SHALL continue to include this task's metrics -- **AND** translation token counts SHALL remain in cumulative totals -- **AND** processing time statistics SHALL remain accurate - -### Requirement: File Cleanup Scheduler -The system SHALL automatically clean up old files while preserving database records for statistics tracking. - -#### Scenario: Scheduled file cleanup -- **WHEN** cleanup scheduler runs (configurable interval, default daily) -- **THEN** system SHALL identify tasks where files can be deleted -- **AND** system SHALL retain newest N files per user (configurable, default 50) -- **AND** system SHALL delete actual files from disk for older tasks -- **AND** system SHALL set `file_deleted=True` on cleaned tasks -- **AND** system SHALL NOT delete any database records - -#### Scenario: File retention per user -- **WHEN** user has more than `max_files_per_user` tasks with files -- **THEN** cleanup SHALL delete files for oldest tasks exceeding the limit -- **AND** cleanup SHALL preserve the newest `max_files_per_user` task files -- **AND** task ordering SHALL be by `created_at` descending - -#### Scenario: Manual cleanup trigger -- **WHEN** admin calls POST `/api/v2/admin/cleanup/trigger` -- **THEN** system SHALL immediately run the cleanup process -- **AND** return summary of files deleted and space freed - -### Requirement: Admin Task Visibility -Admin users SHALL have full visibility into all tasks including soft-deleted and file-cleaned tasks. 
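-
-Admin visibility depends on cleanup never touching database rows. The retention rule in the File Cleanup Scheduler scenarios above is easy to get wrong (ordering, off-by-one), so a minimal sketch follows; the `TaskRecord` stand-in and all names are assumptions, and only the keep-newest-N rule, the `file_deleted` flag, and the keep-the-database-row behavior come from the spec.
-
-```python
-# Hedged sketch of per-user file retention: keep the newest
-# max_files_per_user files, delete the rest from disk, flip
-# file_deleted, and never remove the database record itself.
-from dataclasses import dataclass
-from datetime import datetime
-from pathlib import Path
-
-@dataclass
-class TaskRecord:          # stand-in for the real ORM model (assumed)
-    task_id: str
-    created_at: datetime
-    file_path: Path
-    file_deleted: bool = False
-
-def cleanup_user_files(tasks: list[TaskRecord],
-                       max_files_per_user: int = 50) -> int:
-    """Return the number of files removed for this user."""
-    with_files = sorted((t for t in tasks if not t.file_deleted),
-                        key=lambda t: t.created_at, reverse=True)
-    removed = 0
-    for task in with_files[max_files_per_user:]:  # oldest beyond the limit
-        task.file_path.unlink(missing_ok=True)    # delete the file only
-        task.file_deleted = True                  # row kept for statistics
-        removed += 1
-    return removed
-```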
-
-#### Scenario: Admin lists all tasks
-- **WHEN** admin calls GET `/api/v2/admin/tasks`
-- **THEN** response SHALL include all tasks from all users
-- **AND** response SHALL include soft-deleted tasks
-- **AND** response SHALL include tasks with deleted files
-- **AND** each task SHALL indicate its deletion status
-
-#### Scenario: Filter admin task list
-- **WHEN** admin calls GET `/api/v2/admin/tasks` with filters
-- **THEN** `include_deleted=false` SHALL exclude soft-deleted tasks
-- **AND** `include_files_deleted=false` SHALL exclude file-cleaned tasks
-- **AND** `user_id={id}` SHALL filter to the specified user's tasks
-
-#### Scenario: View storage usage statistics
-- **WHEN** admin calls GET `/api/v2/admin/storage/stats`
-- **THEN** response SHALL include total storage used
-- **AND** response SHALL include per-user storage breakdown
-- **AND** response SHALL include count of tasks with/without files
-
-### Requirement: User Task Isolation
-Regular users SHALL see only their own tasks, and soft-deleted tasks SHALL be hidden from their view.
-
-#### Scenario: User lists own tasks
-- **WHEN** authenticated user calls GET `/api/v2/tasks`
-- **THEN** response SHALL only include tasks owned by that user
-- **AND** response SHALL NOT include soft-deleted tasks
-- **AND** response SHALL include tasks with deleted files (showing file unavailable status)
-
-#### Scenario: User cannot access other user's tasks
-- **WHEN** user attempts to access a task owned by another user
-- **THEN** system SHALL return 404 Not Found
-- **AND** system SHALL NOT reveal that the task exists
-
diff --git a/openspec/specs/translation/spec.md b/openspec/specs/translation/spec.md
deleted file mode 100644
index 555f233..0000000
--- a/openspec/specs/translation/spec.md
+++ /dev/null
@@ -1,304 +0,0 @@
-# translation Specification
-
-## Purpose
-TBD - created by archiving change add-document-translation. Update Purpose after archive.
-## Requirements
-### Requirement: Document Translation Service
-
-The system SHALL provide a document translation service that translates extracted text from OCR-processed documents into target languages using the DIFY AI API.
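-
-Every scenario below hinges on one extraction step. A hedged sketch, assuming a dict-shaped element model; the translatable type list and the per-cell handling come from the scenarios, everything else is an assumption.
-
-```python
-# Hedged sketch of extracting translatable content keyed by element_id.
-TRANSLATABLE_TYPES = {"text", "title", "header", "footer",
-                      "paragraph", "footnote"}
-
-def extract_translatable(elements: list[dict]) -> dict[str, str]:
-    """Map element_id -> source text for everything to be translated."""
-    out: dict[str, str] = {}
-    for el in elements:
-        if el["type"] in TRANSLATABLE_TYPES:
-            out[el["element_id"]] = el["content"]
-        elif el["type"] == "table":
-            # Table cells are translated individually, preserving
-            # row/col (see the table cell scenario below).
-            for cell in el.get("cells", []):
-                key = f'{el["element_id"]}_{cell["row"]}_{cell["col"]}'
-                out[key] = cell["content"]
-    return out
-```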
- -#### Scenario: Successful translation of Direct track document -- **GIVEN** a completed OCR task with Direct track processing -- **WHEN** user requests translation to English -- **THEN** the system extracts all translatable elements (text, title, header, footer, paragraph, footnote, table cells) -- **AND** translates them using DIFY AI API -- **AND** saves the result to `{task_id}_translated_en.json` - -#### Scenario: Successful translation of OCR track document -- **GIVEN** a completed OCR task with OCR track processing -- **WHEN** user requests translation to Japanese -- **THEN** the system extracts all translatable elements from UnifiedDocument format -- **AND** translates them preserving element_id mapping -- **AND** saves the result to `{task_id}_translated_ja.json` - -#### Scenario: Successful translation of Hybrid track document -- **GIVEN** a completed OCR task with Hybrid track processing -- **WHEN** translation is requested -- **THEN** the system processes the document using the same unified logic -- **AND** handles any combination of element types present - -#### Scenario: Table cell translation -- **GIVEN** a document containing table elements -- **WHEN** translation is requested -- **THEN** the system extracts text from each table cell -- **AND** translates each cell content individually -- **AND** preserves row/col position in the translation result - ---- - -### Requirement: Translation API Endpoints - -The system SHALL expose REST API endpoints for translation operations. - -#### Scenario: Start translation request -- **GIVEN** a completed OCR task with task_id -- **WHEN** POST request to `/api/v2/translate/{task_id}` with target_lang parameter -- **THEN** the system starts background translation process -- **AND** returns translation job status with 202 Accepted - -#### Scenario: Query translation status -- **GIVEN** an active translation job -- **WHEN** GET request to `/api/v2/translate/{task_id}/status` -- **THEN** the system returns current status (pending, translating, completed, failed) -- **AND** includes progress information (current_element, total_elements) - -#### Scenario: Retrieve translation result -- **GIVEN** a completed translation job -- **WHEN** GET request to `/api/v2/translate/{task_id}/result?lang={target_lang}` -- **THEN** the system returns the translation JSON content - -#### Scenario: Translation for non-existent task -- **GIVEN** an invalid or non-existent task_id -- **WHEN** translation is requested -- **THEN** the system returns 404 Not Found error - ---- - -### Requirement: DIFY API Integration - -The system SHALL integrate with DIFY AI service for translation. 
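-
-A sketch of the client call this requirement implies, following the scenarios below: POST to `/chat-messages`, blocking response mode, a user identifier, and up to three retries with exponential backoff. The `requests` dependency, the timeout, and the configuration plumbing are assumptions.
-
-```python
-# Hedged sketch of the DIFY translation call with retry/backoff.
-import time
-import requests
-
-def dify_translate(base_url: str, api_key: str, prompt: str,
-                   user: str, retries: int = 3) -> str:
-    payload = {"query": prompt, "inputs": {},
-               "response_mode": "blocking", "user": user}
-    headers = {"Authorization": f"Bearer {api_key}"}
-    for attempt in range(retries):
-        try:
-            resp = requests.post(f"{base_url}/chat-messages",
-                                 json=payload, headers=headers, timeout=60)
-            resp.raise_for_status()
-            return resp.json()["answer"]  # translated text per the spec
-        except requests.RequestException:
-            if attempt == retries - 1:
-                raise                     # all retries exhausted
-            time.sleep(2 ** attempt)      # exponential backoff
-    raise RuntimeError("unreachable")
-```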
- -#### Scenario: API request format -- **GIVEN** text to be translated -- **WHEN** calling DIFY API -- **THEN** the system sends POST request to `/chat-messages` endpoint -- **AND** includes query with translation prompt -- **AND** uses blocking response mode -- **AND** includes user identifier for tracking - -#### Scenario: API response handling -- **GIVEN** DIFY API returns translation response -- **WHEN** parsing the response -- **THEN** the system extracts translated text from `answer` field -- **AND** records usage statistics (tokens, latency) - -#### Scenario: API error handling -- **GIVEN** DIFY API returns error or times out -- **WHEN** handling the error -- **THEN** the system retries up to 3 times with exponential backoff -- **AND** returns appropriate error message if all retries fail - -#### Scenario: API rate limiting -- **GIVEN** high volume of translation requests -- **WHEN** requests approach rate limits -- **THEN** the system queues requests appropriately -- **AND** provides feedback about wait times - ---- - -### Requirement: Translation Prompt Format - -The system SHALL use structured prompts for translation requests. - -#### Scenario: Generate translation prompt -- **GIVEN** source text to translate -- **WHEN** preparing DIFY API request -- **THEN** the system formats prompt as: - ``` - Translate the following text to {language}. - Return ONLY the translated text, no explanations. - - {text} - ``` - -#### Scenario: Language name mapping -- **GIVEN** language code like "zh-TW" or "ja" -- **WHEN** constructing translation prompt -- **THEN** the system maps to full language name (Traditional Chinese, Japanese) - ---- - -### Requirement: Translation Progress Reporting - -The system SHALL provide real-time progress feedback during translation. - -#### Scenario: Progress during multi-element translation -- **GIVEN** a document with 50 translatable elements -- **WHEN** user queries status -- **THEN** the system returns progress like `{"status": "translating", "current_element": 25, "total_elements": 50}` - -#### Scenario: Translation starting status -- **GIVEN** translation job just started -- **WHEN** user queries status -- **THEN** the system returns `{"status": "pending"}` - ---- - -### Requirement: Translation Result Storage - -The system SHALL store translation results as independent JSON files. - -#### Scenario: Save translation result -- **GIVEN** translation completes successfully -- **WHEN** saving results -- **THEN** the system creates `{original_filename}_translated_{lang}.json` -- **AND** includes schema_version, metadata, and translations dict - -#### Scenario: Multiple language translations -- **GIVEN** a document translated to English and Japanese -- **WHEN** checking result files -- **THEN** both `xxx_translated_en.json` and `xxx_translated_ja.json` exist -- **AND** original `xxx_result.json` is unchanged - ---- - -### Requirement: Language Support - -The system SHALL support common languages through DIFY AI service. 
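-
-A sketch combining the prompt template above with the code-to-name mapping this requirement needs; the mapping shown is a subset of the list in the next scenario, and the fallback behavior for unmapped codes is an assumption.
-
-```python
-# Hedged sketch: language-code mapping plus the documented prompt template.
-LANGUAGE_NAMES = {
-    "en": "English", "zh-TW": "Traditional Chinese",
-    "zh-CN": "Simplified Chinese", "ja": "Japanese", "ko": "Korean",
-}
-
-def build_prompt(text: str, target_lang: str) -> str:
-    # Fall back to the raw code for unmapped languages (assumption).
-    language = LANGUAGE_NAMES.get(target_lang, target_lang)
-    return (f"Translate the following text to {language}.\n"
-            "Return ONLY the translated text, no explanations.\n"
-            "\n"
-            f"{text}")
-```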
- -#### Scenario: Common language translation -- **GIVEN** target language is English, Chinese, Japanese, or Korean -- **WHEN** translation is requested -- **THEN** the system includes appropriate language name in prompt -- **AND** executes translation successfully - -#### Scenario: Automatic source language detection -- **GIVEN** source_lang is set to "auto" -- **WHEN** translation is executed -- **THEN** the AI model automatically detects source language -- **AND** translates to target language - -#### Scenario: Supported languages list -- **GIVEN** user queries supported languages -- **WHEN** checking language support -- **THEN** the system provides list including: - - English (en) - - Traditional Chinese (zh-TW) - - Simplified Chinese (zh-CN) - - Japanese (ja) - - Korean (ko) - - German (de) - - French (fr) - - Spanish (es) - - Portuguese (pt) - - Italian (it) - - Russian (ru) - - Vietnamese (vi) - - Thai (th) - -### Requirement: Translated PDF Generation - -The system SHALL support generating PDF files with translated content while preserving the original document layout. - -#### Scenario: Generate translated PDF from Direct track document -- **GIVEN** a completed translation for a Direct track processed document -- **WHEN** user requests translated PDF via `POST /api/v2/translate/{task_id}/pdf?lang={target_lang}` -- **THEN** the system loads the translation JSON file -- **AND** merges translations with UnifiedDocument by element_id -- **AND** generates PDF with translated text at original positions -- **AND** returns PDF file with Content-Type `application/pdf` - -#### Scenario: Generate translated PDF from OCR track document -- **GIVEN** a completed translation for an OCR track processed document -- **WHEN** user requests translated PDF -- **THEN** the system generates PDF preserving all OCR layout information -- **AND** replaces original text with translated content -- **AND** maintains table structure with translated cell content - -#### Scenario: Handle missing translations gracefully -- **GIVEN** a translation JSON missing some element_id entries -- **WHEN** generating translated PDF -- **THEN** the system uses original content for missing translations -- **AND** logs warning for each fallback -- **AND** completes PDF generation successfully - -#### Scenario: Translated PDF for incomplete translation -- **GIVEN** a task with translation status "pending" or "translating" -- **WHEN** user requests translated PDF -- **THEN** the system returns 400 Bad Request -- **AND** includes error message indicating translation not complete - -#### Scenario: Translated PDF for non-existent translation -- **GIVEN** a task that has not been translated to requested language -- **WHEN** user requests translated PDF with `lang=fr` -- **THEN** the system returns 404 Not Found -- **AND** includes error message indicating no translation for language - ---- - -### Requirement: Translation Merge Service - -The system SHALL provide a service to merge translation data with UnifiedDocument. - -#### Scenario: Merge text element translations -- **GIVEN** a UnifiedDocument with text elements -- **AND** a translation JSON with matching element_ids -- **WHEN** applying translations -- **THEN** the system replaces content field for each matched element -- **AND** preserves all other element properties (bounding_box, style_info, etc.) 
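-
-A sketch of the element-level merge just described; dict-shaped documents are an assumption (a real implementation would work on the UnifiedDocument model), and the table-cell case follows in the next scenario.
-
-```python
-# Hedged sketch of a non-destructive merge keyed by element_id;
-# missing translations fall back to the original content with a
-# logged warning, per the fallback scenario above.
-import copy
-import logging
-
-logger = logging.getLogger(__name__)
-
-def apply_translations(document: dict, translations: dict[str, str]) -> dict:
-    merged = copy.deepcopy(document)   # the original stays unchanged
-    for el in merged.get("elements", []):
-        translated = translations.get(el["element_id"])
-        if translated is None:
-            logger.warning("no translation for %s; keeping original",
-                           el["element_id"])
-            continue
-        el["content"] = translated     # bounding_box, style_info untouched
-    return merged
-```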
-
-#### Scenario: Merge table cell translations
-- **GIVEN** a UnifiedDocument containing table elements
-- **AND** a translation JSON with table_cell translations like:
-  ```json
-  {
-    "table_1_0": {
-      "cells": [{"row": 0, "col": 0, "content": "Translated"}]
-    }
-  }
-  ```
-- **WHEN** applying translations
-- **THEN** the system updates cell content at matching row/col positions
-- **AND** preserves cell structure and styling
-
-#### Scenario: Non-destructive merge operation
-- **GIVEN** a UnifiedDocument
-- **WHEN** applying translations
-- **THEN** the system creates a modified copy
-- **AND** the original UnifiedDocument remains unchanged
-
-### Requirement: Translation Output as Reflow PDF
-
-The system SHALL generate translated documents as reflow-layout PDFs with real visible text, separate from the Layout PDF, which uses background images.
-
-#### Scenario: Generate translated PDF with reflow layout
-- **WHEN** translation is completed for a document
-- **THEN** the system SHALL generate a new PDF with translated text
-- **AND** the translated PDF SHALL use reflow layout (not a background image)
-- **AND** text SHALL be real visible text, not an invisible overlay
-- **AND** page breaks SHALL correspond to original document pages
-
-#### Scenario: Maintain page correspondence in translated output
-- **WHEN** generating translated PDF
-- **THEN** content from original page 1 SHALL appear in translated page 1
-- **AND** content from original page 2 SHALL appear in translated page 2
-- **AND** each page may differ in content length but SHALL maintain page boundaries
-
-#### Scenario: Chart text excluded from translation
-- **WHEN** extracting text for translation from Direct Track documents
-- **THEN** text elements within chart regions SHALL NOT be included
-- **AND** chart labels, axis text, and legends SHALL remain untranslated
-- **AND** this is expected behavior and SHALL be documented for users
-
-### Requirement: Dual PDF Output Concept
-
-The system SHALL maintain clear separation between the Layout PDF (preview) and the Translated PDF (output).
-
-#### Scenario: Layout PDF for preview
-- **WHEN** user views a processed document before translation
-- **THEN** the Layout PDF SHALL be displayed
-- **AND** Layout PDF preserves the exact visual appearance of the source
-- **AND** text is an invisible overlay for extraction purposes only
-
-#### Scenario: Translated PDF for final output
-- **WHEN** user requests translated document
-- **THEN** the Translated PDF SHALL be generated
-- **AND** Translated PDF uses reflow layout with visible translated text
-- **AND** original visual styling is not preserved (text-focused output)
-
-#### Scenario: Both PDFs available after translation
-- **WHEN** translation is completed
-- **THEN** both Layout PDF and Translated PDF SHALL be available for download
-- **AND** user can choose which version to download
-- **AND** Layout PDF remains unchanged after translation
-
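-Finally, a hedged sketch of the guard logic from the translated-PDF endpoint scenarios above. FastAPI is assumed from the backend layout; the PDF file name is invented for illustration, and using file existence as the completion check is a simplification of the status-based rule in the spec.
-
-```python
-# Hedged sketch: 404 when no translation exists for the language,
-# 400 when translation is not complete, otherwise stream the PDF.
-from pathlib import Path
-from fastapi import APIRouter, HTTPException
-from fastapi.responses import FileResponse
-
-router = APIRouter(prefix="/api/v2/translate")
-RESULTS_DIR = Path("storage/results")  # assumed default location
-
-@router.post("/{task_id}/pdf")
-def translated_pdf(task_id: str, lang: str) -> FileResponse:
-    task_dir = RESULTS_DIR / task_id
-    translation = next(task_dir.glob(f"*_translated_{lang}.json"), None)
-    if translation is None:
-        raise HTTPException(404, f"no translation for language '{lang}'")
-    pdf = task_dir / f"{task_id}_translated_{lang}.pdf"  # assumed name
-    if not pdf.exists():
-        # A real implementation would check translation status here and
-        # build the PDF from the merged UnifiedDocument.
-        raise HTTPException(400, "translation not complete")
-    return FileResponse(pdf, media_type="application/pdf",
-                        filename=pdf.name)
-```
-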