chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions
--- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/export-results/spec.md
+++ b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/export-results/spec.md
@@ -0,0 +1,175 @@
+# Export Results Specification
+
+## ADDED Requirements
+
+### Requirement: Plain Text Export
+The system SHALL export OCR results as plain text files with configurable formatting.
+
+#### Scenario: Export single file result as TXT
+- **WHEN** user selects a completed OCR task and chooses TXT export
+- **THEN** the system generates a .txt file with extracted text
+- **AND** preserves line breaks based on bounding box positions
+- **AND** returns downloadable file
+
+#### Scenario: Export batch results as TXT
+- **WHEN** user exports a batch with 5 files as TXT
+- **THEN** the system creates a ZIP file containing 5 .txt files
+- **AND** names each file as `{original_filename}_ocr.txt`
+- **AND** returns the ZIP for download
+
+### Requirement: JSON Export
+The system SHALL export OCR results as structured JSON with full metadata.
+
+#### Scenario: Export with metadata
+- **WHEN** user selects JSON export format
+- **THEN** the system generates JSON containing:
+  - File information (name, size, format)
+  - OCR results array with text, bounding boxes, confidence
+  - Processing metadata (timestamp, language, model version)
+  - Task status and statistics
+
+#### Scenario: JSON export example structure
+- **WHEN** export is generated
+- **THEN** JSON structure follows this format:
+```json
+{
+  "file_name": "document.png",
+  "file_size": 1024000,
+  "upload_time": "2025-01-01T10:00:00Z",
+  "processing_time": 2.5,
+  "language": "zh-TW",
+  "results": [
+    {
+      "text": "範例文字",
+      "bbox": [100, 50, 200, 80],
+      "confidence": 0.95
+    }
+  ],
+  "status": "completed"
+}
+```
+
+### Requirement: Excel Export
+The system SHALL export OCR results as Excel spreadsheets with tabular format.
+
+#### Scenario: Single file Excel export
+- **WHEN** user selects Excel export for one file
+- **THEN** the system generates .xlsx file with columns:
+  - Row Number
+  - Recognized Text
+  - Confidence Score
+  - Bounding Box (X, Y, Width, Height)
+  - Language
+
+#### Scenario: Batch Excel export with multiple sheets
+- **WHEN** user exports batch with 3 files as Excel
+- **THEN** the system creates one .xlsx file with 3 sheets
+- **AND** names each sheet as the original filename
+- **AND** includes summary sheet with statistics
+
+### Requirement: Rule-Based Output Formatting
+The system SHALL apply user-defined rules to format exported text.
+
+#### Scenario: Group by filename pattern
+- **WHEN** user defines rule "group files with prefix 'invoice_'"
+- **THEN** the system groups all matching files together
+- **AND** exports them in a single combined file or folder
+
+#### Scenario: Filter by confidence threshold
+- **WHEN** user sets export rule "minimum confidence 0.8"
+- **THEN** the system excludes text with confidence < 0.8 from export
+- **AND** includes only high-confidence results
+
+#### Scenario: Custom text formatting
+- **WHEN** user defines rule "add line numbers"
+- **THEN** the system prepends line numbers to each text line
+- **AND** formats output as: `1. 第一行文字\n2. 第二行文字`
+
+#### Scenario: Sort by reading order
+- **WHEN** user enables "sort by position" rule
+- **THEN** the system orders text by vertical position (top to bottom)
+- **AND** then by horizontal position (left to right) within each row
+- **AND** exports text in natural reading order
+
+### Requirement: Export Rule Configuration
+The system SHALL allow users to save and reuse export rules.
+
+#### Scenario: Save custom export rule
+- **WHEN** user creates a rule with name "高品質發票輸出"
+- **THEN** the system saves the rule to database
+- **AND** associates it with the user account
+- **AND** makes it available in rule selection dropdown
+
+#### Scenario: Apply saved rule
+- **WHEN** user selects a saved rule for export
+- **THEN** the system applies all configured filters and formatting
+- **AND** generates output according to rule settings
+
+#### Scenario: Edit existing rule
+- **WHEN** user modifies a saved rule
+- **THEN** the system updates the rule configuration
+- **AND** preserves the rule ID for continuity
+
+### Requirement: Markdown Export with Structure and Images
+The system SHALL export OCR results as Markdown files preserving document logical structure with accompanying images.
+
+#### Scenario: Export as Markdown with structure and images
+- **WHEN** user selects Markdown export format
+- **THEN** the system generates .md file with logical structure
+- **AND** includes headings, paragraphs, tables, lists in proper hierarchy
+- **AND** embeds image references pointing to extracted images (![](./images/img1.jpg))
+- **AND** maintains reading order from OCR analysis
+- **AND** includes extracted images in an images/ folder
+
+#### Scenario: Batch Markdown export with images
+- **WHEN** user exports batch with 5 files as Markdown
+- **THEN** the system creates 5 separate .md files
+- **AND** creates corresponding images/ folders for each document
+- **AND** optionally creates combined .md with page separators
+- **AND** returns ZIP file containing all Markdown files and images
+
+### Requirement: Searchable PDF Export with Images
+The system SHALL generate searchable PDF files that include extracted text and images, preserving logical document structure (not exact visual layout).
+
+#### Scenario: Single document PDF export with images
+- **WHEN** user requests PDF export from OCR result
+- **THEN** the system converts Markdown to HTML with basic CSS styling
+- **AND** embeds extracted images from images/ folder
+- **AND** generates PDF using Pandoc + WeasyPrint
+- **AND** preserves document hierarchy, tables, and reading order
+- **AND** images appear near their logical position in text flow
+- **AND** uses appropriate Chinese font (Noto Sans CJK)
+- **AND** produces searchable PDF with selectable text
+
+#### Scenario: Basic PDF formatting options
+- **WHEN** user selects PDF export
+- **THEN** the system applies basic readable formatting
+- **AND** sets standard margins and page size (A4)
+- **AND** uses consistent fonts and spacing
+- **AND** ensures images fit within page width
+- **NOTE** CSS templates are for basic readability, not for replicating original visual design
+
+#### Scenario: Batch PDF export with images
+- **WHEN** user exports batch as PDF
+- **THEN** the system generates individual PDF for each document with embedded images
+- **OR** creates single merged PDF with page breaks
+- **AND** maintains consistent formatting across all pages
+- **AND** returns ZIP of PDFs or single merged PDF
+
+### Requirement: Export Format Selection
+The system SHALL provide UI for selecting export format and options.
+
+#### Scenario: Format selection with preview
+- **WHEN** user opens export dialog
+- **THEN** the system displays format options (TXT, JSON, Excel, **Markdown with images, Searchable PDF**)
+- **AND** shows preview of output structure for selected format
+- **AND** allows applying custom rules for text filtering
+- **AND** provides basic formatting option for PDF (standard readable format)
+
+#### Scenario: Batch export with format choice
+- **WHEN** user selects multiple completed tasks
+- **THEN** the system enables batch export button
+- **AND** prompts for format selection
+- **AND** generates combined export file
+- **AND** shows progress bar for PDF generation (slower due to image processing)
+- **AND** includes all extracted images when exporting Markdown or PDF
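
The "Rule-Based Output Formatting" scenarios in the spec above (confidence filter, sort by position, line numbers) could be combined roughly as follows. This is a minimal sketch, not the project's implementation: `TextRegion`, `apply_export_rules`, and the `row_tolerance` parameter are hypothetical names, and the bounding box is assumed to be `(x, y, width, height)` as listed in the Excel export columns.

```python
from dataclasses import dataclass


@dataclass
class TextRegion:
    """One OCR result entry (hypothetical type; fields mirror the JSON export example)."""
    text: str
    bbox: tuple[int, int, int, int]  # assumed to be (x, y, width, height)
    confidence: float


def apply_export_rules(regions, min_confidence=0.8, row_tolerance=10, line_numbers=True):
    """Filter by confidence, sort into reading order, and optionally number the lines."""
    # Spec: exclude text with confidence < threshold, so keep confidence >= threshold.
    kept = [r for r in regions if r.confidence >= min_confidence]
    # Reading order: top to bottom, then left to right within a row.
    # Top edges are bucketed into fixed bands of row_tolerance pixels,
    # which is a crude row grouping chosen for this sketch.
    kept.sort(key=lambda r: (r.bbox[1] // row_tolerance, r.bbox[0]))
    lines = [r.text for r in kept]
    if line_numbers:
        lines = [f"{i}. {text}" for i, text in enumerate(lines, start=1)]
    return "\n".join(lines)


sample = [
    TextRegion("B", (200, 52, 100, 30), 0.90),
    TextRegion("A", (100, 50, 90, 30), 0.95),
    TextRegion("noise", (100, 10, 90, 20), 0.50),  # dropped by the 0.8 threshold
    TextRegion("C", (100, 100, 100, 30), 0.85),
]
output = apply_export_rules(sample)
```

Fixed-band bucketing can split two regions on the same visual row if their top edges straddle a band boundary; a real implementation would likely cluster rows by overlap instead.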
--- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/file-management/spec.md
+++ b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/file-management/spec.md
@@ -0,0 +1,96 @@
+# File Management Specification
+
+## ADDED Requirements
+
+### Requirement: File Upload Validation
+The system SHALL validate uploaded files for type, size, and content before processing.
+
+#### Scenario: Valid image upload
+- **WHEN** user uploads a PNG file of 5MB
+- **THEN** the system accepts the file
+- **AND** stores it in temporary upload directory
+- **AND** returns upload success with file ID
+
+#### Scenario: Oversized file rejection
+- **WHEN** user uploads a file larger than 20MB
+- **THEN** the system rejects the file
+- **AND** returns error message "文件大小超過限制 (最大 20MB)" (English: "File size exceeds the limit (max 20MB)")
+- **AND** does not store the file
+
+#### Scenario: Invalid file type rejection
+- **WHEN** user uploads a .exe or .zip file
+- **THEN** the system rejects the file
+- **AND** returns error message "不支援的文件類型,僅支援 PNG, JPG, JPEG, PDF" (English: "Unsupported file type; only PNG, JPG, JPEG, PDF are supported")
+
+#### Scenario: Corrupted image detection
+- **WHEN** user uploads a corrupted image file
+- **THEN** the system attempts to open the file
+- **AND** detects corruption during validation
+- **AND** returns error message "文件損壞,無法處理" (English: "File is corrupted and cannot be processed")
+
+### Requirement: Supported File Formats
+The system SHALL support PNG, JPG, JPEG, and PDF file formats for OCR processing.
+
+#### Scenario: PNG image processing
+- **WHEN** user uploads a .png file
+- **THEN** the system processes it directly with PaddleOCR
+
+#### Scenario: JPG/JPEG image processing
+- **WHEN** user uploads a .jpg or .jpeg file
+- **THEN** the system processes it directly with PaddleOCR
+
+#### Scenario: PDF file processing
+- **WHEN** user uploads a .pdf file
+- **THEN** the system converts PDF pages to images using pdf2image
+- **AND** processes each page image with PaddleOCR
+
+### Requirement: Batch Upload Management
+The system SHALL manage multiple file uploads with batch organization.
+
+#### Scenario: Create batch from multiple files
+- **WHEN** user uploads 5 files in a single request
+- **THEN** the system creates a batch with unique batch_id
+- **AND** associates all files with the batch_id
+- **AND** returns batch_id and file list
+
+#### Scenario: Query batch status
+- **WHEN** user requests batch status by batch_id
+- **THEN** the system returns:
+  - Total files in batch
+  - Completed count
+  - Failed count
+  - Processing count
+  - Overall batch status (pending/processing/completed/failed)
+
+### Requirement: File Storage Management
+The system SHALL store uploaded files temporarily and clean up after processing.
+
+#### Scenario: Temporary file storage
+- **WHEN** user uploads files
+- **THEN** the system stores files in `uploads/{batch_id}/` directory
+- **AND** generates unique filenames to prevent conflicts
+
+#### Scenario: Automatic cleanup after processing
+- **WHEN** OCR processing completes for a batch
+- **THEN** the system keeps files for 24 hours
+- **AND** automatically deletes files after retention period
+- **AND** preserves OCR results in database
+
+#### Scenario: Manual file deletion
+- **WHEN** user requests to delete a batch
+- **THEN** the system removes all associated files from storage
+- **AND** marks the batch as deleted in database
+- **AND** returns deletion confirmation
+
+### Requirement: File Access Control
+The system SHALL ensure users can only access their own uploaded files.
+
+#### Scenario: User accesses own files
+- **WHEN** authenticated user requests file by file_id
+- **THEN** the system verifies ownership
+- **AND** returns file if user is the owner
+
+#### Scenario: User attempts to access others' files
+- **WHEN** user requests file_id belonging to another user
+- **THEN** the system denies access
+- **AND** returns 403 Forbidden error
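
The type and size checks from the "File Upload Validation" requirement above might look like the sketch below. The function name `validate_upload`, the check order, and the reading of 20MB as 20 × 1024 × 1024 bytes are assumptions; the error messages are copied from the spec. Corruption detection is left out because it needs an image library.

```python
from pathlib import Path

# 20MB limit from the spec; treating it as binary megabytes is an assumption.
MAX_UPLOAD_BYTES = 20 * 1024 * 1024
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}


def validate_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (accepted, error_message); the message is empty when accepted."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        return False, "不支援的文件類型,僅支援 PNG, JPG, JPEG, PDF"
    if size_bytes > MAX_UPLOAD_BYTES:
        return False, "文件大小超過限制 (最大 20MB)"
    return True, ""


accepted = validate_upload("scan.PNG", 5 * 1024 * 1024)
oversized = validate_upload("scan.pdf", MAX_UPLOAD_BYTES + 1)
wrong_type = validate_upload("setup.exe", 1024)
```

Checking the extension alone does not prove the content matches it, which is why the spec also calls for opening the file during validation.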
--- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/ocr-processing/spec.md
@@ -0,0 +1,125 @@
+# OCR Processing Specification
+
+## ADDED Requirements
+
+### Requirement: Multi-Language Text Recognition with Structure Analysis
+The system SHALL extract text and images from document files using PaddleOCR-VL with support for 109 languages including Chinese (traditional and simplified), English, Japanese, and Korean, while preserving document logical structure and reading order (not pixel-perfect visual layout).
+
+#### Scenario: Single image OCR with Chinese text
+- **WHEN** user uploads a PNG image containing Chinese text
+- **THEN** the system extracts text with bounding boxes and confidence scores
+- **AND** returns structured JSON with recognized text, coordinates, and language detected
+- **AND** generates Markdown output preserving text layout and hierarchy
+
+#### Scenario: PDF document OCR with layout preservation
+- **WHEN** user uploads a multi-page PDF file
+- **THEN** the system processes each page with PaddleOCR-VL
+- **AND** performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas)
+- **AND** returns Markdown organized by page with preserved reading order
+- **AND** provides JSON with detailed layout structure and bounding boxes
+
+#### Scenario: Mixed language content
+- **WHEN** user uploads an image with both Chinese and English text
+- **THEN** the system detects and extracts text in both languages
+- **AND** preserves the spatial relationship between text regions
+- **AND** maintains proper reading order in output Markdown
+
+#### Scenario: Complex document with tables and images
+- **WHEN** user uploads a scanned document containing tables, images, and text
+- **THEN** the system identifies layout elements (text blocks, tables, images, formulas)
+- **AND** extracts table structure as Markdown tables
+- **AND** extracts and saves document images as separate files
+- **AND** embeds image references in Markdown (![](path/to/image.jpg))
+- **AND** preserves document hierarchy and reading order in Markdown output
+
+### Requirement: Batch Processing
+The system SHALL process multiple files concurrently with progress tracking and error handling.
+
+#### Scenario: Batch upload success
+- **WHEN** user uploads 10 image files simultaneously
+- **THEN** the system creates a batch task with unique batch ID
+- **AND** processes files in parallel (up to configured worker limit)
+- **AND** returns real-time progress updates via WebSocket or polling
+
+#### Scenario: Batch processing with partial failure
+- **WHEN** a batch contains 5 valid images and 2 corrupted files
+- **THEN** the system processes all valid files successfully
+- **AND** logs errors for corrupted files with specific error messages
+- **AND** marks the batch as "partially completed"
+
+### Requirement: Image Preprocessing
+The system SHALL provide optional image preprocessing to improve OCR accuracy.
+
+#### Scenario: Low contrast image enhancement
+- **WHEN** user enables preprocessing for a low-contrast image
+- **THEN** the system applies contrast adjustment and denoising
+- **AND** performs OCR on the enhanced image
+- **AND** returns results with higher accuracy than OCR on the original image
+
+#### Scenario: Skipped preprocessing
+- **WHEN** user disables preprocessing option
+- **THEN** the system performs OCR directly on original image
+- **AND** completes processing faster
+
+### Requirement: Confidence Threshold Filtering
+The system SHALL filter OCR results based on configurable confidence threshold.
+
+#### Scenario: High confidence filter
+- **WHEN** user sets confidence threshold to 0.8
+- **THEN** the system returns only text segments with confidence >= 0.8
+- **AND** discards low-confidence results
+
+#### Scenario: Include all results
+- **WHEN** user sets confidence threshold to 0.0
+- **THEN** the system returns all recognized text regardless of confidence
+- **AND** includes confidence scores in output
+
+### Requirement: OCR Result Structure
+The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata.
+
+#### Scenario: Successful OCR result with multiple formats
+- **WHEN** OCR processing completes successfully
+- **THEN** the system returns JSON containing:
+  - File metadata (name, size, format, upload timestamp)
+  - Detected text regions with bounding boxes (x, y, width, height)
+  - Recognized text content for each region
+  - Confidence scores (0.0 to 1.0)
+  - Language detected
+  - Layout element types (title, paragraph, table, image, formula)
+  - Reading order sequence
+  - List of extracted image files with paths
+  - Processing time
+  - Task status (completed/failed/partial)
+- **AND** generates Markdown file with logical structure
+- **AND** saves extracted images to storage directory
+- **AND** provides methods to export as searchable PDF with images
+
+#### Scenario: Searchable PDF generation with images
+- **WHEN** user requests PDF export from OCR results
+- **THEN** the system converts Markdown to HTML with basic CSS styling
+- **AND** embeds extracted images in their logical positions (not exact original positions)
+- **AND** generates PDF using Pandoc + WeasyPrint
+- **AND** preserves document hierarchy, tables, and reading order
+- **AND** applies appropriate fonts for Chinese characters
+- **AND** produces searchable PDF (text is selectable and searchable)
+
+### Requirement: Document Translation (Reserved Architecture)
+The system SHALL provide architecture and UI placeholders for future document translation features.
+
+#### Scenario: Translation option visibility (UI placeholder)
+- **WHEN** user views OCR result page
+- **THEN** the system displays a "Translate Document" button (disabled or labeled "Coming Soon")
+- **AND** shows target language selection dropdown (disabled)
+- **AND** provides tooltip: "Translation feature will be available in future release"
+
+#### Scenario: Translation API endpoint (reserved)
+- **WHEN** backend API is queried for translation endpoints
+- **THEN** the system provides `/api/v1/translate/document` endpoint specification
+- **AND** returns "Not Implemented" (501) status when called
+- **AND** documents expected request/response format for future implementation
+
+#### Scenario: Translation configuration storage (database schema)
+- **WHEN** database schema is created
+- **THEN** the system includes `translation_configs` table
+- **AND** defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at
+- **AND** table remains empty until translation feature is implemented
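
The "Batch Processing" requirement above (parallel processing up to a worker limit, with per-file error logging and a "partially completed" status) can be sketched with the standard library. `run_batch` and `fake_ocr` are hypothetical names; `fake_ocr` stands in for the real OCR call, and the rule for the all-failed case is an assumption, since the spec only names the partial case.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(paths, ocr_fn, max_workers=4):
    """Run ocr_fn over paths in parallel and collect results and per-file errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(ocr_fn, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file must not abort the batch
                errors[path] = str(exc)
    if not errors:
        status = "completed"
    elif results:
        status = "partially completed"
    else:
        status = "failed"  # assumed: the spec does not define this case
    return {"status": status, "results": results, "errors": errors}


def fake_ocr(path):
    """Stand-in for the real OCR call; fails on names containing 'bad'."""
    if "bad" in path:
        raise ValueError(f"cannot read {path}")
    return f"text from {path}"


batch = [f"page{i}.png" for i in range(5)] + ["bad1.png", "bad2.png"]
report = run_batch(batch, fake_ocr, max_workers=3)
```

Threads are used here only to keep the sketch self-contained; a CPU-bound or GPU-bound OCR engine would more likely sit behind a process pool or a task queue.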