chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
- Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
- UnifiedDocument model for consistent output
- Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions
--- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/export-results/spec.md
+++ b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/export-results/spec.md
@@ -0,0 +1,175 @@
+# Export Results Specification
+
+## ADDED Requirements
+
+### Requirement: Plain Text Export
+The system SHALL export OCR results as plain text files with configurable formatting.
+
+#### Scenario: Export single file result as TXT
+- **WHEN** user selects a completed OCR task and chooses TXT export
+- **THEN** the system generates a .txt file with extracted text
+- **AND** preserves line breaks based on bounding box positions
+- **AND** returns a downloadable file
+
+#### Scenario: Export batch results as TXT
+- **WHEN** user exports a batch with 5 files as TXT
+- **THEN** the system creates a ZIP file containing 5 .txt files
+- **AND** names each file as `{original_filename}_ocr.txt`
+- **AND** returns the ZIP for download
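
The batch TXT scenario above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `build_txt_zip` is hypothetical, and it assumes `{original_filename}` means the upload name without its extension, which the spec does not state.

```python
import io
import zipfile
from pathlib import Path
from typing import Dict


def build_txt_zip(results: Dict[str, str]) -> bytes:
    """Pack OCR text results into an in-memory ZIP archive.

    `results` maps each original filename to its extracted text.
    Assumption: the extension is dropped before appending `_ocr.txt`.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for original_name, text in results.items():
            member_name = f"{Path(original_name).stem}_ocr.txt"
            # writestr encodes str data as UTF-8, which suits CJK text.
            archive.writestr(member_name, text)
    return buffer.getvalue()
```

Building the archive in memory keeps the sketch free of temporary files; a real service might prefer streaming to disk for large batches.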
+
+### Requirement: JSON Export
+The system SHALL export OCR results as structured JSON with full metadata.
+
+#### Scenario: Export with metadata
+- **WHEN** user selects JSON export format
+- **THEN** the system generates JSON containing:
+  - File information (name, size, format)
+  - OCR results array with text, bounding boxes, confidence
+  - Processing metadata (timestamp, language, model version)
+  - Task status and statistics
+
+#### Scenario: JSON export example structure
+- **WHEN** export is generated
+- **THEN** JSON structure follows this format:
+```json
+{
+  "file_name": "document.png",
+  "file_size": 1024000,
+  "upload_time": "2025-01-01T10:00:00Z",
+  "processing_time": 2.5,
+  "language": "zh-TW",
+  "results": [
+    {
+      "text": "範例文字",
+      "bbox": [100, 50, 200, 80],
+      "confidence": 0.95
+    }
+  ],
+  "status": "completed"
+}
+```
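
One way to produce the structure shown above is sketched here. The names `OcrLine` and `build_export_json` are hypothetical and only mirror the fields in the example; the spec does not define them.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class OcrLine:
    """One recognized text region (hypothetical type for illustration)."""
    text: str
    bbox: List[int]
    confidence: float


def build_export_json(file_name: str, file_size: int, upload_time: str,
                      processing_time: float, language: str,
                      lines: List[OcrLine], status: str = "completed") -> str:
    """Serialize one OCR result using the field names from the spec example."""
    payload = {
        "file_name": file_name,
        "file_size": file_size,
        "upload_time": upload_time,
        "processing_time": processing_time,
        "language": language,
        "results": [asdict(line) for line in lines],
        "status": status,
    }
    # ensure_ascii=False keeps Chinese text readable instead of \u escapes.
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` is a design choice for human-readable exports; consumers that require pure ASCII would leave the default in place.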
+
+### Requirement: Excel Export
+The system SHALL export OCR results as Excel spreadsheets with tabular format.
+
+#### Scenario: Single file Excel export
+- **WHEN** user selects Excel export for one file
+- **THEN** the system generates .xlsx file with columns:
+  - Row Number
+  - Recognized Text
+  - Confidence Score
+  - Bounding Box (X, Y, Width, Height)
+  - Language
+
+#### Scenario: Batch Excel export with multiple sheets
+- **WHEN** user exports batch with 3 files as Excel
+- **THEN** the system creates one .xlsx file with 3 sheets
+- **AND** names each sheet as the original filename
+- **AND** includes summary sheet with statistics
+
+### Requirement: Rule-Based Output Formatting
+The system SHALL apply user-defined rules to format exported text.
+
+#### Scenario: Group by filename pattern
+- **WHEN** user defines rule "group files with prefix 'invoice_'"
+- **THEN** the system groups all matching files together
+- **AND** exports them in a single combined file or folder
+
+#### Scenario: Filter by confidence threshold
+- **WHEN** user sets export rule "minimum confidence 0.8"
+- **THEN** the system excludes text with confidence < 0.8 from export
+- **AND** includes only high-confidence results
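
The confidence rule above amounts to a simple filter. The sketch below assumes each result is a dict with a `confidence` key, as in the JSON example; the function name `filter_by_confidence` is hypothetical.

```python
from typing import Dict, List


def filter_by_confidence(lines: List[Dict], min_confidence: float = 0.8) -> List[Dict]:
    """Keep results whose confidence is at least `min_confidence`.

    Results strictly below the threshold are excluded, matching the
    "confidence < 0.8" wording in the scenario.
    """
    return [line for line in lines if line["confidence"] >= min_confidence]
```

A result with confidence exactly equal to the threshold is kept, since the scenario excludes only values below it.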
+
+#### Scenario: Custom text formatting
+- **WHEN** user defines rule "add line numbers"
+- **THEN** the system prepends line numbers to each text line
+- **AND** formats output as: `1. 第一行文字\n2. 第二行文字`
+
+#### Scenario: Sort by reading order
+- **WHEN** user enables "sort by position" rule
+- **THEN** the system orders text by vertical position (top to bottom)
+- **AND** then by horizontal position (left to right) within each row
+- **AND** exports text in natural reading order
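
The "sort by position" rule could be approximated as below. This is a sketch under stated assumptions: `bbox[0]` and `bbox[1]` are taken to be the left and top coordinates, and boxes whose top edges lie within a hypothetical `row_tolerance` (in pixels) of the first box in a row are treated as the same row. The spec defines neither the tolerance nor the grouping method.

```python
from typing import Dict, List


def sort_reading_order(lines: List[Dict], row_tolerance: int = 10) -> List[Dict]:
    """Order OCR results top to bottom, then left to right within each row."""
    by_top = sorted(lines, key=lambda line: line["bbox"][1])

    rows: List[List[Dict]] = []
    for line in by_top:
        # Compare against the first box of the current row to decide membership.
        if rows and abs(line["bbox"][1] - rows[-1][0]["bbox"][1]) <= row_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])

    ordered: List[Dict] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda line: line["bbox"][0]))
    return ordered
```

Grouping into rows before sorting horizontally avoids the case where two boxes on the same visual line are swapped because their top edges differ by a pixel or two.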
+
+### Requirement: Export Rule Configuration
+The system SHALL allow users to save and reuse export rules.
+
+#### Scenario: Save custom export rule
+- **WHEN** user creates a rule with name "高品質發票輸出" ("High-quality invoice output")
+- **THEN** the system saves the rule to database
+- **AND** associates it with the user account
+- **AND** makes it available in rule selection dropdown
+
+#### Scenario: Apply saved rule
+- **WHEN** user selects a saved rule for export
+- **THEN** the system applies all configured filters and formatting
+- **AND** generates output according to rule settings
+
+#### Scenario: Edit existing rule
+- **WHEN** user modifies a saved rule
+- **THEN** the system updates the rule configuration
+- **AND** preserves the rule ID for continuity
+
+### Requirement: Markdown Export with Structure and Images
+The system SHALL export OCR results as Markdown files preserving document logical structure with accompanying images.
+
+#### Scenario: Export as Markdown with structure and images
+- **WHEN** user selects Markdown export format
+- **THEN** the system generates .md file with logical structure
+- **AND** includes headings, paragraphs, tables, lists in proper hierarchy
+- **AND** embeds image references pointing to extracted images (`![](./images/img1.jpg)`)
+- **AND** maintains reading order from OCR analysis
+- **AND** includes extracted images in an images/ folder
+
+#### Scenario: Batch Markdown export with images
+- **WHEN** user exports batch with 5 files as Markdown
+- **THEN** the system creates 5 separate .md files
+- **AND** creates corresponding images/ folders for each document
+- **AND** optionally creates combined .md with page separators
+- **AND** returns ZIP file containing all Markdown files and images
+
+### Requirement: Searchable PDF Export with Images
+The system SHALL generate searchable PDF files that include extracted text and images, preserving logical document structure (not exact visual layout).
+
+#### Scenario: Single document PDF export with images
+- **WHEN** user requests PDF export from OCR result
+- **THEN** the system converts Markdown to HTML with basic CSS styling
+- **AND** embeds extracted images from images/ folder
+- **AND** generates PDF using Pandoc + WeasyPrint
+- **AND** preserves document hierarchy, tables, and reading order
+- **AND** images appear near their logical position in text flow
+- **AND** uses appropriate Chinese font (Noto Sans CJK)
+- **AND** produces searchable PDF with selectable text
+
+#### Scenario: Basic PDF formatting options
+- **WHEN** user selects PDF export
+- **THEN** the system applies basic readable formatting
+- **AND** sets standard margins and page size (A4)
+- **AND** uses consistent fonts and spacing
+- **AND** ensures images fit within page width
+- **NOTE** CSS templates are for basic readability, not for replicating original visual design
+
+#### Scenario: Batch PDF export with images
+- **WHEN** user exports batch as PDF
+- **THEN** the system generates individual PDF for each document with embedded images
+- **OR** creates single merged PDF with page breaks
+- **AND** maintains consistent formatting across all pages
+- **AND** returns ZIP of PDFs or single merged PDF
+
+### Requirement: Export Format Selection
+The system SHALL provide UI for selecting export format and options.
+
+#### Scenario: Format selection with preview
+- **WHEN** user opens export dialog
+- **THEN** the system displays format options (TXT, JSON, Excel, **Markdown with images, Searchable PDF**)
+- **AND** shows preview of output structure for selected format
+- **AND** allows applying custom rules for text filtering
+- **AND** provides basic formatting option for PDF (standard readable format)
+
+#### Scenario: Batch export with format choice
+- **WHEN** user selects multiple completed tasks
+- **THEN** the system enables batch export button
+- **AND** prompts for format selection
+- **AND** generates combined export file
+- **AND** shows progress bar for PDF generation (slower due to image processing)
+- **AND** includes all extracted images when exporting Markdown or PDF