chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to keep test/temp files out of future commits

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions


@@ -0,0 +1,175 @@
# Export Results Specification
## ADDED Requirements
### Requirement: Plain Text Export
The system SHALL export OCR results as plain text files with configurable formatting.
#### Scenario: Export single file result as TXT
- **WHEN** user selects a completed OCR task and chooses TXT export
- **THEN** the system generates a .txt file with extracted text
- **AND** preserves line breaks based on bounding box positions
- **AND** returns downloadable file
#### Scenario: Export batch results as TXT
- **WHEN** user exports a batch with 5 files as TXT
- **THEN** the system creates a ZIP file containing 5 .txt files
- **AND** names each file as `{original_filename}_ocr.txt`
- **AND** returns the ZIP for download
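The batch scenario above could be sketched as follows. This is a minimal illustration, not the project's implementation: `build_txt_zip` is a hypothetical helper, and it assumes `{original_filename}` means the uploaded name without its extension.

```python
import io
import zipfile
from pathlib import Path


def build_txt_zip(results: dict[str, str]) -> bytes:
    """Pack {original filename: extracted text} into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for original_name, text in results.items():
            # Assumption: the member name uses the filename stem, e.g. scan.pdf -> scan_ocr.txt
            member = f"{Path(original_name).stem}_ocr.txt"
            archive.writestr(member, text.encode("utf-8"))
    return buffer.getvalue()
```

Under this naming assumption, two uploads that share a stem (for example `a.png` and `a.jpg`) would map to the same member name, so a real implementation would need a collision policy.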
### Requirement: JSON Export
The system SHALL export OCR results as structured JSON with full metadata.
#### Scenario: Export with metadata
- **WHEN** user selects JSON export format
- **THEN** the system generates JSON containing:
  - File information (name, size, format)
  - OCR results array with text, bounding boxes, confidence
  - Processing metadata (timestamp, language, model version)
  - Task status and statistics
#### Scenario: JSON export example structure
- **WHEN** export is generated
- **THEN** JSON structure follows this format:
```json
{
  "file_name": "document.png",
  "file_size": 1024000,
  "upload_time": "2025-01-01T10:00:00Z",
  "processing_time": 2.5,
  "language": "zh-TW",
  "results": [
    {
      "text": "範例文字",
      "bbox": [100, 50, 200, 80],
      "confidence": 0.95
    }
  ],
  "status": "completed"
}
```
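One way to produce this structure is sketched below. The class names `OcrLine` and `ExportDocument` and the function `to_json` are hypothetical; only the field names come from the example. The sketch assumes the exporter should emit CJK text verbatim rather than as `\uXXXX` escapes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class OcrLine:
    text: str
    bbox: list[int]
    confidence: float


@dataclass
class ExportDocument:
    file_name: str
    file_size: int
    upload_time: str
    processing_time: float
    language: str
    results: list[OcrLine] = field(default_factory=list)
    status: str = "completed"


def to_json(document: ExportDocument) -> str:
    """Serialize the document with keys in the order the fields are declared."""
    # ensure_ascii=False keeps CJK text readable instead of escaping it
    return json.dumps(asdict(document), ensure_ascii=False, indent=2)
```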
### Requirement: Excel Export
The system SHALL export OCR results as Excel spreadsheets with tabular format.
#### Scenario: Single file Excel export
- **WHEN** user selects Excel export for one file
- **THEN** the system generates .xlsx file with columns:
  - Row Number
  - Recognized Text
  - Confidence Score
  - Bounding Box (X, Y, Width, Height)
  - Language
#### Scenario: Batch Excel export with multiple sheets
- **WHEN** user exports batch with 3 files as Excel
- **THEN** the system creates one .xlsx file with 3 sheets
- **AND** names each sheet as the original filename
- **AND** includes summary sheet with statistics
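Naming sheets after original filenames needs some care, because Excel limits worksheet names to 31 characters, rejects the characters `\ / ? * [ ] :`, and treats names as case-insensitive. The helper below is a hypothetical sketch of one sanitizing policy (replace invalid characters, truncate, add a numeric suffix on collision); the spec does not prescribe it.

```python
import re

_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")
MAX_SHEET_NAME_LENGTH = 31  # Excel's limit on worksheet name length


def unique_sheet_names(filenames: list[str]) -> list[str]:
    """Map filenames to valid, unique worksheet names, preserving input order."""
    used: set[str] = set()
    names: list[str] = []
    for filename in filenames:
        base = _INVALID_SHEET_CHARS.sub("_", filename)[:MAX_SHEET_NAME_LENGTH] or "Sheet"
        candidate, counter = base, 1
        # Sheet names are compared case-insensitively, so track lowercase forms.
        while candidate.lower() in used:
            counter += 1
            suffix = f"_{counter}"
            candidate = base[: MAX_SHEET_NAME_LENGTH - len(suffix)] + suffix
        used.add(candidate.lower())
        names.append(candidate)
    return names
```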
### Requirement: Rule-Based Output Formatting
The system SHALL apply user-defined rules to format exported text.
#### Scenario: Group by filename pattern
- **WHEN** user defines rule "group files with prefix 'invoice_'"
- **THEN** the system groups all matching files together
- **AND** exports them in a single combined file or folder
#### Scenario: Filter by confidence threshold
- **WHEN** user sets export rule "minimum confidence 0.8"
- **THEN** the system excludes text with confidence < 0.8 from export
- **AND** includes only high-confidence results
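A minimal sketch of the threshold rule, assuming each result is a dict with a `confidence` key as in the JSON export example. The function name is hypothetical. Results exactly at the threshold are kept, since the scenario excludes only values below it.

```python
def filter_by_confidence(results: list[dict], minimum: float = 0.8) -> list[dict]:
    """Keep only results whose confidence is at least `minimum`."""
    return [item for item in results if item["confidence"] >= minimum]
```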
#### Scenario: Custom text formatting
- **WHEN** user defines rule "add line numbers"
- **THEN** the system prepends line numbers to each text line
- **AND** formats output as: `1. 第一行文字\n2. 第二行文字`
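The line-numbering format above could be produced as follows; `add_line_numbers` is a hypothetical name, and numbering is assumed to start at 1.

```python
def add_line_numbers(lines: list[str]) -> str:
    """Prefix each line with "N. " and join the lines with newlines."""
    return "\n".join(f"{number}. {line}" for number, line in enumerate(lines, start=1))
```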
#### Scenario: Sort by reading order
- **WHEN** user enables "sort by position" rule
- **THEN** the system orders text by vertical position (top to bottom)
- **AND** then by horizontal position (left to right) within each row
- **AND** exports text in natural reading order
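The ordering rule can be sketched as a two-step sort: group boxes into rows by their top edge, then sort each row left to right. The function names and the `row_tolerance` parameter (in pixels) are hypothetical, and the sketch assumes only that `bbox[0]` is the left edge and `bbox[1]` is the top edge.

```python
def sort_reading_order(results: list[dict], row_tolerance: int = 10) -> list[list[dict]]:
    """Group results into rows (top to bottom), each row sorted left to right."""
    rows: list[list[dict]] = []
    for item in sorted(results, key=lambda r: r["bbox"][1]):
        # Same row if the top edge is within tolerance of the row's first (topmost) box.
        if rows and abs(item["bbox"][1] - rows[-1][0]["bbox"][1]) <= row_tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    return [sorted(row, key=lambda r: r["bbox"][0]) for row in rows]


def rows_to_text(rows: list[list[dict]], separator: str = " ") -> str:
    """Join each row's text with `separator` and put one row per line."""
    return "\n".join(separator.join(item["text"] for item in row) for row in rows)
```

For CJK text, passing `separator=""` is probably more appropriate than a space. A fixed pixel tolerance is the simplest choice here; scaling it by box height would be more robust for mixed font sizes.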
### Requirement: Export Rule Configuration
The system SHALL allow users to save and reuse export rules.
#### Scenario: Save custom export rule
- **WHEN** user creates a rule with name "高品質發票輸出" ("High-Quality Invoice Output")
- **THEN** the system saves the rule to database
- **AND** associates it with the user account
- **AND** makes it available in rule selection dropdown
#### Scenario: Apply saved rule
- **WHEN** user selects a saved rule for export
- **THEN** the system applies all configured filters and formatting
- **AND** generates output according to rule settings
#### Scenario: Edit existing rule
- **WHEN** user modifies a saved rule
- **THEN** the system updates the rule configuration
- **AND** preserves the rule ID for continuity
### Requirement: Markdown Export with Structure and Images
The system SHALL export OCR results as Markdown files preserving document logical structure with accompanying images.
#### Scenario: Export as Markdown with structure and images
- **WHEN** user selects Markdown export format
- **THEN** the system generates .md file with logical structure
- **AND** includes headings, paragraphs, tables, lists in proper hierarchy
- **AND** embeds image references pointing to extracted images (e.g., `![](./images/img1.jpg)`)
- **AND** maintains reading order from OCR analysis
- **AND** includes extracted images in an images/ folder
#### Scenario: Batch Markdown export with images
- **WHEN** user exports batch with 5 files as Markdown
- **THEN** the system creates 5 separate .md files
- **AND** creates corresponding images/ folders for each document
- **AND** optionally creates combined .md with page separators
- **AND** returns ZIP file containing all Markdown files and images
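A rough sketch of the Markdown rendering step, assuming the OCR analysis yields an ordered list of typed elements. The element schema (`type`, `level`, `text`, `items`, `file`) and the function name are hypothetical, and tables are left out for brevity.

```python
def render_markdown(elements: list[dict], image_dir: str = "./images") -> str:
    """Render ordered document elements as Markdown blocks separated by blank lines."""
    blocks: list[str] = []
    for element in elements:
        kind = element["type"]
        if kind == "heading":
            blocks.append("#" * element["level"] + " " + element["text"])
        elif kind == "paragraph":
            blocks.append(element["text"])
        elif kind == "list":
            blocks.append("\n".join(f"- {entry}" for entry in element["items"]))
        elif kind == "image":
            # Reference the extracted image relative to the .md file
            blocks.append(f"![]({image_dir}/{element['file']})")
        else:
            raise ValueError(f"unsupported element type: {kind}")
    return "\n\n".join(blocks) + "\n"
```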
### Requirement: Searchable PDF Export with Images
The system SHALL generate searchable PDF files that include extracted text and images, preserving logical document structure (not exact visual layout).
#### Scenario: Single document PDF export with images
- **WHEN** user requests PDF export from OCR result
- **THEN** the system converts Markdown to HTML with basic CSS styling
- **AND** embeds extracted images from images/ folder
- **AND** generates PDF using Pandoc + WeasyPrint
- **AND** preserves document hierarchy, tables, and reading order
- **AND** images appear near their logical position in text flow
- **AND** uses appropriate Chinese font (Noto Sans CJK)
- **AND** produces searchable PDF with selectable text
#### Scenario: Basic PDF formatting options
- **WHEN** user selects PDF export
- **THEN** the system applies basic readable formatting
- **AND** sets standard margins and page size (A4)
- **AND** uses consistent fonts and spacing
- **AND** ensures images fit within page width
- **NOTE** CSS templates are for basic readability, not for replicating original visual design
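The PDF scenarios above might be wired together as in the sketch below, which lets Pandoc convert Markdown to HTML and hand it to WeasyPrint via `--pdf-engine`. The helper names, the 2 cm margin, and the stylesheet contents are assumptions for illustration; "Noto Sans CJK TC" is assumed as the specific Noto Sans CJK family because the examples use `zh-TW`.

```python
import subprocess
from pathlib import Path

# Assumed basic stylesheet: A4 pages, CJK-capable font, images capped at page width.
BASIC_CSS = """\
@page { size: A4; margin: 2cm; }
body { font-family: "Noto Sans CJK TC", sans-serif; line-height: 1.5; }
img { max-width: 100%; }
table { border-collapse: collapse; }
"""


def pandoc_pdf_command(markdown_path: Path, pdf_path: Path, css_path: Path) -> list[str]:
    """Build the Pandoc invocation without running it."""
    return [
        "pandoc",
        str(markdown_path),
        "--from=markdown",
        "--pdf-engine=weasyprint",
        f"--css={css_path}",
        # Lets relative references such as ./images/img1.jpg resolve
        f"--resource-path={markdown_path.parent}",
        f"--output={pdf_path}",
    ]


def export_pdf(markdown_path: Path, pdf_path: Path) -> None:
    """Write the stylesheet next to the output and run Pandoc (requires pandoc and weasyprint)."""
    css_path = pdf_path.with_suffix(".css")
    css_path.write_text(BASIC_CSS, encoding="utf-8")
    subprocess.run(pandoc_pdf_command(markdown_path, pdf_path, css_path), check=True)
```

Keeping command construction separate from execution makes the invocation easy to inspect and test without either tool installed.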
#### Scenario: Batch PDF export with images
- **WHEN** user exports batch as PDF
- **THEN** the system generates individual PDF for each document with embedded images
- **OR** creates single merged PDF with page breaks
- **AND** maintains consistent formatting across all pages
- **AND** returns ZIP of PDFs or single merged PDF
### Requirement: Export Format Selection
The system SHALL provide UI for selecting export format and options.
#### Scenario: Format selection with preview
- **WHEN** user opens export dialog
- **THEN** the system displays format options (TXT, JSON, Excel, **Markdown with images, Searchable PDF**)
- **AND** shows preview of output structure for selected format
- **AND** allows applying custom rules for text filtering
- **AND** provides basic formatting option for PDF (standard readable format)
#### Scenario: Batch export with format choice
- **WHEN** user selects multiple completed tasks
- **THEN** the system enables batch export button
- **AND** prompts for format selection
- **AND** generates combined export file
- **AND** shows progress bar for PDF generation (slower due to image processing)
- **AND** includes all extracted images when exporting Markdown or PDF