chore: project cleanup and prepare for dual-track processing refactor
- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -0,0 +1,175 @@
# Export Results Specification

## ADDED Requirements

### Requirement: Plain Text Export
The system SHALL export OCR results as plain text files with configurable formatting.

#### Scenario: Export single file result as TXT
- **WHEN** user selects a completed OCR task and chooses TXT export
- **THEN** the system generates a .txt file with extracted text
- **AND** preserves line breaks based on bounding box positions
- **AND** returns downloadable file

#### Scenario: Export batch results as TXT
- **WHEN** user exports a batch with 5 files as TXT
- **THEN** the system creates a ZIP file containing 5 .txt files
- **AND** names each file as `{original_filename}_ocr.txt`
- **AND** returns the ZIP for download
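
The batch TXT scenario above could be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `build_txt_zip` is hypothetical, and it assumes the source extension is dropped before `_ocr.txt` is appended, which the spec does not state.

```python
import io
import zipfile
from pathlib import Path
from typing import Dict


def build_txt_zip(texts_by_filename: Dict[str, str]) -> bytes:
    """Pack one `{original_filename}_ocr.txt` entry per source file into a ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for filename, text in texts_by_filename.items():
            # Assumption: the extension is dropped, so "scan.png" becomes "scan_ocr.txt".
            archive.writestr(f"{Path(filename).stem}_ocr.txt", text)
    return buffer.getvalue()
```

Under this naming assumption, two uploads that differ only by extension (for example `a.png` and `a.jpg`) would map to the same entry name, so a real implementation would need a tie-breaking rule.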

### Requirement: JSON Export
The system SHALL export OCR results as structured JSON with full metadata.

#### Scenario: Export with metadata
- **WHEN** user selects JSON export format
- **THEN** the system generates JSON containing:
  - File information (name, size, format)
  - OCR results array with text, bounding boxes, confidence
  - Processing metadata (timestamp, language, model version)
  - Task status and statistics

#### Scenario: JSON export example structure
- **WHEN** export is generated
- **THEN** JSON structure follows this format:
```json
{
  "file_name": "document.png",
  "file_size": 1024000,
  "upload_time": "2025-01-01T10:00:00Z",
  "processing_time": 2.5,
  "language": "zh-TW",
  "results": [
    {
      "text": "範例文字",
      "bbox": [100, 50, 200, 80],
      "confidence": 0.95
    }
  ],
  "status": "completed"
}
```

### Requirement: Excel Export
The system SHALL export OCR results as Excel spreadsheets with tabular format.

#### Scenario: Single file Excel export
- **WHEN** user selects Excel export for one file
- **THEN** the system generates .xlsx file with columns:
  - Row Number
  - Recognized Text
  - Confidence Score
  - Bounding Box (X, Y, Width, Height)
  - Language

#### Scenario: Batch Excel export with multiple sheets
- **WHEN** user exports batch with 3 files as Excel
- **THEN** the system creates one .xlsx file with 3 sheets
- **AND** names each sheet as the original filename
- **AND** includes summary sheet with statistics

### Requirement: Rule-Based Output Formatting
The system SHALL apply user-defined rules to format exported text.

#### Scenario: Group by filename pattern
- **WHEN** user defines rule "group files with prefix 'invoice_'"
- **THEN** the system groups all matching files together
- **AND** exports them in a single combined file or folder

#### Scenario: Filter by confidence threshold
- **WHEN** user sets export rule "minimum confidence 0.8"
- **THEN** the system excludes text with confidence < 0.8 from export
- **AND** includes only high-confidence results
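
As a rough sketch of the confidence rule above, assuming each OCR result is a dict with a `confidence` key as in the JSON export example (the function name `filter_by_confidence` is hypothetical):

```python
from typing import Dict, List


def filter_by_confidence(results: List[Dict], minimum: float = 0.8) -> List[Dict]:
    """Keep results at or above the threshold; anything below it is excluded."""
    return [r for r in results if r["confidence"] >= minimum]
```

A result whose confidence is exactly 0.8 is kept, since the scenario only excludes values strictly below the threshold.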

#### Scenario: Custom text formatting
- **WHEN** user defines rule "add line numbers"
- **THEN** the system prepends line numbers to each text line
- **AND** formats output as: `1. 第一行文字\n2. 第二行文字` (that is, "1. First line of text" followed by "2. Second line of text")
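
One possible reading of the "add line numbers" rule, shown as a small sketch; the helper name `add_line_numbers` is an assumption, not part of the spec:

```python
def add_line_numbers(text: str) -> str:
    """Prefix each line with its 1-based number, e.g. '1. ...' then '2. ...'."""
    return "\n".join(
        f"{number}. {line}" for number, line in enumerate(text.splitlines(), start=1)
    )
```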

#### Scenario: Sort by reading order
- **WHEN** user enables "sort by position" rule
- **THEN** the system orders text by vertical position (top to bottom)
- **AND** then by horizontal position (left to right) within each row
- **AND** exports text in natural reading order
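
The ordering described above might be sketched like this. It assumes `bbox` starts with the left and top coordinates (`[x, y, ...]`) and introduces a hypothetical `row_tolerance` parameter so that regions whose top edges differ by a few pixels still count as one row; neither detail comes from the spec.

```python
from typing import Dict, List


def sort_reading_order(results: List[Dict], row_tolerance: float = 10.0) -> List[Dict]:
    """Order regions top to bottom, then left to right within each row."""
    by_top = sorted(results, key=lambda r: r["bbox"][1])
    rows: List[List[Dict]] = []
    for region in by_top:
        # Start a new row when the top edge is too far below the current row's first region.
        if rows and region["bbox"][1] - rows[-1][0]["bbox"][1] <= row_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])
    ordered: List[Dict] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda r: r["bbox"][0]))
    return ordered
```

Grouping into rows first avoids the case where a plain sort on `(y, x)` puts a right-hand region ahead of a left-hand one just because its top edge is a pixel or two higher.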

### Requirement: Export Rule Configuration
The system SHALL allow users to save and reuse export rules.

#### Scenario: Save custom export rule
- **WHEN** user creates a rule with name "高品質發票輸出" ("High-quality invoice output")
- **THEN** the system saves the rule to database
- **AND** associates it with the user account
- **AND** makes it available in rule selection dropdown

#### Scenario: Apply saved rule
- **WHEN** user selects a saved rule for export
- **THEN** the system applies all configured filters and formatting
- **AND** generates output according to rule settings

#### Scenario: Edit existing rule
- **WHEN** user modifies a saved rule
- **THEN** the system updates the rule configuration
- **AND** preserves the rule ID for continuity

### Requirement: Markdown Export with Structure and Images
The system SHALL export OCR results as Markdown files preserving document logical structure with accompanying images.

#### Scenario: Export as Markdown with structure and images
- **WHEN** user selects Markdown export format
- **THEN** the system generates .md file with logical structure
- **AND** includes headings, paragraphs, tables, lists in proper hierarchy
- **AND** embeds image references pointing to extracted images
- **AND** maintains reading order from OCR analysis
- **AND** includes extracted images in an images/ folder

#### Scenario: Batch Markdown export with images
- **WHEN** user exports batch with 5 files as Markdown
- **THEN** the system creates 5 separate .md files
- **AND** creates corresponding images/ folders for each document
- **AND** optionally creates combined .md with page separators
- **AND** returns ZIP file containing all Markdown files and images

### Requirement: Searchable PDF Export with Images
The system SHALL generate searchable PDF files that include extracted text and images, preserving logical document structure (not exact visual layout).

#### Scenario: Single document PDF export with images
- **WHEN** user requests PDF export from OCR result
- **THEN** the system converts Markdown to HTML with basic CSS styling
- **AND** embeds extracted images from images/ folder
- **AND** generates PDF using Pandoc + WeasyPrint
- **AND** preserves document hierarchy, tables, and reading order
- **AND** images appear near their logical position in text flow
- **AND** uses appropriate Chinese font (Noto Sans CJK)
- **AND** produces searchable PDF with selectable text

#### Scenario: Basic PDF formatting options
- **WHEN** user selects PDF export
- **THEN** the system applies basic readable formatting
- **AND** sets standard margins and page size (A4)
- **AND** uses consistent fonts and spacing
- **AND** ensures images fit within page width
- **NOTE** CSS templates are for basic readability, not for replicating original visual design

#### Scenario: Batch PDF export with images
- **WHEN** user exports batch as PDF
- **THEN** the system generates individual PDF for each document with embedded images
- **OR** creates single merged PDF with page breaks
- **AND** maintains consistent formatting across all pages
- **AND** returns ZIP of PDFs or single merged PDF

### Requirement: Export Format Selection
The system SHALL provide UI for selecting export format and options.

#### Scenario: Format selection with preview
- **WHEN** user opens export dialog
- **THEN** the system displays format options (TXT, JSON, Excel, **Markdown with images, Searchable PDF**)
- **AND** shows preview of output structure for selected format
- **AND** allows applying custom rules for text filtering
- **AND** provides basic formatting option for PDF (standard readable format)

#### Scenario: Batch export with format choice
- **WHEN** user selects multiple completed tasks
- **THEN** the system enables batch export button
- **AND** prompts for format selection
- **AND** generates combined export file
- **AND** shows progress bar for PDF generation (slower due to image processing)
- **AND** includes all extracted images when exporting Markdown or PDF
@@ -0,0 +1,96 @@
# File Management Specification

## ADDED Requirements

### Requirement: File Upload Validation
The system SHALL validate uploaded files for type, size, and content before processing.

#### Scenario: Valid image upload
- **WHEN** user uploads a PNG file of 5MB
- **THEN** the system accepts the file
- **AND** stores it in temporary upload directory
- **AND** returns upload success with file ID

#### Scenario: Oversized file rejection
- **WHEN** user uploads a file larger than 20MB
- **THEN** the system rejects the file
- **AND** returns error message "文件大小超過限制 (最大 20MB)" ("File size exceeds the limit (max 20MB)")
- **AND** does not store the file

#### Scenario: Invalid file type rejection
- **WHEN** user uploads a .exe or .zip file
- **THEN** the system rejects the file
- **AND** returns error message "不支援的文件類型,僅支援 PNG, JPG, JPEG, PDF" ("Unsupported file type; only PNG, JPG, JPEG, PDF are supported")

#### Scenario: Corrupted image detection
- **WHEN** user uploads a corrupted image file
- **THEN** the system attempts to open the file
- **AND** detects corruption during validation
- **AND** returns error message "文件損壞,無法處理" ("File is corrupted and cannot be processed")
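
The validation scenarios above could be approximated with the standard library alone. In this sketch the function name `validate_upload` is hypothetical, 20MB is assumed to mean 20 × 1024 × 1024 bytes, and corruption is detected only by checking the file's leading signature bytes. That is a much shallower check than actually opening the image, which is what the spec calls for.

```python
from pathlib import Path
from typing import Optional

MAX_SIZE_BYTES = 20 * 1024 * 1024  # assumption: "20MB" means 20 MiB

# Leading bytes that identify each supported format.
SIGNATURES = {
    ".png": b"\x89PNG\r\n\x1a\n",
    ".jpg": b"\xff\xd8\xff",
    ".jpeg": b"\xff\xd8\xff",
    ".pdf": b"%PDF-",
}


def validate_upload(filename: str, data: bytes) -> Optional[str]:
    """Return the spec's error message, or None when the file passes all checks."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SIGNATURES:
        return "不支援的文件類型,僅支援 PNG, JPG, JPEG, PDF"
    if len(data) > MAX_SIZE_BYTES:
        return "文件大小超過限制 (最大 20MB)"
    # Shallow stand-in for real corruption detection: compare the header bytes only.
    if not data.startswith(SIGNATURES[suffix]):
        return "文件損壞,無法處理"
    return None
```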

### Requirement: Supported File Formats
The system SHALL support PNG, JPG, JPEG, and PDF file formats for OCR processing.

#### Scenario: PNG image processing
- **WHEN** user uploads a .png file
- **THEN** the system processes it directly with PaddleOCR

#### Scenario: JPG/JPEG image processing
- **WHEN** user uploads a .jpg or .jpeg file
- **THEN** the system processes it directly with PaddleOCR

#### Scenario: PDF file processing
- **WHEN** user uploads a .pdf file
- **THEN** the system converts PDF pages to images using pdf2image
- **AND** processes each page image with PaddleOCR

### Requirement: Batch Upload Management
The system SHALL manage multiple file uploads with batch organization.

#### Scenario: Create batch from multiple files
- **WHEN** user uploads 5 files in a single request
- **THEN** the system creates a batch with unique batch_id
- **AND** associates all files with the batch_id
- **AND** returns batch_id and file list

#### Scenario: Query batch status
- **WHEN** user requests batch status by batch_id
- **THEN** the system returns:
  - Total files in batch
  - Completed count
  - Failed count
  - Processing count
  - Overall batch status (pending/processing/completed/failed)
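
A batch status response like the one listed above might be derived from per-file statuses as in the sketch below. The function name `summarize_batch` and the roll-up rule are assumptions: the spec names the four overall states but does not say how mixed outcomes map onto them. Here a batch counts as `failed` only when every file failed.

```python
from collections import Counter
from typing import Dict, List, Union


def summarize_batch(file_statuses: List[str]) -> Dict[str, Union[int, str]]:
    """Count per-file statuses and roll them up into one overall batch status."""
    counts = Counter(file_statuses)
    total = len(file_statuses)
    if total == 0 or counts["pending"] == total:
        overall = "pending"
    elif counts["pending"] or counts["processing"]:
        overall = "processing"  # some work is still outstanding
    elif counts["failed"] == total:
        overall = "failed"
    else:
        overall = "completed"
    return {
        "total": total,
        "completed": counts["completed"],
        "failed": counts["failed"],
        "processing": counts["processing"],
        "status": overall,
    }
```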

### Requirement: File Storage Management
The system SHALL store uploaded files temporarily and clean up after processing.

#### Scenario: Temporary file storage
- **WHEN** user uploads files
- **THEN** the system stores files in `uploads/{batch_id}/` directory
- **AND** generates unique filenames to prevent conflicts

#### Scenario: Automatic cleanup after processing
- **WHEN** OCR processing completes for a batch
- **THEN** the system keeps files for 24 hours
- **AND** automatically deletes files after retention period
- **AND** preserves OCR results in database
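
The 24-hour retention rule could be implemented along these lines. This is only a sketch: `cleanup_expired` and its arguments are hypothetical, and it assumes completion timestamps are tracked elsewhere (for example in the database) and passed in as a mapping from batch ID to completion time.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List

RETENTION = timedelta(hours=24)


def cleanup_expired(
    upload_root: Path, completed_at: Dict[str, datetime], now: datetime
) -> List[str]:
    """Delete `uploads/{batch_id}/` for batches past retention; return their IDs."""
    removed: List[str] = []
    for batch_id, finished in completed_at.items():
        if now - finished >= RETENTION:
            # Only uploaded files are removed; OCR results stored elsewhere are untouched.
            shutil.rmtree(upload_root / batch_id, ignore_errors=True)
            removed.append(batch_id)
    return removed
```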

#### Scenario: Manual file deletion
- **WHEN** user requests to delete a batch
- **THEN** the system removes all associated files from storage
- **AND** marks the batch as deleted in database
- **AND** returns deletion confirmation

### Requirement: File Access Control
The system SHALL ensure users can only access their own uploaded files.

#### Scenario: User accesses own files
- **WHEN** authenticated user requests file by file_id
- **THEN** the system verifies ownership
- **AND** returns file if user is the owner

#### Scenario: User attempts to access others' files
- **WHEN** user requests file_id belonging to another user
- **THEN** the system denies access
- **AND** returns 403 Forbidden error
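
The ownership check in the two scenarios above reduces to a comparison of user IDs. The `StoredFile` record and `authorize_file_access` function below are hypothetical names used only to illustrate the rule; a real handler would sit behind the project's authentication layer.

```python
from dataclasses import dataclass
from http import HTTPStatus


@dataclass
class StoredFile:
    file_id: str
    owner_id: str


def authorize_file_access(stored: StoredFile, user_id: str) -> int:
    """Return 200 when the requester owns the file, otherwise 403 Forbidden."""
    if stored.owner_id == user_id:
        return HTTPStatus.OK.value
    return HTTPStatus.FORBIDDEN.value
```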
@@ -0,0 +1,125 @@
# OCR Processing Specification

## ADDED Requirements

### Requirement: Multi-Language Text Recognition with Structure Analysis
The system SHALL extract text and images from document files using PaddleOCR-VL with support for 109 languages including Chinese (traditional and simplified), English, Japanese, and Korean, while preserving document logical structure and reading order (not pixel-perfect visual layout).

#### Scenario: Single image OCR with Chinese text
- **WHEN** user uploads a PNG image containing Chinese text
- **THEN** the system extracts text with bounding boxes and confidence scores
- **AND** returns structured JSON with recognized text, coordinates, and language detected
- **AND** generates Markdown output preserving text layout and hierarchy

#### Scenario: PDF document OCR with layout preservation
- **WHEN** user uploads a multi-page PDF file
- **THEN** the system processes each page with PaddleOCR-VL
- **AND** performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas)
- **AND** returns Markdown organized by page with preserved reading order
- **AND** provides JSON with detailed layout structure and bounding boxes

#### Scenario: Mixed language content
- **WHEN** user uploads an image with both Chinese and English text
- **THEN** the system detects and extracts text in both languages
- **AND** preserves the spatial relationship between text regions
- **AND** maintains proper reading order in output Markdown

#### Scenario: Complex document with tables and images
- **WHEN** user uploads a scanned document containing tables, images, and text
- **THEN** the system identifies layout elements (text blocks, tables, images, formulas)
- **AND** extracts table structure as Markdown tables
- **AND** extracts and saves document images as separate files
- **AND** embeds image references in Markdown
- **AND** preserves document hierarchy and reading order in Markdown output

### Requirement: Batch Processing
The system SHALL process multiple files concurrently with progress tracking and error handling.

#### Scenario: Batch upload success
- **WHEN** user uploads 10 image files simultaneously
- **THEN** the system creates a batch task with unique batch ID
- **AND** processes files in parallel (up to configured worker limit)
- **AND** returns real-time progress updates via WebSocket or polling

#### Scenario: Batch processing with partial failure
- **WHEN** a batch contains 5 valid images and 2 corrupted files
- **THEN** the system processes all valid files successfully
- **AND** logs errors for corrupted files with specific error messages
- **AND** marks the batch as "partially completed"
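
A minimal sketch of parallel batch processing with partial-failure handling, using a thread pool from the standard library. The `run_batch` function and its `ocr` callable are placeholders (the real OCR call is not shown in the spec), and a thread pool is just one way to honor a configured worker limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_batch(
    files: List[str], ocr: Callable[[str], str], max_workers: int = 4
) -> Dict[str, object]:
    """Run `ocr` on every file in parallel and record per-file errors."""
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(ocr, name) for name in files}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # one bad file must not abort the batch
                errors[name] = str(exc)
    if not errors:
        status = "completed"
    elif results:
        status = "partially completed"
    else:
        status = "failed"
    return {"status": status, "results": results, "errors": errors}
```

Each failure is captured per file, so the valid files still finish and the error messages remain available for logging.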

### Requirement: Image Preprocessing
The system SHALL provide optional image preprocessing to improve OCR accuracy.

#### Scenario: Low contrast image enhancement
- **WHEN** user enables preprocessing for a low-contrast image
- **THEN** the system applies contrast adjustment and denoising
- **AND** performs OCR on the enhanced image
- **AND** returns better accuracy compared to original

#### Scenario: Skipped preprocessing
- **WHEN** user disables preprocessing option
- **THEN** the system performs OCR directly on original image
- **AND** completes processing faster

### Requirement: Confidence Threshold Filtering
The system SHALL filter OCR results based on configurable confidence threshold.

#### Scenario: High confidence filter
- **WHEN** user sets confidence threshold to 0.8
- **THEN** the system returns only text segments with confidence >= 0.8
- **AND** discards low-confidence results

#### Scenario: Include all results
- **WHEN** user sets confidence threshold to 0.0
- **THEN** the system returns all recognized text regardless of confidence
- **AND** includes confidence scores in output

### Requirement: OCR Result Structure
The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata.

#### Scenario: Successful OCR result with multiple formats
- **WHEN** OCR processing completes successfully
- **THEN** the system returns JSON containing:
  - File metadata (name, size, format, upload timestamp)
  - Detected text regions with bounding boxes (x, y, width, height)
  - Recognized text content for each region
  - Confidence scores (0.0 to 1.0)
  - Language detected
  - Layout element types (title, paragraph, table, image, formula)
  - Reading order sequence
  - List of extracted image files with paths
  - Processing time
  - Task status (completed/failed/partial)
- **AND** generates Markdown file with logical structure
- **AND** saves extracted images to storage directory
- **AND** provides methods to export as searchable PDF with images

#### Scenario: Searchable PDF generation with images
- **WHEN** user requests PDF export from OCR results
- **THEN** the system converts Markdown to HTML with basic CSS styling
- **AND** embeds extracted images in their logical positions (not exact original positions)
- **AND** generates PDF using Pandoc + WeasyPrint
- **AND** preserves document hierarchy, tables, and reading order
- **AND** applies appropriate fonts for Chinese characters
- **AND** produces searchable PDF (text is selectable and searchable)

### Requirement: Document Translation (Reserved Architecture)
The system SHALL provide architecture and UI placeholders for future document translation features.

#### Scenario: Translation option visibility (UI placeholder)
- **WHEN** user views OCR result page
- **THEN** the system displays a "Translate Document" button (disabled or labeled "Coming Soon")
- **AND** shows target language selection dropdown (disabled)
- **AND** provides tooltip: "Translation feature will be available in future release"

#### Scenario: Translation API endpoint (reserved)
- **WHEN** backend API is queried for translation endpoints
- **THEN** the system provides `/api/v1/translate/document` endpoint specification
- **AND** returns "Not Implemented" (501) status when called
- **AND** documents expected request/response format for future implementation
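
The reserved endpoint's behavior can be illustrated without committing to a web framework. The handler name `translate_document_stub` and the response body shape below are assumptions; the spec fixes only the path and the 501 status.

```python
import json
from http import HTTPStatus
from typing import Tuple

TRANSLATE_PATH = "/api/v1/translate/document"


def translate_document_stub(request_body: dict) -> Tuple[int, str]:
    """Placeholder handler: always answers 501 until translation is implemented."""
    status = HTTPStatus.NOT_IMPLEMENTED
    body = {
        "error": status.phrase,  # "Not Implemented"
        "detail": "Translation feature will be available in future release",
    }
    return status.value, json.dumps(body)
```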

#### Scenario: Translation configuration storage (database schema)
- **WHEN** database schema is created
- **THEN** the system includes `translation_configs` table
- **AND** defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at
- **AND** table remains empty until translation feature is implemented
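
For illustration, the reserved table could look like this in SQLite. The column names come from the scenario above; the column types, constraints, and the choice of SQLite are assumptions, since the spec names neither the database engine nor the types.

```python
import sqlite3

# Column names follow the spec; types and constraints are illustrative guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS translation_configs (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    source_lang TEXT NOT NULL,
    target_lang TEXT NOT NULL,
    engine_type TEXT NOT NULL,
    engine_config TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the reserved table; no rows are inserted."""
    conn.executescript(SCHEMA)
```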