egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 20:02:31 +08:00

OCR Processing Specification

ADDED Requirements

Requirement: Multi-Language Text Recognition with Structure Analysis

The system SHALL extract text and images from document files using PaddleOCR-VL with support for 109 languages including Chinese (traditional and simplified), English, Japanese, and Korean, while preserving document logical structure and reading order (not pixel-perfect visual layout).

Scenario: Single image OCR with Chinese text

WHEN user uploads a PNG image containing Chinese text
THEN the system extracts text with bounding boxes and confidence scores
AND returns structured JSON with recognized text, coordinates, and language detected
AND generates Markdown output preserving logical text structure and hierarchy

Scenario: PDF document OCR with layout preservation

WHEN user uploads a multi-page PDF file
THEN the system processes each page with PaddleOCR-VL
AND performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas)
AND returns Markdown organized by page with preserved reading order
AND provides JSON with detailed layout structure and bounding boxes

Scenario: Mixed language content

WHEN user uploads an image with both Chinese and English text
THEN the system detects and extracts text in both languages
AND preserves the spatial relationship between text regions
AND maintains proper reading order in output Markdown

Scenario: Complex document with tables and images

WHEN user uploads a scanned document containing tables, images, and text
THEN the system identifies layout elements (text blocks, tables, images, formulas)
AND extracts table structure as Markdown tables
AND extracts and saves document images as separate files
AND embeds references to the extracted images in the Markdown output
AND preserves document hierarchy and reading order in Markdown output
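
The scenario above can be illustrated with a minimal sketch of how analyzed layout elements might be turned into Markdown in reading order. The `LayoutElement` class, its field names, and the `render_markdown` helper are hypothetical and are not part of PaddleOCR-VL or this specification; formulas and other element types simply fall through as plain text here.

```python
from dataclasses import dataclass, field


@dataclass
class LayoutElement:
    """Hypothetical container for one element found by layout analysis."""
    kind: str                      # "title", "paragraph", "table", "image", ...
    order: int                     # reading-order index
    text: str = ""                 # recognized text for titles and paragraphs
    rows: list = field(default_factory=list)  # table cells; first row is the header
    image_path: str = ""           # path where the extracted image was saved


def table_to_markdown(rows):
    """Render a list of rows as a Markdown pipe table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def render_markdown(elements):
    """Emit elements sorted by reading order, one Markdown block per element."""
    parts = []
    for el in sorted(elements, key=lambda e: e.order):
        if el.kind == "title":
            parts.append(f"# {el.text}")
        elif el.kind == "table":
            parts.append(table_to_markdown(el.rows))
        elif el.kind == "image":
            parts.append(f"![]({el.image_path})")
        else:
            parts.append(el.text)
    return "\n\n".join(parts)
```

Sorting on an explicit reading-order index, rather than on bounding-box coordinates, matches the stated goal of preserving logical structure instead of pixel-perfect layout.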

Requirement: Batch Processing

The system SHALL process multiple files concurrently with progress tracking and error handling.

Scenario: Batch upload success

WHEN user uploads 10 image files simultaneously
THEN the system creates a batch task with unique batch ID
AND processes files in parallel (up to configured worker limit)
AND returns real-time progress updates via WebSocket or polling

Scenario: Batch processing with partial failure

WHEN a batch contains 5 valid images and 2 corrupted files
THEN the system processes all valid files successfully
AND logs errors for corrupted files with specific error messages
AND marks the batch as "partially completed"
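
One way the batch scenarios could be realized is sketched below with a thread pool. The `run_batch` function, its `process_file` and `on_progress` callbacks, and the returned dictionary keys are assumptions made for illustration; only the "partially completed" status string comes from the scenario above.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(paths, process_file, max_workers=4, on_progress=None):
    """Process files in parallel and collect per-file results and errors."""
    batch_id = str(uuid.uuid4())
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_file, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; record the failure
                errors[path] = str(exc)
            if on_progress is not None:
                on_progress(done, len(paths))

    if not errors:
        status = "completed"
    elif results:
        status = "partially completed"
    else:
        status = "failed"
    return {
        "batch_id": batch_id,
        "status": status,
        "results": results,
        "errors": errors,
    }
```

A failure in one file is recorded and does not cancel the remaining work, which is what allows the batch to end in the "partially completed" state. The `on_progress` callback is where a WebSocket push or a polling counter could be wired in.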

Requirement: Image Preprocessing

The system SHALL provide optional image preprocessing to improve OCR accuracy.

Scenario: Low contrast image enhancement

WHEN user enables preprocessing for a low-contrast image
THEN the system applies contrast adjustment and denoising
AND performs OCR on the enhanced image
AND returns results with higher recognition accuracy than OCR on the original image

Scenario: Skipped preprocessing

WHEN user disables preprocessing option
THEN the system performs OCR directly on original image
AND completes processing faster than with preprocessing enabled
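
As a rough illustration of optional preprocessing, the sketch below applies a linear contrast stretch to a grayscale image held as nested lists. This is a simplified stand-in: the specification does not name an algorithm, denoising is omitted, and a real pipeline would more likely use an imaging library. `stretch_contrast`, `run_ocr`, and the `recognize` callback are hypothetical names.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values so they span the full 0-255 range."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in pixels]
    scale = 255 / (hi - lo)
    return [[round((value - lo) * scale) for value in row] for row in pixels]


def run_ocr(pixels, recognize, preprocess=False):
    """Run the recognizer on the enhanced image only when preprocessing is on."""
    if preprocess:
        pixels = stretch_contrast(pixels)
    return recognize(pixels)
```

Keeping preprocessing behind a flag means the disabled path passes the original image straight to the recognizer and skips the extra work.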

Requirement: Confidence Threshold Filtering

The system SHALL filter OCR results based on configurable confidence threshold.

Scenario: High confidence filter

WHEN user sets confidence threshold to 0.8
THEN the system returns only text segments with confidence >= 0.8
AND discards low-confidence results

Scenario: Include all results

WHEN user sets confidence threshold to 0.0
THEN the system returns all recognized text regardless of confidence
AND includes confidence scores in output
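
The two threshold scenarios reduce to a single inclusive comparison, sketched here. The segment dictionaries and the `filter_by_confidence` name are illustrative assumptions; the `>=` comparison and the 0.0 to 1.0 range follow the requirement text.

```python
def filter_by_confidence(segments, threshold=0.0):
    """Keep segments whose confidence is at or above the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [seg for seg in segments if seg["confidence"] >= threshold]
```

With a threshold of 0.0 every segment passes, and each one still carries its confidence score in the output.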

Requirement: OCR Result Structure

The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata.

Scenario: Successful OCR result with multiple formats

WHEN OCR processing completes successfully
THEN the system returns JSON containing:
- File metadata (name, size, format, upload timestamp)
- Detected text regions with bounding boxes (x, y, width, height)
- Recognized text content for each region
- Confidence scores (0.0 to 1.0)
- Language detected
- Layout element types (title, paragraph, table, image, formula)
- Reading order sequence
- List of extracted image files with paths
- Processing time
- Task status (completed/failed/partial)
AND generates Markdown file with logical structure
AND saves extracted images to storage directory
AND provides methods to export as searchable PDF with images
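
The JSON fields listed above could be modeled roughly as follows. All class and field names (`TextRegion`, `OCRResult`, `processing_time_s`, and so on) are assumptions chosen to mirror the list; the specification itself does not fix a schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TextRegion:
    text: str
    bbox: tuple            # (x, y, width, height)
    confidence: float      # 0.0 to 1.0
    element_type: str      # title, paragraph, table, image, formula
    reading_order: int


@dataclass
class OCRResult:
    file_name: str
    file_size: int
    file_format: str
    uploaded_at: str       # ISO 8601 timestamp
    language: str
    status: str            # completed, failed, or partial
    processing_time_s: float
    regions: list = field(default_factory=list)
    images: list = field(default_factory=list)  # paths of extracted image files

    def to_json(self):
        # ensure_ascii=False keeps Chinese, Japanese, and Korean text readable.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

A short usage example under the same assumptions:

```python
result = OCRResult(
    file_name="scan.png", file_size=20480, file_format="png",
    uploaded_at="2025-11-18T20:02:31+08:00", language="zh",
    status="completed", processing_time_s=1.2,
    regions=[TextRegion("中文", (10, 20, 200, 40), 0.97, "paragraph", 0)],
    images=["images/fig1.png"],
)
print(result.to_json())
```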

Scenario: Searchable PDF generation with images

WHEN user requests PDF export from OCR results
THEN the system converts Markdown to HTML with basic CSS styling
AND embeds extracted images in their logical positions (not exact original positions)
AND generates PDF using Pandoc + WeasyPrint
AND preserves document hierarchy, tables, and reading order
AND applies appropriate fonts for Chinese characters
AND produces searchable PDF (text is selectable and searchable)
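
A possible shape for the Pandoc + WeasyPrint step is sketched below. It assumes both tools are installed and that Pandoc is invoked with its `--pdf-engine=weasyprint` and `--css` options; the function names and file paths are hypothetical, and the project's actual invocation may differ.

```python
import shutil
import subprocess


def build_pdf_command(markdown_path, pdf_path, css_path=None):
    """Build the Pandoc command line for an HTML-based PDF via WeasyPrint."""
    cmd = ["pandoc", markdown_path, "-o", pdf_path, "--pdf-engine=weasyprint"]
    if css_path:
        # The stylesheet is where fonts for Chinese characters would be declared.
        cmd += ["--css", css_path]
    return cmd


def export_pdf(markdown_path, pdf_path, css_path=None):
    """Run Pandoc, failing early with a clear message if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    subprocess.run(build_pdf_command(markdown_path, pdf_path, css_path), check=True)
```

Because the PDF is rendered from text-bearing HTML rather than from page images, the text in the output stays selectable and searchable.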

Requirement: Document Translation (Reserved Architecture)

The system SHALL provide architecture and UI placeholders for future document translation features.

Scenario: Translation option visibility (UI placeholder)

WHEN user views OCR result page
THEN the system displays a "Translate Document" button (disabled or labeled "Coming Soon")
AND shows target language selection dropdown (disabled)
AND provides tooltip: "Translation feature will be available in future release"

Scenario: Translation API endpoint (reserved)

WHEN backend API is queried for translation endpoints
THEN the system provides /api/v1/translate/document endpoint specification
AND returns HTTP status 501 (Not Implemented) when called
AND documents expected request/response format for future implementation
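
The reserved endpoint's behavior can be sketched independently of any web framework. The `handle_request` function and the JSON body shape are assumptions for illustration; only the path and the 501 status come from the scenario.

```python
import json
from http import HTTPStatus

TRANSLATE_PATH = "/api/v1/translate/document"


def handle_request(path):
    """Return (status_code, json_body) for a request path."""
    if path == TRANSLATE_PATH:
        status = HTTPStatus.NOT_IMPLEMENTED
        body = {
            "error": status.phrase,
            "detail": "Translation feature will be available in future release",
        }
    else:
        status = HTTPStatus.NOT_FOUND
        body = {"error": status.phrase}
    return int(status), json.dumps(body)
```

Returning 501 rather than 404 tells clients that the route is reserved and documented but not yet functional.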

Scenario: Translation configuration storage (database schema)

WHEN database schema is created
THEN the system includes translation_configs table
AND defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at
AND table remains empty until translation feature is implemented
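
For illustration, the reserved table could be created as follows using SQLite. The column names come from the scenario above; the column types, constraints, and the choice of SQLite are assumptions, since the specification does not name a database engine.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS translation_configs (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id       INTEGER NOT NULL,
    source_lang   TEXT NOT NULL,
    target_lang   TEXT NOT NULL,
    engine_type   TEXT NOT NULL,
    engine_config TEXT,  -- assumed to hold engine settings serialized as JSON
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def create_schema(conn):
    """Create the reserved table; no rows are inserted."""
    conn.execute(SCHEMA)
    conn.commit()
```

Using `CREATE TABLE IF NOT EXISTS` keeps the schema step safe to run repeatedly while the table stays empty.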
