OCR/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/ocr-processing/spec.md
egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor
- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

2025-11-18 20:02:31 +08:00


OCR Processing Specification

ADDED Requirements

Requirement: Multi-Language Text Recognition with Structure Analysis

The system SHALL extract text and images from document files using PaddleOCR-VL, with support for 109 languages including Chinese (Traditional and Simplified), English, Japanese, and Korean, while preserving the document's logical structure and reading order rather than a pixel-perfect visual layout.

Scenario: Single image OCR with Chinese text

  • WHEN user uploads a PNG image containing Chinese text
  • THEN the system extracts text with bounding boxes and confidence scores
  • AND returns structured JSON with recognized text, coordinates, and detected language
  • AND generates Markdown output preserving text layout and hierarchy

Scenario: PDF document OCR with layout preservation

  • WHEN user uploads a multi-page PDF file
  • THEN the system processes each page with PaddleOCR-VL
  • AND performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas)
  • AND returns Markdown organized by page with preserved reading order
  • AND provides JSON with detailed layout structure and bounding boxes

Scenario: Mixed language content

  • WHEN user uploads an image with both Chinese and English text
  • THEN the system detects and extracts text in both languages
  • AND preserves the spatial relationship between text regions
  • AND maintains proper reading order in output Markdown

Scenario: Complex document with tables and images

  • WHEN user uploads a scanned document containing tables, images, and text
  • THEN the system identifies layout elements (text blocks, tables, images, formulas)
  • AND extracts table structure as Markdown tables
  • AND extracts and saves document images as separate files
  • AND embeds references to the saved image files in the Markdown output
  • AND preserves document hierarchy and reading order in Markdown output
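
The scenario above can be illustrated with a minimal sketch that turns detected layout elements into Markdown in reading order. The `LayoutElement` class, its field names, and the choice of a single `#` heading level are assumptions made for illustration; they are not taken from PaddleOCR-VL or from this project's code.

```python
from dataclasses import dataclass, field


@dataclass
class LayoutElement:
    """One detected layout element (hypothetical shape, for illustration)."""
    kind: str                                   # "title", "paragraph", "table", "image", "formula"
    order: int                                  # position in reading order
    text: str = ""                              # text of titles, paragraphs, formulas
    rows: list = field(default_factory=list)    # table rows, header row first
    path: str = ""                              # saved file path for images


def table_to_markdown(rows):
    """Render rows (header first) as a pipe-delimited Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def render_markdown(elements):
    """Emit one Markdown block per element, sorted by reading order."""
    blocks = []
    for el in sorted(elements, key=lambda e: e.order):
        if el.kind == "title":
            blocks.append("# " + el.text)
        elif el.kind == "table":
            blocks.append(table_to_markdown(el.rows))
        elif el.kind == "image":
            blocks.append(f"![]({el.path})")
        else:
            # Paragraphs and formulas are emitted as plain text in this sketch.
            blocks.append(el.text)
    return "\n\n".join(blocks)
```

Sorting on an explicit `order` field keeps the rendering step independent of how reading order was determined upstream.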

Requirement: Batch Processing

The system SHALL process multiple files concurrently with progress tracking and error handling.

Scenario: Batch upload success

  • WHEN user uploads 10 image files simultaneously
  • THEN the system creates a batch task with unique batch ID
  • AND processes files in parallel (up to configured worker limit)
  • AND returns real-time progress updates via WebSocket or polling
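
As a rough sketch of this scenario, the function below assigns a batch ID, runs files through a bounded thread pool, and reports progress after each file finishes. The names `run_batch`, `ocr_fn`, and `on_progress` are hypothetical, and the callback stands in for whatever WebSocket or polling mechanism the system actually uses.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(paths, ocr_fn, max_workers=4, on_progress=None):
    """Process files in parallel and return (batch_id, results keyed by path)."""
    batch_id = uuid.uuid4().hex  # unique batch ID
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ocr_fn, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = {"status": "completed", "output": future.result()}
            except Exception as exc:
                # A failing file is recorded but does not stop the batch.
                results[path] = {"status": "failed", "error": str(exc)}
            if on_progress:
                on_progress(batch_id, done, len(paths))
    return batch_id, results
```

A thread pool is only one option; a process pool or a task queue may suit CPU-bound or GPU-bound OCR better.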

Scenario: Batch processing with partial failure

  • WHEN a batch contains 5 valid images and 2 corrupted files
  • THEN the system processes all valid files successfully
  • AND logs errors for corrupted files with specific error messages
  • AND marks the batch as "partially completed"
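
One way to derive the batch-level status from per-file outcomes is sketched below. The per-file status strings `"completed"` and `"failed"` and the function name are assumptions; only the label "partially completed" comes from the scenario.

```python
def batch_status(file_statuses):
    """Summarize per-file statuses into a single batch status."""
    statuses = list(file_statuses)
    if not statuses:
        raise ValueError("a batch must contain at least one file")
    failed = sum(1 for s in statuses if s == "failed")
    if failed == 0:
        return "completed"
    if failed == len(statuses):
        return "failed"
    return "partially completed"
```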

Requirement: Image Preprocessing

The system SHALL provide optional image preprocessing to improve OCR accuracy.

Scenario: Low contrast image enhancement

  • WHEN user enables preprocessing for a low-contrast image
  • THEN the system applies contrast adjustment and denoising
  • AND performs OCR on the enhanced image
  • AND returns higher recognition accuracy than OCR on the original, unprocessed image
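
To make "contrast adjustment and denoising" concrete, here is a dependency-free sketch on a grayscale image stored as a list of rows of 0-255 integers. It uses linear contrast stretching and a 3x3 median filter, which are techniques chosen for illustration, not necessarily what the system applies. A real pipeline would more likely use an image library, and all function names here are hypothetical.

```python
from statistics import median


def stretch_contrast(pixels):
    """Linearly rescale grayscale values so they span the full 0-255 range."""
    flat = [v for row in pixels for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in pixels]  # flat image: nothing to stretch
    scale = 255 / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in pixels]


def median_denoise(pixels):
    """Replace each pixel with the median of its 3x3 neighborhood (clipped at edges)."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                pixels[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


def preprocess(pixels, enabled=True):
    """Optional preprocessing step; returns the input unchanged when disabled."""
    if not enabled:
        return pixels
    # Denoise first so isolated outliers do not dominate the contrast range.
    return stretch_contrast(median_denoise(pixels))
```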

Scenario: Skipped preprocessing

  • WHEN user disables preprocessing option
  • THEN the system performs OCR directly on original image
  • AND completes processing faster than with preprocessing enabled

Requirement: Confidence Threshold Filtering

The system SHALL filter OCR results based on configurable confidence threshold.

Scenario: High confidence filter

  • WHEN user sets confidence threshold to 0.8
  • THEN the system returns only text segments with confidence >= 0.8
  • AND discards low-confidence results

Scenario: Include all results

  • WHEN user sets confidence threshold to 0.0
  • THEN the system returns all recognized text regardless of confidence
  • AND includes confidence scores in output
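
The two scenarios above reduce to a single inclusive comparison. The sketch below assumes each text segment is a dict with a `confidence` key, which is a hypothetical shape.

```python
def filter_by_confidence(segments, threshold=0.0):
    """Keep segments whose confidence is at least `threshold` (inclusive)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [s for s in segments if s["confidence"] >= threshold]
```

With a threshold of 0.0 every segment passes, so the confidence scores stay available in the output for downstream inspection.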

Requirement: OCR Result Structure

The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata.

Scenario: Successful OCR result with multiple formats

  • WHEN OCR processing completes successfully
  • THEN the system returns JSON containing:
    • File metadata (name, size, format, upload timestamp)
    • Detected text regions with bounding boxes (x, y, width, height)
    • Recognized text content for each region
    • Confidence scores (0.0 to 1.0)
    • Detected language
    • Layout element types (title, paragraph, table, image, formula)
    • Reading order sequence
    • List of extracted image files with paths
    • Processing time
    • Task status (completed/failed/partial)
  • AND generates Markdown file with logical structure
  • AND saves extracted images to storage directory
  • AND provides methods to export as searchable PDF with images
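
The JSON fields listed above could be modeled roughly as follows. Every class and field name in this sketch is an assumption chosen to mirror the list; the spec does not fix the actual schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TextRegion:
    text: str             # recognized text content
    bbox: dict            # {"x": ..., "y": ..., "width": ..., "height": ...}
    confidence: float     # 0.0 to 1.0
    element_type: str     # title, paragraph, table, image, formula
    reading_order: int    # position in the reading order sequence


@dataclass
class OcrResult:
    file: dict                                    # name, size, format, upload timestamp
    language: str                                 # detected language
    regions: list = field(default_factory=list)   # list of TextRegion
    images: list = field(default_factory=list)    # paths of extracted image files
    processing_time_s: float = 0.0
    status: str = "completed"                     # completed / failed / partial

    def to_json(self):
        # ensure_ascii=False keeps CJK text readable in the serialized output.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```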

Scenario: Searchable PDF generation with images

  • WHEN user requests PDF export from OCR results
  • THEN the system converts Markdown to HTML with basic CSS styling
  • AND embeds extracted images in their logical positions (not exact original positions)
  • AND generates PDF using Pandoc + WeasyPrint
  • AND preserves document hierarchy, tables, and reading order
  • AND applies appropriate fonts for Chinese characters
  • AND produces searchable PDF (text is selectable and searchable)
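
A minimal sketch of the Pandoc + WeasyPrint step is shown below, assuming both tools are installed and on `PATH`. Pandoc's `--pdf-engine=weasyprint` option renders through HTML, and `--css` attaches a stylesheet. The function names and file paths are hypothetical.

```python
import shutil
import subprocess


def build_pdf_command(markdown_path, pdf_path, css_path=None):
    """Build the pandoc command line for Markdown -> HTML -> PDF via WeasyPrint."""
    cmd = [
        "pandoc", markdown_path,
        "--from", "markdown",
        "--output", pdf_path,
        "--pdf-engine=weasyprint",
    ]
    if css_path:
        cmd += ["--css", css_path]  # stylesheet, e.g. one declaring a CJK-capable font
    return cmd


def export_pdf(markdown_path, pdf_path, css_path=None):
    """Run pandoc; raises if pandoc is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(build_pdf_command(markdown_path, pdf_path, css_path), check=True)
```

Because the PDF is produced from HTML text rather than from page images, the text in the output remains selectable and searchable, provided the chosen font covers the characters in the document.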

Requirement: Document Translation (Reserved Architecture)

The system SHALL provide architecture and UI placeholders for future document translation features.

Scenario: Translation option visibility (UI placeholder)

  • WHEN user views OCR result page
  • THEN the system displays a "Translate Document" button (disabled or labeled "Coming Soon")
  • AND shows target language selection dropdown (disabled)
  • AND provides tooltip: "Translation feature will be available in future release"

Scenario: Translation API endpoint (reserved)

  • WHEN backend API is queried for translation endpoints
  • THEN the system provides /api/v1/translate/document endpoint specification
  • AND returns "Not Implemented" (501) status when called
  • AND documents expected request/response format for future implementation
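
The reserved endpoint behavior can be sketched without committing to a web framework. The handler below is a hypothetical stand-in for a real route; only the path and the 501 status come from the scenario, and the response body is an assumption.

```python
import json
from http import HTTPStatus

TRANSLATE_PATH = "/api/v1/translate/document"


def handle_request(path):
    """Return (status_code, json_body) for a request to the given path."""
    if path == TRANSLATE_PATH:
        status = HTTPStatus.NOT_IMPLEMENTED  # 501: reserved for a future release
        body = {"detail": "Translation is not implemented yet"}
    else:
        status = HTTPStatus.NOT_FOUND
        body = {"detail": "Unknown endpoint"}
    return int(status), json.dumps(body)
```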

Scenario: Translation configuration storage (database schema)

  • WHEN database schema is created
  • THEN the system includes translation_configs table
  • AND defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at
  • AND table remains empty until translation feature is implemented
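
The table described above might look like the following, shown here with SQLite for a self-contained example. The column names come from the scenario; the column types, constraints, and the choice of SQLite are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS translation_configs (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    source_lang   TEXT NOT NULL,
    target_lang   TEXT NOT NULL,
    engine_type   TEXT NOT NULL,
    engine_config TEXT,  -- assumed to hold serialized engine settings
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def create_schema(conn):
    """Create the reserved table; no rows are inserted."""
    conn.executescript(SCHEMA)


def column_names(conn, table):
    """List column names of a table using SQLite's table_info pragma."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
```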