
OCR/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md

egg 8265be1741 test

2025-12-04 18:00:37 +08:00


Tasks: PDF Preprocessing Pipeline

Phase 1: Object-level Cleaning (P0)

Step 1.1: Content Sanitization

Add page.clean_contents(sanitize=True) to _extract_page()
Add error handling for malformed content streams
Add logging for sanitization actions
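A minimal sketch of how the sanitization call could be wrapped so that a malformed content stream does not abort extraction. The helper name `sanitize_page` and the logger name are hypothetical; only `page.clean_contents(sanitize=True)` comes from the task list, and the page object is assumed to be a PyMuPDF page.

```python
import logging

logger = logging.getLogger("pdf_preprocess")  # hypothetical logger name


def sanitize_page(page, page_num: int) -> bool:
    """Try to clean a page's content stream; return True on success.

    `page` is assumed to expose clean_contents(sanitize=...) as a
    PyMuPDF page does. On failure the page is left as-is so the caller
    can still attempt extraction on the raw content.
    """
    try:
        page.clean_contents(sanitize=True)
    except Exception as exc:  # malformed streams may raise library-specific errors
        logger.warning(
            "Page %d: sanitization failed, continuing with raw content: %s",
            page_num, exc,
        )
        return False
    logger.debug("Page %d: content stream sanitized", page_num)
    return True
```

Inside `_extract_page()` this would be called once before any text is read, with the return value recorded if the caller wants to report which pages could not be sanitized.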

Step 1.2: Hidden Layer (OCG) Removal

Implement get_hidden_ocg_layers() function
Add OCG content filtering during extraction (deferred - needs test case)
Add configuration option remove_hidden_layers
Add logging for removed layers
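One possible shape for `get_hidden_ocg_layers()`, assuming the document object is a PyMuPDF document whose `get_ocgs()` returns a dict keyed by xref with `name` and `on` entries. Treating "hidden" as "default state is off" is an assumption of this sketch, not something the task list specifies.

```python
def get_hidden_ocg_layers(doc) -> list[str]:
    """Return the names of optional content groups that are switched off.

    Assumes doc.get_ocgs() returns {xref: {"name": str, "on": bool, ...}}.
    Documents without optional content yield an empty list.
    """
    try:
        ocgs = doc.get_ocgs() or {}
    except Exception:
        # Treat an unreadable OCG dictionary as "no hidden layers".
        return []
    return [
        info.get("name", f"xref-{xref}")
        for xref, info in ocgs.items()
        if not info.get("on", True)
    ]
```

The returned names can be logged directly and, when `remove_hidden_layers` is enabled, used to decide which content to skip.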

Step 1.3: White-out Detection

Implement detect_whiteout_covered_text() with IoU calculation
Add white rectangle detection from page.get_drawings()
Integrate covered text filtering into extraction
Add configuration option whiteout_iou_threshold (default 0.8)
Add logging for detected white-out regions
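The IoU check in this step can be sketched in plain Python. Bounding boxes are assumed to be `(x0, y0, x1, y1)` tuples, and the function returns indices of covered text boxes; both choices are illustrative rather than taken from the implementation.

```python
from typing import Sequence

BBox = Sequence[float]  # (x0, y0, x1, y1)


def bbox_iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detect_whiteout_covered_text(
    text_boxes: Sequence[BBox],
    white_rects: Sequence[BBox],
    iou_threshold: float = 0.8,
) -> list[int]:
    """Return indices of text boxes overlapped by a white rectangle."""
    return [
        i
        for i, text_box in enumerate(text_boxes)
        if any(bbox_iou(text_box, rect) >= iou_threshold for rect in white_rects)
    ]
```

With the default threshold of 0.8, a white rectangle must match the text box closely in both position and size. A large rectangle covering several lines would score a low IoU against each individual line, so a lower threshold or a coverage ratio may be worth testing on such documents.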

Phase 2: Layout Analysis (P1)

Step 2.1: Column-aware Sorting

Change get_text() calls to use sort=True parameter (already implemented)
Verify reading order improvement on test documents
Add configuration option column_aware_sort (deferred - low priority)

Step 2.2: Element Classification

Implement classify_element() function (deferred - existing detection sufficient)
Add position-based classification (header/footer/body) (covered by existing _detect_headers_footers())
Add font-size-based classification (title detection) (covered by existing logic)
Add page number pattern detection via _is_page_number()
Preserve classification in the _element_type element metadata field (deferred)

Step 2.3: Element Filtering

Implement filter_elements() function (implemented as _filter_page_numbers())
Add configuration options for filtering (page_numbers, headers, footers)
Add logging for filtered elements
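The filtering options could be grouped as below. `FilterConfig`, its default values, and the element dict shape (`{"type": ..., "text": ...}`) are all assumptions made for illustration; the task list only names the three option categories.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("pdf_preprocess")  # hypothetical logger name


@dataclass
class FilterConfig:
    """Hypothetical switches for which element types to drop."""
    page_numbers: bool = True
    headers: bool = False
    footers: bool = False


def filter_elements(elements: list[dict], config: FilterConfig) -> list[dict]:
    """Return the elements whose type is not disabled by the config."""
    dropped_types = set()
    if config.page_numbers:
        dropped_types.add("page_number")
    if config.headers:
        dropped_types.add("header")
    if config.footers:
        dropped_types.add("footer")

    kept = []
    for element in elements:
        if element.get("type") in dropped_types:
            logger.debug("Filtered %s element: %r",
                         element.get("type"), element.get("text"))
        else:
            kept.append(element)
    return kept
```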

Phase 3: Enhanced Extraction (P1)

Step 3.1: Bbox Preservation

Ensure all extracted elements retain bbox coordinates (already implemented)
Add bbox to UnifiedDocument element metadata
Verify bbox accuracy in generated output

Step 3.2: Garble Detection

Implement calculate_garble_rate() function
Detect (cid:xxxx) patterns
Detect replacement characters (U+FFFD)
Detect Private Use Area characters
Add garble rate to page metadata
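The three garble signals above can be combined into a single rate. This sketch assumes the rate is the fraction of characters in the text that belong to a garble pattern, with each `(cid:N)` token counted by its full length; the actual definition in `calculate_garble_rate()` may differ.

```python
import re

_CID_RE = re.compile(r"\(cid:\d+\)")


def _is_private_use(ch: str) -> bool:
    """True for code points in the Unicode Private Use Areas."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary PUA-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary PUA-B
    )


def calculate_garble_rate(text: str) -> float:
    """Fraction of characters that look like failed glyph mappings."""
    if not text:
        return 0.0
    cid_chars = sum(len(match) for match in _CID_RE.findall(text))
    remainder = _CID_RE.sub("", text)
    bad_chars = sum(
        1 for ch in remainder if ch == "\ufffd" or _is_private_use(ch)
    )
    return (cid_chars + bad_chars) / len(text)
```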

Step 3.3: OCR Fallback

Implement should_fallback_to_ocr() decision function
Add configuration option ocr_fallback_threshold (default 0.1)
Add get_pages_needing_ocr() interface for callers
Add get_extraction_quality_report() for quality metrics
Add logging for fallback decisions
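The fallback decision reduces to a threshold comparison on the per-page garble rate. In this sketch the comparison is strict (`>`) and page rates are passed as a dict of page number to rate; both are assumptions, as is the exact signature of each function.

```python
def should_fallback_to_ocr(garble_rate: float, threshold: float = 0.1) -> bool:
    """Decide whether a page's text layer is too garbled to trust."""
    return garble_rate > threshold


def get_pages_needing_ocr(
    page_garble_rates: dict[int, float], threshold: float = 0.1
) -> list[int]:
    """Return the sorted page numbers whose garble rate exceeds the threshold."""
    return sorted(
        page
        for page, rate in page_garble_rates.items()
        if should_fallback_to_ocr(rate, threshold)
    )
```

For example, with rates `{1: 0.0, 2: 0.35, 3: 0.1}` and the default threshold, only page 2 would be routed to OCR, since page 3 sits exactly at the threshold.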

Phase 4: GS Distillation - Exception Handler (P2)

Step 0: GS Repair (Optional)

Implement should_trigger_gs_repair() trigger detection
Implement repair_pdf_with_gs() function
Add -dDetectDuplicateImages=true option
Add temporary file handling for repaired PDF
Implement is_ghostscript_available() check
Add extract_with_repair() method
Add fallback to normal extraction if GS not available
Add logging for GS repair actions
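A rough sketch of the repair path, assuming Ghostscript is installed as `gs` on the PATH (on Windows the executable is typically named differently). `build_gs_repair_command` and the 60-second timeout are hypothetical additions; `-dDetectDuplicateImages=true` is the option named above, passed to the `pdfwrite` device.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def is_ghostscript_available(executable: str = "gs") -> bool:
    """Check whether the Ghostscript executable can be found on PATH."""
    return shutil.which(executable) is not None


def build_gs_repair_command(src: Path, dst: Path, executable: str = "gs") -> list[str]:
    """Build a command that rewrites src through the pdfwrite device."""
    return [
        executable,
        "-dBATCH", "-dNOPAUSE", "-dQUIET", "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-dDetectDuplicateImages=true",
        f"-sOutputFile={dst}",
        str(src),
    ]


def repair_pdf_with_gs(src: Path, timeout: float = 60.0) -> Optional[Path]:
    """Return the path of a repaired temporary PDF, or None if repair fails."""
    if not is_ghostscript_available():
        return None  # caller falls back to normal extraction
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    tmp.close()
    dst = Path(tmp.name)
    try:
        subprocess.run(
            build_gs_repair_command(src, dst),
            check=True, timeout=timeout, capture_output=True,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        dst.unlink(missing_ok=True)
        return None
    return dst
```

The caller owns the returned temporary file and should delete it once extraction from the repaired PDF has finished.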

Testing

Unit Tests

Test white-out detection with synthetic PDF
Test garble rate calculation
Test element classification accuracy
Test page number pattern detection

Integration Tests

Test with demo_docs/edit.pdf (3 pages)
Test with demo_docs/edit2.pdf (1 page)
Test with demo_docs/edit3.pdf (2 pages)
Test quality report generation
Test GS availability check
Test end-to-end pipeline with real documents

Regression Tests

Verify existing clean PDFs produce same output
Benchmark performance (target: <100 ms overhead per page)
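One way to check the per-page overhead budget is to time the same pages through a baseline extractor and through the preprocessing pipeline, then compare. The helper name and the callable-per-page interface are assumptions of this sketch.

```python
import time
from typing import Callable, Iterable


def per_page_overhead_ms(
    baseline: Callable[[object], object],
    candidate: Callable[[object], object],
    pages: Iterable[object],
    repeats: int = 3,
) -> float:
    """Average extra milliseconds per page that candidate costs over baseline.

    Each function is run over all pages `repeats` times and the fastest
    run is used, which reduces noise from unrelated system activity.
    """
    page_list = list(pages)
    if not page_list:
        return 0.0

    def best_total_seconds(fn: Callable[[object], object]) -> float:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for page in page_list:
                fn(page)
            best = min(best, time.perf_counter() - start)
        return best

    delta = best_total_seconds(candidate) - best_total_seconds(baseline)
    return delta * 1000.0 / len(page_list)
```

A regression test could then assert that the returned value stays below 100 for the documents in demo_docs.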
