OCR/services at d6387adbd163625c690c7c21af00d250b21d0a0d - OCR - ZHAOI

egg/OCR

egg d6387adbd1 feat: add black/white covering image detection

Implements detection of embedded images used for redaction/covering:
- Analyzes embedded images to find those that are mostly black (avg RGB <= 30) or mostly white (avg RGB >= 245)
- Uses PIL to efficiently sample image colors
- Gets image position on page via get_image_rects()
- Integrates with existing preprocessing pipeline
- Adds covering_images to page metadata and quality report
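
The thresholds listed above can be sketched as a small classifier. This is a minimal illustration, not the code in direct_extraction_engine.py: the function name `classify_covering`, the constant names, and the choice to average all three channels over all sampled pixels are assumptions. Only the cut-offs 30 and 245 come from the commit message.

```python
from statistics import fmean
from typing import Iterable, Optional, Sequence

# Thresholds taken from the commit message; the names are hypothetical.
BLACK_MAX_AVG = 30
WHITE_MIN_AVG = 245


def classify_covering(pixels: Iterable[Sequence[int]]) -> Optional[str]:
    """Return "black", "white", or None for a sample of RGB pixels.

    Assumption: "avg RGB" means the mean of every R, G and B value
    across the sampled pixels. Any alpha channel is ignored.
    """
    values = [channel for pixel in pixels for channel in pixel[:3]]
    if not values:
        return None  # nothing sampled, so nothing to classify
    avg = fmean(values)
    if avg <= BLACK_MAX_AVG:
        return "black"
    if avg >= WHITE_MIN_AVG:
        return "white"
    return None
```

With Pillow, the sample could plausibly be produced by downscaling the embedded image first, for example `Image.open(stream).convert("RGB").resize((16, 16)).getdata()`, so that only a few hundred pixels are averaged; how the commit actually samples colors is not shown here.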

Detection results:
- demo_docs/edit3.pdf: 10 black covering images detected (7 on P1, 3 on P2)

Quality report now includes:
- total_covering_images count
- Per-page covering_images details with bbox, color_type, size
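
One way the report fields named above might be aggregated is sketched below. The field names `total_covering_images`, `covering_images`, `bbox`, `color_type` and `size` come from the commit message; the function name `build_covering_report` and the `page` and `pages` keys are hypothetical, and the real report layout may differ.

```python
from typing import Any, Dict, List


def build_covering_report(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize covering images across pages.

    Each input page is assumed to look like:
        {"page": 1, "covering_images": [
            {"bbox": [x0, y0, x1, y1], "color_type": "black", "size": [w, h]},
        ]}
    """
    per_page = []
    total = 0
    for page in pages:
        # Pages without the key are treated as having no covering images.
        images = page.get("covering_images", [])
        total += len(images)
        per_page.append({"page": page["page"], "covering_images": images})
    return {"total_covering_images": total, "pages": per_page}
```

Fed two pages carrying 7 and 3 entries, as reported for demo_docs/edit3.pdf, this sketch would yield `total_covering_images` equal to 10.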

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-12-04 07:42:55 +08:00

| File | Last commit | Date |
| --- | --- | --- |
| `__init__.py` | feat: add unified JSON export with standardized schema | 2025-11-19 08:36:24 +08:00 |
| `admin_service.py` | fix: migrate UI to V2 API and fix admin dashboard | 2025-11-17 08:55:50 +08:00 |
| `audit_service.py` | feat: complete external auth V2 migration with advanced features | 2025-11-14 17:19:43 +08:00 |
| `cv_table_detector.py` | feat: add table detection options and scan artifact removal | 2025-11-30 13:21:50 +08:00 |
| `dify_client.py` | refactor: centralize DIFY settings in config.py and cleanup env files | 2025-12-02 17:50:47 +08:00 |
| `direct_extraction_engine.py` | feat: add black/white covering image detection | 2025-12-04 07:42:55 +08:00 |
| `document_type_detector.py` | fix: improve Office document processing with Direct track | 2025-11-30 16:22:04 +08:00 |
| `external_auth_service.py` | feat: complete external auth V2 migration with advanced features | 2025-11-14 17:19:43 +08:00 |
| `file_access_service.py` | feat: complete external auth V2 migration with advanced features | 2025-11-14 17:19:43 +08:00 |
| `gap_filling_service.py` | feat: add table detection options and scan artifact removal | 2025-11-30 13:21:50 +08:00 |
| `layout_preprocessing_service.py` | feat: add table detection options and scan artifact removal | 2025-11-30 13:21:50 +08:00 |
| `memory_manager.py` | feat: implement hybrid image extraction and memory management | 2025-11-26 10:56:22 +08:00 |
| `ocr_service_original.py` | feat: integrate dual-track processing into OCR service | 2025-11-19 07:29:06 +08:00 |
| `ocr_service.py` | feat: add table detection options and scan artifact removal | 2025-11-30 13:21:50 +08:00 |
| `ocr_to_unified_converter.py` | feat: add table detection options and scan artifact removal | 2025-11-30 13:21:50 +08:00 |
| `office_converter.py` | feat: migrate to WSL Ubuntu native development environment | 2025-11-13 21:00:42 +08:00 |
| `pdf_generator_service.py` | fix: improve PDF layout generation for Direct track | 2025-12-03 14:55:00 +08:00 |
| `pdf_generator.py` | first | 2025-11-12 22:53:17 +08:00 |
| `pp_structure_debug.py` | feat: simplify layout model selection and archive proposals | 2025-11-27 13:27:00 +08:00 |
| `pp_structure_enhanced.py` | feat: add table detection options and scan artifact removal | 2025-11-30 13:21:50 +08:00 |
| `preprocessor.py` | first | 2025-11-12 22:53:17 +08:00 |
| `service_pool.py` | feat: implement hybrid image extraction and memory management | 2025-11-26 10:56:22 +08:00 |
| `task_service.py` | feat: complete external auth V2 migration with advanced features | 2025-11-14 17:19:43 +08:00 |
| `translation_service.py` | refactor: centralize DIFY settings in config.py and cleanup env files | 2025-12-02 17:50:47 +08:00 |
| `unified_document_exporter.py` | feat: add unified JSON export with standardized schema | 2025-11-19 08:36:24 +08:00 |