fix: resolve table/image overlap and missing images in Direct track PDF generation

egg/OCR

This commit fixes two critical rendering issues in Direct track PDF generation
that were reported by the user after the span-based rendering fixes.

## Issue 1: Table Text Overlap (表格跟文字重疊)

**Problem**: Tables rendered with duplicate text appearing on top because
DirectExtractionEngine extracts table content as both TABLE elements (with
structure) and separate TEXT elements (individual text blocks), causing
PDFGeneratorService to render both and create overlaps.

**Solution**: Implemented overlap filtering mechanism with area-based detection

**Changes**:
- Added `_is_element_inside_regions()` method in PDFGeneratorService
  - Uses overlap ratio detection (50% threshold) instead of strict containment
  - Handles cases where text blocks are larger than detected regions
  - Algorithm: filters element if ≥50% of its area overlaps with table/image bbox

- Modified `_generate_direct_track_pdf()` to:
  - Collect exclusion regions (tables + images) before rendering
  - Check each text/list element for overlap before drawing
  - Skip elements that significantly overlap with exclusion regions

**Evidence**:
- Test case: "PRODUCT DESCRIPTION" text block overlaps 74.5% with table
- File size reduced by 545 bytes (-3.8%) from filtered elements
- E2E tests passed: test_2_4_1_simple_tables, test_2_4_2_complex_tables
- User confirmed: "表格問題看起來處理好了" ✓

## Issue 2: Missing Images (圖片消失)

**Problem**: Images not rendering in generated PDFs because `extract()` was
called without `output_dir` parameter, causing images to not be saved to
filesystem, resulting in missing `saved_path` in element content.

**Solution**: Auto-create default output directory for image extraction

**Changes**:
- Modified `DirectExtractionEngine.extract()` to:
  - Auto-create `storage/results/{document_id}/` when output_dir not provided
  - Ensures images always saved when enable_image_extraction=True
  - Uses short UUID (8 chars) for cleaner directory names
  - Maintains backward compatibility (existing calls still work)

**Evidence**:
- Image extraction: 2/2 images saved to storage/results/
- Image files: 5,320 + 4,945 = 10,265 bytes total
- PDF file size: 13,627 → 26,643 bytes (+13,016 bytes, +95.5%)
- PyMuPDF verification: 2 images embedded in page 1
- E2E tests passed: test_1_3_2_direct_track_image_rendering, test_1_3_3_verify_image_paths

## Technical Details

**Overlap Filtering Algorithm**:
```
For each text/list element:
  For each table/image region:
    Calculate overlap_area = intersection(element_bbox, region_bbox)
    Calculate overlap_ratio = overlap_area / element_area
    If overlap_ratio ≥ 0.5: SKIP element (inside region)
```

**Key Advantages**:
- Area-based vs strict containment (handles larger text blocks)
- Configurable threshold (default 50%, adjustable if needed)
- Preserves reading order and layout
- No breaking changes to existing code

## Test Results

**E2E Test Suite**: 6/8 passed (2 OCR track timeouts unrelated to these fixes)
- ✅ test_1_3_2_direct_track_image_rendering
- ✅ test_1_3_3_verify_image_paths
- ✅ test_2_4_1_simple_tables
- ✅ test_2_4_2_complex_tables
- ✅ test_4_4_1_compare_direct_with_original

**File Size Evidence**:
- Text-only (no images): 13,627 bytes
- With images (both fixes): 26,643 bytes
- Difference: +13,016 bytes (+95.5%) confirming image inclusion

**Visual Quality**:
- Tables render without text overlay ✓
- Images embedded correctly (2/2) ✓
- Text outside regions still renders ✓
- No duplicate rendering ✓

## Files Changed

- backend/app/services/pdf_generator_service.py
  - Added _is_element_inside_regions() (lines 592-642)
  - Modified _generate_direct_track_pdf() (lines 697-766)

- backend/app/services/direct_extraction_engine.py
  - Modified extract() (lines 78-84)

- backend/tests/e2e/TEST_RESULTS_FINAL_FIX.md
  - Comprehensive test documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This commit is contained in:

egg

2025-11-24 16:31:28 +08:00

parent 8333182879

commit 108784a270

3 changed files with 537 additions and 6 deletions

									
										13

backend/app/services/direct_extraction_engine.py
									
												View File
												
				@@ -63,17 +63,26 @@ class DirectExtractionEngine:

				        Args:

				            file_path: Path to PDF file

				            output_dir: Optional directory to save extracted images

				            output_dir: Optional directory to save extracted images.

				                       If not provided, creates a temporary directory in storage/results/{document_id}/

				        Returns:

				            UnifiedDocument with extracted content

				        """

				        start_time = datetime.now()

				        document_id = str(uuid.uuid4())

				        document_id = str(uuid.uuid4())[:8]  # Short ID for cleaner paths

				        try:

				            doc = fitz.open(str(file_path))

				            # If no output_dir provided, create default directory for image extraction

				            if output_dir is None and self.enable_image_extraction:

				                # Create temporary directory in storage/results

				                default_output_dir = Path("storage/results") / document_id

				                default_output_dir.mkdir(parents=True, exist_ok=True)

				                output_dir = default_output_dir

				                logger.debug(f"Created default output directory: {output_dir}")

				            # Extract document metadata

				            metadata = self._extract_metadata(file_path, doc, start_time)

fix: resolve table/image overlap and missing images in Direct track PDF generation

13 backend/app/services/direct_extraction_engine.py Unescape Escape View File

13

backend/app/services/direct_extraction_engine.py

View File