fix: improve Office document processing with Direct track

- Force Office documents (PPTX, DOCX, XLSX) to use Direct track after
  LibreOffice conversion, since converted PDFs always have extractable text
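- Fix PDF generator to not exclude text in image regions for the Direct
  (and Hybrid) track, allowing text to render on top of background images
  (critical for PPT); on the OCR track, images covering more than 70% of
  the page are treated as backgrounds and no longer exclude text
- Increase file_type column from VARCHAR(50) to VARCHAR(100) to support
  long MIME types such as the PPTX type
  (application/vnd.openxmlformats-officedocument.presentationml.presentation,
  73 characters)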
- Remove reference to non-existent total_images metadata attribute

This significantly improves processing time for Office documents
(from ~170s OCR to ~10s Direct) while preserving text quality.
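
The image-exclusion rule described above can be sketched in isolation as follows. This is a minimal illustration, not the code in PDFGeneratorService: `Track`, `BBox`, and `should_exclude_text_under_image` are hypothetical stand-ins, and only the 0.7 coverage threshold and the Direct/Hybrid check are taken from the diff.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Track(Enum):
    """Hypothetical stand-in for the project's ProcessingTrack enum."""
    OCR = "ocr"
    DIRECT = "direct"
    HYBRID = "hybrid"


@dataclass
class BBox:
    """Hypothetical bounding box with the same field names as in the diff."""
    x0: float
    y0: float
    x1: float
    y1: float


# Coverage ratio above which an image is treated as a page background
# (value taken from the diff).
BACKGROUND_COVERAGE = 0.7


def should_exclude_text_under_image(
    track: Track,
    bbox: Optional[BBox],
    page_width: float,
    page_height: float,
) -> bool:
    """Return True if text overlapping this image should be suppressed."""
    # Direct/Hybrid: text comes from the PDF text layer, so it is never
    # a duplicate of pixels inside the image. Always render it on top.
    if track in (Track.DIRECT, Track.HYBRID):
        return False

    # OCR: a near full-page image is most likely a slide background,
    # so text over it should still be rendered.
    page_area = page_width * page_height
    if bbox is not None and page_area > 0:
        elem_area = (bbox.x1 - bbox.x0) * (bbox.y1 - bbox.y0)
        if elem_area / page_area > BACKGROUND_COVERAGE:
            return False

    # OCR with a smaller (or unknown-size) image: keep excluding text,
    # since it may have been recognized from the image itself.
    return True
```

For a 960x540 slide, a full-page image on the OCR track is not excluded, while a 100x100 logo still is; on the Direct or Hybrid track nothing is excluded regardless of size.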

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-30 16:22:04 +08:00
parent 6806fff1d5
commit 87dc97d951
5 changed files with 86 additions and 25 deletions


@@ -854,6 +854,9 @@ class PDFGeneratorService:
# FIX: Collect exclusion regions (tables, images) to prevent duplicate rendering
regions_to_avoid = []
# Calculate page area for background detection
page_area = current_page_width * current_page_height
for element in page.elements:
    if element.type == ElementType.TABLE:
        table_elements.append(element)
@@ -867,6 +870,29 @@ class PDFGeneratorService:
    # Charts often have large bounding boxes that include text labels
    # which should be rendered as selectable text on top
    if element.type in [ElementType.IMAGE, ElementType.FIGURE, ElementType.LOGO, ElementType.STAMP]:
        # Check if this is the Direct or Hybrid track (text from PDF text layer, not OCR)
        is_direct = (self.current_processing_track == ProcessingTrack.DIRECT or
                     self.current_processing_track == ProcessingTrack.HYBRID)
        if is_direct:
            # Direct track: text is from PDF text layer, not OCR'd from images
            # Don't exclude any images - text should be rendered on top
            # This is critical for Office documents with background images
            logger.debug(f"Direct track: not excluding {element.element_id} from text regions")
            continue
        # OCR track: Skip full-page background images from exclusion regions
        # Smaller images that might contain OCR'd text should still be excluded
        if element.bbox:
            elem_area = (element.bbox.x1 - element.bbox.x0) * (element.bbox.y1 - element.bbox.y0)
            coverage_ratio = elem_area / page_area if page_area > 0 else 0
            # If image covers >70% of page, it's likely a background - don't exclude text
            if coverage_ratio > 0.7:
                logger.debug(f"OCR track: skipping background image {element.element_id} from exclusion "
                             f"(covers {coverage_ratio*100:.1f}% of page)")
                continue
        regions_to_avoid.append(element)
    elif element.type == ElementType.LIST_ITEM:
        list_elements.append(element)