feat: implement Phase 3 list formatting for Direct track

Add list rendering for the Direct track, with automatic list detection and formatting:

**Task 6.1: List Element Detection**
- Detect LIST_ITEM elements by type (element.type == ElementType.LIST_ITEM)
- Extract list_level from element metadata (lines 1566-1567)
- Determine list type via regex pattern matching:
  - Ordered lists: ^\d+[\.\)]\s (e.g., "1. ", "2) ")
  - Unordered lists: ^[•·▪▫◦‣⁃]\s (various bullet symbols)
- Parse and extract list markers from text content (lines 1571-1588)
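
A minimal sketch of the detection step described above, not the code in pdf_generator_service.py. The two regex patterns are quoted from this message; the names `ORDERED_RE`, `UNORDERED_RE`, and `split_list_marker` are hypothetical and chosen for illustration only.

```python
import re
from typing import Optional, Tuple

# Patterns quoted from the commit message; the constant names are hypothetical.
ORDERED_RE = re.compile(r"^\d+[\.\)]\s")     # "1. ", "2) "
UNORDERED_RE = re.compile(r"^[•·▪▫◦‣⁃]\s")   # various bullet symbols


def split_list_marker(text: str) -> Tuple[Optional[str], str, str]:
    """Return (list_type, marker, remaining_text) for one list item.

    list_type is "ordered", "unordered", or None when no marker is found.
    """
    match = ORDERED_RE.match(text)
    if match:
        return "ordered", match.group(0), text[match.end():]
    match = UNORDERED_RE.match(text)
    if match:
        return "unordered", match.group(0), text[match.end():]
    # No recognizable marker: leave the text untouched.
    return None, "", text
```

Returning the marker separately from the remaining text lets the renderer preserve the original numbering for ordered items while replacing any bullet symbol with a standard one.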

**Task 6.2: List Rendering**
- Add list markers to first line of each item:
  - Ordered: Preserve original numbering (e.g., "1. ")
  - Unordered: Standardize to bullet "• "
- Remove original markers from text content
- Apply list indentation: 20pt per nesting level (lines 1594-1598)
- Combine list indent with existing paragraph indent
- List spacing: Inherited from bbox-based layout (spacing_before/after)
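
The rendering rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `list_indent` and `render_list_lines` are hypothetical helpers, and the sketch assumes that `list_level` 0 is the top level and receives no extra indent.

```python
from typing import List

LIST_INDENT_PT = 20.0  # 20pt per nesting level, as stated in the commit message
BULLET = "• "          # standardized marker for unordered items


def list_indent(list_level: int, paragraph_indent: float = 0.0) -> float:
    """Combine the list indent with any existing paragraph indent (in points).

    Assumption: level 0 is the top level and adds no extra indent.
    """
    return paragraph_indent + LIST_INDENT_PT * max(list_level, 0)


def render_list_lines(lines: List[str], list_type: str, marker: str) -> List[str]:
    """Prepend the marker to the first line only; continuation lines stay as-is.

    `lines` is the item text with its original marker already removed.
    """
    if not lines:
        return []
    # Ordered items keep their original numbering; unordered items get "• ".
    prefix = marker if list_type == "ordered" else BULLET
    return [prefix + lines[0]] + list(lines[1:])
```

For example, `list_indent(2, 10.0)` gives 50.0, and a two-line unordered item whose original marker was "▪ " is rendered with "• " on its first line only.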

**Implementation Details**
- Lines 1565-1598: List detection and indentation logic
- Lines 1629-1632: Prepend list marker to first line (rendered_line)
- Lines 1635-1676: Update all text width calculations to use rendered_line
- Lines 1688-1692: Enhanced logging with list type and level

**Technical Notes**
- Direct track only (OCR track has no list metadata)
- Integrates with existing alignment and indentation system
- Preserves line breaks and multi-line list items
- Works with all text alignment modes (left/center/right/justify)

**Modified Files**
- backend/app/services/pdf_generator_service.py
  - Added import re for regex pattern matching
  - Lines 1565-1598: List detection and indentation
  - Lines 1629-1676: List marker rendering
  - Lines 1688-1692: Enhanced debug logging
- openspec/changes/pdf-layout-restoration/tasks.md
  - Marked Task 6.1 (all subtasks) as completed
  - Marked Task 6.2 (all subtasks) as completed
  - Added implementation line references

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-24 09:54:15 +08:00
parent e1e97c54cf
commit ad879d48e5
2 changed files with 52 additions and 14 deletions


@@ -97,15 +97,15 @@
 - [x] 5.3.4 Justify alignment with word spacing distribution
 - [x] 5.3.5 OCR track: left-aligned only (no StyleInfo available)
-### 6. List Formatting
-- [ ] 6.1 Detect list elements
-- [ ] 6.1.1 Identify list items from metadata
-- [ ] 6.1.2 Determine list type (ordered/unordered)
-- [ ] 6.1.3 Extract indent level
-- [ ] 6.2 Render lists with proper formatting
-- [ ] 6.2.1 Add bullets/numbers
-- [ ] 6.2.2 Apply indentation
-- [ ] 6.2.3 Maintain list spacing
+### 6. List Formatting (Direct track only)
+- [x] 6.1 Detect list elements from Direct track
+- [x] 6.1.1 Identify LIST_ITEM elements (element.type == ElementType.LIST_ITEM)
+- [x] 6.1.2 Determine list type via regex (ordered: ^\d+[\.\)], unordered: ^[•·▪▫◦‣⁃])
+- [x] 6.1.3 Extract indent level from metadata (list_level, lines 1567-1598)
+- [x] 6.2 Render lists with proper formatting
+- [x] 6.2.1 Add bullets/numbers as list markers (lines 1571-1588, prepended to first line)
+- [x] 6.2.2 Apply indentation (20pt per level, lines 1594-1598)
+- [x] 6.2.3 Maintain list spacing (inherent in bbox-based layout, spacing_before/after)
 ### 7. Span-Level Rendering (Advanced)
 - [ ] 7.1 Extract span information from Direct track