fix: properly implement list formatting with sequential numbering and grouping

Fix critical issues in Task 6 list formatting implementation:

**Issue 1: LIST_ITEM Elements Not Rendered**
- Problem: LIST_ITEM type not included in is_text property
- Fix: Separate list_elements from text_elements (lines 626, 636-637)
- Impact: List items were completely ignored in rendering
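
A minimal sketch of the separation, assuming a hypothetical `Element` dataclass and `ElementType` enum in place of the project's real types (neither the actual `is_text` property nor the element classes appear in this commit):

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ElementType(Enum):
    # Hypothetical subset; the real enum presumably has more members.
    TEXT = auto()
    LIST_ITEM = auto()
    IMAGE = auto()


@dataclass
class Element:
    type: ElementType
    text: str = ""
    metadata: dict = field(default_factory=dict)


def split_elements(elements):
    """Partition elements so LIST_ITEMs get their own rendering pass."""
    text_elements = [e for e in elements if e.type == ElementType.TEXT]
    list_elements = [e for e in elements if e.type == ElementType.LIST_ITEM]
    return text_elements, list_elements
```

Keeping `list_elements` as its own collection means list items no longer depend on a text-only predicate to be picked up for rendering.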

**Issue 2: Missing Sequential Numbering**
- Problem: Each list item independently parsed its own number
- Fix: Implement _draw_list_elements_direct method (lines 1523-1610)
  - Groups list items by proximity (max_gap=30pt) and level
  - Maintains list_counter across items for sequential numbering
  - Starts from the original number in the first item
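
The numbering rule can be sketched as follows. This is an illustration under assumptions, not the method's actual body: `assign_sequential_markers` is a hypothetical helper, the fallback to 1 and the `N.` marker format are guesses, and only the `list_counter` name comes from the commit.

```python
import re

# Ordered-marker pattern as recorded in tasks.md, with a capture group added.
ORDERED_RE = re.compile(r"^(\d+)[\.\)]")


def assign_sequential_markers(texts):
    """Number one group of ordered items, starting from the first item's number."""
    if not texts:
        return []
    first = ORDERED_RE.match(texts[0].strip())
    # Assumption: fall back to 1 when the first item carries no number.
    list_counter = int(first.group(1)) if first else 1
    markers = []
    for _ in texts:
        markers.append(f"{list_counter}.")
        list_counter += 1  # the counter persists across items in the group
    return markers
```

For example, `assign_sequential_markers(["3. foo", "1. bar", "1. baz"])` yields `["3.", "4.", "5."]`: the numbers on later items are ignored in favour of the running counter.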

**Issue 3: Unreliable List Type Detection**
- Problem: Regex-based detection per item, not per list
- Fix: Detect type from first item in group, apply to all items
  - Store computed marker in metadata (_list_marker, _list_type)
  - Ensures consistency across entire list
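
A sketch of per-group type detection, using the two regexes recorded in tasks.md. The dict-shaped items, the helper names, and the `"unordered"` fallback are assumptions made for illustration; only the `_list_type` key comes from the commit. The sketch stores just `_list_type`; `_list_marker` would be written into the same metadata dict.

```python
import re

ORDERED_RE = re.compile(r"^\d+[\.\)]")
UNORDERED_RE = re.compile(r"^[•·▪▫◦‣⁃]")


def detect_list_type(first_text):
    """Classify a list from the text of its first item only."""
    text = first_text.strip()
    if ORDERED_RE.match(text):
        return "ordered"
    if UNORDERED_RE.match(text):
        return "unordered"
    # Assumption: treat unrecognised markers as unordered.
    return "unordered"


def apply_group_type(group):
    """Store the first item's type on every item in the group."""
    if not group:
        return group
    list_type = detect_list_type(group[0]["text"])
    for item in group:
        item["metadata"]["_list_type"] = list_type
    return group
```

Because the decision is made once per group, an item whose own text would match the other pattern still inherits the group's type.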

**Issue 4: Insufficient List Spacing Control**
- Problem: No grouping logic, relied solely on bbox positions
- Fix: Proximity-based grouping with 30pt max gap threshold
  - Groups consecutive items into lists
  - Separates lists when gap exceeds threshold or level changes
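
The grouping rule might look roughly like this. The `Item` dataclass is hypothetical, and the sketch assumes y grows downward so that the gap is the next item's `y0` minus the previous item's `y1`; the real code may use a different coordinate convention.

```python
from dataclasses import dataclass

MAX_GAP = 30.0  # points, per the commit message


@dataclass
class Item:
    x0: float
    y0: float
    y1: float
    level: int = 0


def group_list_items(items, max_gap=MAX_GAP):
    """Split items into lists wherever the gap or the level changes."""
    groups = []
    for item in sorted(items, key=lambda i: (i.y0, i.x0)):
        if groups:
            prev = groups[-1][-1]
            gap = item.y0 - prev.y1
            if gap <= max_gap and item.level == prev.level:
                groups[-1].append(item)
                continue
        # First item, gap too large, or level changed: start a new list.
        groups.append([item])
    return groups
```

Each resulting group is then treated as one list for type detection and numbering.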

**Technical Implementation**

New method: _draw_list_elements_direct (lines 1523-1610)
- Sort items by position (y0, x0)
- Group by proximity and level
- Detect list type from first item
- Assign sequential markers
- Store in metadata for _draw_text_element_direct

Updated: _draw_text_element_direct (lines 1662-1677)
- Use pre-computed _list_marker from metadata
- Simplified marker removal (just clean original markers)
- No longer needs to maintain counter per-item
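
A sketch of how the pre-computed marker could be consumed at draw time. `format_list_text` and `list_indent` are hypothetical helpers; the `_list_marker` and `list_level` keys and the 20pt-per-level indent come from the commit and tasks.md, while the combined marker regex is an assumption built from the two patterns listed there.

```python
import re

# Matches an original ordered or unordered marker at the start of the text.
ORIGINAL_MARKER_RE = re.compile(r"^\s*(?:\d+[\.\)]|[•·▪▫◦‣⁃])\s*")
INDENT_PER_LEVEL = 20.0  # points


def format_list_text(text, metadata):
    """Swap the item's original marker for the pre-computed one, if any."""
    marker = metadata.get("_list_marker")
    if marker is None:
        return text  # not a list item, leave the text untouched
    cleaned = ORIGINAL_MARKER_RE.sub("", text, count=1)
    return f"{marker} {cleaned}"


def list_indent(metadata):
    """Horizontal indent in points for the item's nesting level."""
    return INDENT_PER_LEVEL * metadata.get("list_level", 0)
```

With the marker supplied through metadata, the per-element draw call only has to strip the original marker and needs no counter of its own.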

Updated: _generate_direct_track_pdf (lines 622-663)
- Separate list_elements collection
- Call _draw_list_elements_direct before text rendering
- Updated logging to show list item count

**Modified Files**
- backend/app/services/pdf_generator_service.py
  - Lines 626, 636-637: Separate list_elements
  - Lines 644-646: Updated logging
  - Lines 658-659: Add list rendering layer
  - Lines 1523-1610: New _draw_list_elements_direct method
  - Lines 1662-1677: Simplified list detection in _draw_text_element_direct
- openspec/changes/pdf-layout-restoration/tasks.md
  - Updated Task 6.1 subtasks with accurate implementation details
  - Updated Task 6.2 subtasks with grouping and numbering logic

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-24 09:59:00 +08:00
parent ad879d48e5
commit 1ec186f680
2 changed files with 119 additions and 28 deletions


@@ -99,13 +99,16 @@
 ### 6. List Formatting (Direct track only)
 - [x] 6.1 Detect list elements from Direct track
-  - [x] 6.1.1 Identify LIST_ITEM elements (element.type == ElementType.LIST_ITEM)
-  - [x] 6.1.2 Determine list type via regex (ordered: ^\d+[\.\)], unordered: ^[•·▪▫◦‣⁃])
-  - [x] 6.1.3 Extract indent level from metadata (list_level, lines 1567-1598)
+  - [x] 6.1.1 Identify LIST_ITEM elements (separate from text_elements, lines 636-637)
+  - [x] 6.1.2 Group list items by proximity and level (_draw_list_elements_direct, lines 1543-1570)
+  - [x] 6.1.3 Determine list type via regex on first item (ordered/unordered, lines 1582-1590)
+  - [x] 6.1.4 Extract indent level from metadata (list_level)
 - [x] 6.2 Render lists with proper formatting
-  - [x] 6.2.1 Add bullets/numbers as list markers (lines 1571-1588, prepended to first line)
-  - [x] 6.2.2 Apply indentation (20pt per level, lines 1594-1598)
-  - [x] 6.2.3 Maintain list spacing (inherent in bbox-based layout, spacing_before/after)
+  - [x] 6.2.1 Sequential numbering across list items (list_counter, lines 1593-1602)
+  - [x] 6.2.2 Add bullets/numbers as list markers (stored in _list_marker metadata, lines 1603-1607)
+  - [x] 6.2.3 Apply indentation (20pt per level, lines 1683-1687)
+  - [x] 6.2.4 Remove original markers from text content (lines 1671-1677)
+  - [x] 6.2.5 Maintain list spacing via proximity-based grouping (max_gap=30pt, lines 1551-1563)
 ### 7. Span-Level Rendering (Advanced)
 - [ ] 7.1 Extract span information from Direct track