issues/5-026-optimize-chronological-html-generation-performance.md
Issue 5-026: Optimize Chronological HTML Generation Performance
Current Behavior
- The chronological.html file generation is extremely slow (times out after 2 minutes)
- Generated file is 12MB with 93,751 lines containing what appears to be PDF binary data mixed with HTML
- Browser performance is severely impacted when loading the page
- The flat-html-generator.lua script processes 6,580+ poems synchronously
- Generation of complete site (13,680+ pages) times out and cannot complete
Intended Behavior
- Generate chronological.html efficiently within 30 seconds
- Produce clean HTML output without embedded binary data or PDF content
- Create browser-friendly HTML that loads quickly
- Support incremental/parallel generation for better performance
- Generate all required pages (chronological, similar, unique) successfully
Suggested Implementation Steps
1. Fix Content Corruption Issue
The current output contains PDF binary data instead of poem content:
-- Check format_content_with_warnings function
-- Ensure poem.content is properly extracted text, not binary data
-- Verify apply_markdown_formatting doesn't corrupt content
2. Implement Chunked/Paginated Generation
-- {{{ function generate_chronological_chunks
local function generate_chronological_chunks(poems_data, chunk_size)
chunk_size = chunk_size or 100
local chunks = {}
for i = 1, #poems_data.poems, chunk_size do
local chunk_end = math.min(i + chunk_size - 1, #poems_data.poems)
table.insert(chunks, {
start = i,
finish = chunk_end,
poems = {table.unpack(poems_data.poems, i, chunk_end)}
})
end
return chunks
end
-- }}}
3. Add Progress Tracking and Logging
-- {{{ function generate_with_progress
local function generate_with_progress(poems_data, output_dir)
local total = #poems_data.poems
local processed = 0
for i, poem in ipairs(poems_data.poems) do
-- Process poem
processed = processed + 1
-- Log progress every 100 poems
if processed % 100 == 0 then
utils.log_info(string.format("Progress: %d/%d poems (%.1f%%)",
processed, total, (processed/total)*100))
end
end
end
-- }}}
4. Optimize Memory Usage
-- Stream writing instead of concatenating huge strings
local function write_chronological_streaming(poems_data, output_file)
local file = io.open(output_file, "w")
-- Write header
file:write(HTML_HEADER)
-- Stream poem content
for _, poem in ipairs(poems_data) do
local poem_html = format_single_poem(poem)
file:write(poem_html)
file:flush() -- Periodic flush to free memory
end
-- Write footer
file:write(HTML_FOOTER)
file:close()
end
5. Implement Parallel Generation
-- Generate similarity/unique pages in parallel using coroutines
local function generate_pages_parallel(poems_data, output_dir)
local tasks = {}
-- Create tasks for each poem's pages
for _, poem in ipairs(poems_data.poems) do
table.insert(tasks, coroutine.create(function()
generate_similarity_page(poem, output_dir)
generate_unique_page(poem, output_dir)
end))
end
-- Execute tasks with controlled concurrency
local max_concurrent = 10
execute_tasks_with_limit(tasks, max_concurrent)
end
6. Add Caching Layer
-- Cache processed poem HTML to avoid reprocessing
local poem_html_cache = {}
local function get_poem_html_cached(poem)
local cache_key = tostring(poem.id)
if not poem_html_cache[cache_key] then
poem_html_cache[cache_key] = format_poem_html(poem)
end
return poem_html_cache[cache_key]
end
7. Create Minimal Test Mode
-- Add option to generate with subset for testing
local function generate_test_site(poem_limit)
poem_limit = poem_limit or 100
local test_poems = {
poems = {table.unpack(poems_data.poems, 1, poem_limit)}
}
return generate_site(test_poems)
end
Performance Targets
- Chronological index generation: < 30 seconds for 7,000 poems
- Complete site generation: < 5 minutes for all 13,680+ pages
- Memory usage: < 500MB during generation
- File sizes: Chronological.html < 2MB (currently 12MB)
- Browser load time: < 3 seconds for chronological.html
Files to Modify
/src/flat-html-generator.lua- Main generation logic/src/html-generator/template-engine.lua- Template processing/libs/utils.lua- Add streaming/chunking utilities- Create new
/src/parallel-generator.luafor concurrent generation
Testing Requirements
- Verify generated HTML contains actual poem text, not binary data
- Confirm all navigation links work correctly
- Test browser performance with generated pages
- Validate generation completes within time limits
- Check memory usage stays within constraints
- Ensure incremental generation produces identical output
Priority
HIGH - Site generation is currently broken and unusable due to performance issues
Dependencies
- poems.json data file must be properly formatted
- Similarity matrices must be pre-calculated
- System must have sufficient memory for caching
Related Issues
- 5-025: Similarity matrix optimization (may help with memory usage)
- 5-013: Flat HTML recreation (related to output format)
Notes
The core issue appears to be that the generator is including raw PDF/binary data in the HTML output instead of extracted poem text. This needs to be fixed first before optimizing performance. The 12MB file size and browser crashes are symptoms of this data corruption issue.