issues/8-002-implement-multithreaded-html-generation.md
Issue 8-002: Implement Multi-threaded HTML Page Generation
Current Behavior
The HTML page generation in src/flat-html-generator.lua is single-threaded. Generating all ~12,000 pages (6000+ similar + 6000+ different) takes a very long time because:
- Similarity pages: Fast to generate (simple sorting), but still limited by sequential I/O
- Difference pages: Slow due to O(n²) centroid-based diversity algorithm - each page requires iterating through all poems multiple times
Current generation time estimates:
- Similarity pages: ~5-10 minutes for all poems
- Difference pages: Could take hours for full generation (each page requires n iterations)
Intended Behavior
Use the effil threading library (available at /home/ritz/programming/ai-stuff/libs/lua/effil-jit for LuaJIT) to parallelize HTML page generation:
- Thread-per-page model: Each page (similar/XXX.html or different/XXX.html) generated in its own thread
- Configurable thread pool: Limit concurrent threads based on CPU cores (e.g., 8-16 threads)
- Shared read-only data: Poems, embeddings, and similarity data loaded once, shared across threads
- Independent file writes: Each thread writes to its own unique output file
Expected speedup: 8-16x on typical multi-core systems.
Implementation Steps
Step 1: Set up effil library integration ✅ COMPLETED
- [x] Add effil library path to package.cpath: `/home/ritz/programming/ai-stuff/libs/lua/effil-jit/build`
- [x] Test basic effil functionality with simple parallel task
- [x] Note: effil.thread() doesn't return values directly; use a file existence check instead
Step 2: Refactor data loading for thread safety ✅ COMPLETED
- [x] Load poems.json once in main thread
- [x] Load similarity_matrix.json once in main thread
- [x] Load poem_colors.json once in main thread
- [x] Convert data structures to effil-shareable format (effil.table)
Step 3: Implement parallel similarity page generation ✅ COMPLETED
- [x] Create worker function for single similarity page
- [x] Implement batch-based thread pool for similarity pages
- [x] Add progress reporting across batches
- [x] Verify success by checking file existence after thread completion
Step 4: Implement parallel difference page generation ✅ COMPLETED
- [x] Create worker function for single difference page (centroid-based diversity)
- [x] Implement thread pool for difference pages
- [x] Load 62MB embeddings file and share via effil.table
- [x] Test: 10 pages generated successfully, 8.5MB each with all ~7800 poems
- [x] Note: O(n²) algorithm is slow (~25 sec/page) - optimization needed
Step 5: Integrate into main pipeline
- [x] Add thread count argument (default 8)
- [x] Add `--test` flag for limited test runs
- [x] Add `--similar-only` and `--different-only` flags
- [ ] Integrate into run.sh or main.lua
Step 6: Optimize difference page generation ✅ COMPLETED (Option C)
- [ ] Option A: Limit diversity sequence to first N poems (e.g., 500)
- [ ] Option B: Use inverse similarity instead of centroid-based algorithm
- [x] Option C: Pre-compute all diversity sequences and cache to disk
- [x] Created `scripts/precompute-diversity-sequences` with thermal management
- [x] Cache stored in `assets/embeddings/EmbeddingGemma_latest/diversity_cache.json`
- [x] Batch-based threading with configurable sleep between batches
- [x] Updated `generate-html-parallel` to use cache when available
- [x] Estimated one-time computation: ~42 hours (but runs unattended)
- [x] Once cached, difference pages generate at ~10 pages/sec (same as similarity)
Cache Invalidation: The diversity_cache.json must be regenerated when:
- Similarity algorithm changes (cosine → euclidean, etc.)
- Embedding model changes
- New poems are added to the corpus
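The three invalidation conditions above could be checked mechanically if the cache carried a small metadata record. The following is a minimal sketch under that assumption: the function name `cache_is_valid` and the fields `algorithm`, `model`, and `poem_count` are hypothetical and are not taken from the actual diversity_cache.json layout.

```lua
-- Hypothetical metadata check; field names are illustrative only.
-- Returns true when the cache still matches the current configuration,
-- otherwise false plus a human-readable reason.
local function cache_is_valid(cache_meta, current)
  if cache_meta == nil then
    return false, "cache has no metadata"
  end
  if cache_meta.algorithm ~= current.algorithm then
    return false, "similarity algorithm changed"
  end
  if cache_meta.model ~= current.model then
    return false, "embedding model changed"
  end
  if cache_meta.poem_count ~= current.poem_count then
    return false, "poem count changed"
  end
  return true
end

local stored = { algorithm = "cosine", model = "EmbeddingGemma_latest", poem_count = 6641 }
local now = { algorithm = "cosine", model = "EmbeddingGemma_latest", poem_count = 6650 }
local ok, reason = cache_is_valid(stored, now)
print(ok, reason)
```

Comparing poem counts is a coarse stand-in for "new poems were added"; a replaced poem would leave the count unchanged, so a content hash may be a better signal if one is available.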
Technical Notes
effil Library Usage
-- Add to package.cpath
package.cpath = "/home/ritz/programming/ai-stuff/libs/lua/effil-jit/?.so;" .. package.cpath
local effil = require("effil")
-- Create shared data
local shared_poems = effil.table(poems_data)
-- Create thread pool (poem_ids is assumed to hold the IDs of the pages to generate)
local threads = {}
for i = 1, num_threads do
  threads[i] = effil.thread(worker_function)(shared_poems, poem_ids[i])
end
-- Wait for all threads
for _, thread in ipairs(threads) do
  thread:wait()
end
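The loop above starts one group of workers; Step 3 describes a batch-based pool that repeats this for successive groups of pages. The batching half can be sketched in plain Lua so it runs without effil. `make_batches` is a hypothetical helper name, not taken from scripts/generate-html-parallel, and the commented lines only indicate where the effil calls would go.

```lua
-- Hypothetical helper: split a list of page IDs into batches of at most
-- `batch_size`, so that no more than `batch_size` threads run at once.
local function make_batches(ids, batch_size)
  assert(batch_size >= 1, "batch_size must be at least 1")
  local batches = {}
  for i = 1, #ids, batch_size do
    local batch = {}
    for j = i, math.min(i + batch_size - 1, #ids) do
      batch[#batch + 1] = ids[j]
    end
    batches[#batches + 1] = batch
  end
  return batches
end

local batches = make_batches({ 1, 2, 3, 4, 5 }, 2)
for b, batch in ipairs(batches) do
  -- With effil loaded, each ID in the batch would get its own thread:
  --   threads[k] = effil.thread(worker_function)(shared_poems, poem_id)
  -- followed by thread:wait() on every thread before the next batch starts.
  print(("batch %d: %s"):format(b, table.concat(batch, ", ")))
end
```

Waiting for a whole batch before starting the next one keeps the pool simple, at the cost of idle threads while the slowest page in each batch finishes.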
Thread Safety Considerations
- Read-only data: Poems, embeddings, similarity matrix are read-only after loading
- File writes: Each thread writes to unique file, no contention
- Progress tracking: Use atomic counter for progress reporting
- Error handling: Catch errors per-thread, report failures at end
Performance Optimization for Difference Pages
The diversity algorithm is O(n²) per page. Consider:
- Pre-compute full diversity sequences: Generate once, cache to disk
- Batch centroid calculation: Optimize vector operations
- Early termination: For test/preview, limit to first N poems
Dependencies
- effil library compiled for LuaJIT at `/home/ritz/programming/ai-stuff/libs/lua/effil-jit`
- LuaJIT (not standard Lua 5.x)
Quality Assurance Criteria
- [ ] Generation completes without thread-related crashes
- [ ] Output files identical to single-threaded generation
- [ ] Progress reporting shows parallel activity
- [ ] Memory usage reasonable (shared data, not duplicated)
- [ ] Error in one thread doesn't kill entire generation
- [ ] Speedup of at least 4x on 8-core system
Related Issues
- Issue 8-001: Integrate complete HTML generation into pipeline
- Issue 5-026: Optimize chronological HTML generation performance
ISSUE STATUS: IN PROGRESS
Created: 2025-12-14
Phase: 8 (Website Completion)
Priority: High (blocking full generation)
Progress Summary
Completed:
- `scripts/generate-html-parallel` created using effil library
- Similarity page generation working in parallel (10 pages/sec with 4 threads in test)
- CSS-free HTML output using `<font color="">` tags
- Batch-based thread pool with progress reporting
- Difference page generation working (centroid-based diversity algorithm)
- Embeddings loading and sharing via effil.table (5.9M values)
- Option C optimization: Pre-computation with `scripts/precompute-diversity-sequences`
- Thermal management with configurable sleep between batches
- Cache-based fast path in generate-html-parallel
Performance Metrics:
- Similarity pages: 10 pages/sec (fast)
- Difference pages (on-the-fly): ~0.04 pages/sec (~25 sec/page)
- Difference pages (cached): ~10 pages/sec (same as similarity)
- Pre-computation: ~42 hours one-time cost
Remaining:
- Run pre-computation to generate diversity_cache.json
- Integration into main pipeline (run.sh or main.lua)
- Full-scale testing with all ~6000+ poems
Critical Issues Discovered (2025-12-14)
Bug 1: Index Mapping Mismatch
The precompute-diversity-sequences script has a critical bug in how it maps poem indices to embedding positions:
-- Current (BUGGY) code:
for i, poem in ipairs(poems_data.poems) do
  if poem.id and embeddings_data.embeddings[i] and embeddings_data.embeddings[i].embedding then
    for _, val in ipairs(embeddings_data.embeddings[i].embedding) do
      table.insert(all_embeddings_flat, val) -- Sequential flat index
    end
    poem_id_to_index[poem.id] = i -- BUG: Uses ORIGINAL index, not flat index!
  end
end
Problem: The flat embeddings array is built sequentially (only poems WITH embeddings), but poem_id_to_index stores the ORIGINAL poem array index. If any poems lack embeddings, the indices don't match.
Impact: Worker threads read wrong embedding data or out-of-bounds memory, causing garbage results or hangs.
Root Cause Analysis: The embedding generation system (src/similarity-engine.lua) can fail to generate embeddings for poems due to:
- Empty poem content (lines 425-432)
- Network/API errors (network_error, connection_error, empty_response, parse_error)
- Invalid embedding dimensions
Current data shows ~7,355 poems but ~6,641 embeddings = ~714 poems without embeddings.
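One way to keep the mapping aligned with the flat array is to count only the poems that actually contribute an embedding. The sketch below assumes, as the buggy code does, that embeddings are positionally aligned with the poems array; `build_flat_embeddings` is a hypothetical name, and `poems` and `embeddings` stand in for poems_data.poems and embeddings_data.embeddings.

```lua
-- Sketch of an index mapping that stays aligned with the flat array.
-- Returns the flat value array, the poem-id -> flat-index map, and the
-- number of poems that had an embedding.
local function build_flat_embeddings(poems, embeddings)
  local flat, poem_id_to_index = {}, {}
  local flat_index = 0 -- counts only poems that actually have an embedding
  for i, poem in ipairs(poems) do
    local entry = embeddings[i]
    if poem.id and entry and entry.embedding then
      flat_index = flat_index + 1
      for _, val in ipairs(entry.embedding) do
        flat[#flat + 1] = val
      end
      poem_id_to_index[poem.id] = flat_index -- sequential, matches `flat`
    end
  end
  return flat, poem_id_to_index, flat_index
end

-- Poem 20 has no embedding, so poem 30 must map to flat slot 2, not 3.
local poems = { { id = 10 }, { id = 20 }, { id = 30 } }
local embeddings = { { embedding = { 1, 2 } }, {}, { embedding = { 3, 4 } } }
local flat, index_of, count = build_flat_embeddings(poems, embeddings)

local dim = 2
local start = (index_of[30] - 1) * dim + 1
print(count, index_of[30], flat[start], flat[start + 1])
```

Workers can then treat a missing entry in the map as "no embedding" instead of reading from an unrelated offset.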
Bug 2: Catastrophic effil.table Access Overhead
The worker function accesses the shared effil.table directly for every embedding lookup:
local function get_embedding(idx)
  for j = 1, embedding_dim do
    embedding[j] = all_embeddings_flat[emb_start + j - 1] -- effil.table access!
  end
end
Performance Analysis:
- ~6,640 outer loop iterations per sequence
- Each iteration: ~3,320 average calls to `get_embedding`
- Each `get_embedding`: 768 effil.table accesses
- Plus centroid calculations accessing embeddings
Total: ~17 BILLION effil.table accesses per sequence
Each effil.table access involves cross-thread synchronization overhead. Even at 1 microsecond per access, this equals ~5 HOURS per sequence, not 25 seconds as originally estimated.
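The estimate can be reproduced from the three factors listed above. This short calculation uses only the figures already stated (6,640 iterations, 3,320 average calls, 768 dimensions, 1 microsecond per access) and leaves out the centroid calculations, so it is a lower bound.

```lua
-- Back-of-the-envelope check of the access count and runtime estimate.
local outer_iterations = 6640    -- outer loop iterations per sequence
local avg_calls = 3320           -- average get_embedding calls per iteration
local embedding_dim = 768        -- effil.table reads per get_embedding call
local seconds_per_access = 1e-6  -- assumed cost of one synchronized read

local accesses = outer_iterations * avg_calls * embedding_dim
local hours = accesses * seconds_per_access / 3600
print(("%.2f billion accesses, %.1f hours"):format(accesses / 1e9, hours))
```

This gives roughly 16.9 billion accesses and about 4.7 hours per sequence, consistent with the rounded figures quoted above.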
Recommendation
The effil library is unsuitable for this workload due to the massive number of cross-thread data accesses. Alternative approaches should be evaluated:
1. Fix and optimize effil approach: Copy effil.table to local Lua table at worker start (one-time 5.9M value copy per thread)
2. Single-threaded with progress: Accept longer runtime, add checkpointing
3. GPU compute shaders: Offload vector math to GPU via Vulkan/OpenGL compute
4. Process-based parallelism: Spawn separate processes instead of threads (no shared memory overhead)
See: docs/effil-vs-compute-shader-feasibility.md for detailed comparison.
Bug 2 Fix Applied (2025-12-20)
Solution Implemented: Option 1 - Copy effil.table to local Lua tables at worker start
All three worker functions in scripts/generate-html-parallel now copy effil.tables to local Lua tables at the beginning of execution:
- similarity_worker: Copies `all_poems_array`, `similarities_for_poem`, `poem_colors_table`
- diversity_worker: Copies `all_poems_array`, `all_embeddings_flat`, `starting_embedding_flat`, `poem_colors_table`
- cached_diversity_worker: Copies `diversity_sequence`, `all_poems_lookup`, `poem_colors_table`
Performance Impact:
- Before: ~17B synchronized effil.table accesses per diversity sequence (~5 hours each)
- After: O(n) one-time copy, then O(1) local table access
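The copy-at-start pattern can be illustrated without effil. In the sketch below, `copy_array_to_local` is a hypothetical helper; in a real worker `shared` would be an effil.table and `n` would be passed in or taken from effil.size(shared). A counting proxy stands in for the shared table so the number of "expensive" reads is visible.

```lua
-- Copy the first `n` array entries of an indexable object into a plain
-- Lua table, so later reads never touch the shared object again.
local function copy_array_to_local(shared, n)
  local copy = {}
  for i = 1, n do
    copy[i] = shared[i] -- one read of the shared object per element, done once
  end
  return copy
end

-- Stand-in for a shared table: a proxy that counts how often it is read.
local reads = 0
local backing = { 0.1, 0.2, 0.3, 0.4 }
local proxy = setmetatable({}, {
  __index = function(_, k)
    reads = reads + 1
    return backing[k]
  end,
})

local local_copy = copy_array_to_local(proxy, #backing)

-- Heavy repeated access now hits only the local table.
local sum = 0
for _ = 1, 1000 do
  for i = 1, #local_copy do
    sum = sum + local_copy[i]
  end
end
print(reads, #local_copy)
```

The trade-off is memory: every worker holds its own copy, so the 5.9M-value embeddings array is duplicated once per thread.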
Bug 3 Fix Applied (2025-12-21)
Problem: effil.atomic is nil - function doesn't exist in installed effil version
The progress display originally used effil.atomic() counters for real-time per-thread progress updates. However, the installed effil library (effil-jit build) does not include an atomic function; it exposes only thread, channel, table, and size.
Error encountered:
luajit: src/similarity-engine-parallel.lua:356: attempt to call field 'atomic' (a nil value)
Solution Implemented: File-based progress polling
Instead of atomic counters, the main thread now polls the output directory for completed files:
- Removed all `effil.atomic()` counter creation and usage
- Added `count_output_files()` function that uses `find | wc -l` for fast file counting
- Progress display now polls file count every 0.5 seconds
- Simplified from per-thread progress lines to single aggregate progress bar
Progress display format:
[██████████░░░░░░░░░░░░░░░░░░░░] 500/7000 (7.1%) │ 0.83/s │ ETA: 7819s
This approach is simpler and doesn't require any shared memory between threads.
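A line in the format shown above can be produced from just the polled file count, the target total, and the measured rate. The following is a sketch only: `format_progress` is a hypothetical name, and the real script's bar scaling and rounding may differ.

```lua
-- Build an aggregate progress line from a completed-file count.
-- `width` is the number of bar cells; the bar fills proportionally.
local function format_progress(done, total, rate, width)
  width = width or 30
  local fraction = total > 0 and math.min(done / total, 1) or 0
  local filled = math.floor(fraction * width)
  local bar = string.rep("█", filled) .. string.rep("░", width - filled)
  local eta = "?"
  if rate > 0 then
    eta = string.format("%ds", math.floor((total - done) / rate + 0.5))
  end
  return string.format("[%s] %d/%d (%.1f%%) │ %.2f/s │ ETA: %s",
    bar, done, total, fraction * 100, rate, eta)
end

local line = format_progress(50, 100, 2.0, 10)
print(line)
```

In the polling loop, `done` would come from count_output_files() every 0.5 seconds and `rate` from the change in that count over elapsed time.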
Per-Thread Progress Display (2025-12-21)
Enhancement: Per-thread progress bars using effil.channel
User requested individual progress bars for each thread instead of a single aggregate bar.
Implementation:
- Create shared `effil.channel()` before spawning threads
- Each thread sends `(thread_id, processed_count)` to channel after each poem
- Main thread drains channel with `pop(0)` (non-blocking) every 0.5 seconds
- Display one progress bar per thread, plus summary line
Progress display format:
Thread 1: [████████████░░░░░░░░] 600/1000 ( 60.0%)
Thread 2: [██████████░░░░░░░░░░] 500/1000 ( 50.0%)
Thread 3: [████████░░░░░░░░░░░░] 400/1000 ( 40.0%)
Thread 4: [██████░░░░░░░░░░░░░░] 300/1000 ( 30.0%)
─── Total: 1800/4000 (45.0%) │ 0.95/s │ ETA: 2316s
Key insight: effil.channel provides a thread-safe message queue that works across threads, unlike effil.atomic which isn't available in the effil-jit build.
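The drain step on the main thread can be sketched as follows. `drain_progress` is a hypothetical name; the function only assumes that `channel:pop(0)` returns the pushed values, or nil when nothing is queued, which is how the steps above describe the effil.channel usage. A small stand-in channel is used so the sketch runs without effil.

```lua
-- Drain every pending (thread_id, processed_count) message without
-- blocking, keep the latest count per thread, and return the grand total.
local function drain_progress(channel, progress)
  while true do
    local thread_id, processed = channel:pop(0)
    if thread_id == nil then break end
    progress[thread_id] = processed
  end
  local total = 0
  for _, count in pairs(progress) do
    total = total + count
  end
  return total
end

-- Stand-in channel (not effil): a FIFO queue with a pop method.
local fake_channel = { queue = { { 1, 10 }, { 2, 5 }, { 1, 20 } } }
function fake_channel:pop(_timeout)
  local msg = table.remove(self.queue, 1)
  if msg then return msg[1], msg[2] end
  return nil
end

local progress = {}
local total = drain_progress(fake_channel, progress)
print(total, progress[1], progress[2])
```

Because each message carries the thread's running count, dropping or coalescing older messages loses nothing; only the most recent value per thread matters for the display.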
Graceful Ctrl+C Interruption (2025-12-21)
Enhancement: Added SIGINT (Ctrl+C) signal handling for graceful shutdown
Implementation:
- Uses LuaJIT FFI to install a C signal handler for SIGINT (signal 2)
- When Ctrl+C is pressed, sets `interrupted = true` flag
- Main polling loop checks flag every 0.5 seconds
- On interrupt: breaks loop, waits 5 seconds for threads to finish current poem
- Reports partial results and reminds user they can resume
Code pattern (LuaJIT FFI signal handling):
local ffi = require("ffi")
ffi.cdef[[
typedef void (*sighandler_t)(int);
sighandler_t signal(int signum, sighandler_t handler);
]]
local SIGINT = 2
local interrupted = false
local function on_interrupt(sig)
  interrupted = true
end
-- Must keep reference to prevent garbage collection
local handler = ffi.cast("sighandler_t", on_interrupt)
ffi.C.signal(SIGINT, handler)
User experience:
💡 Press Ctrl+C to gracefully stop (threads will finish current poem)
Thread 1: [████████░░░░░░░░░░░░] 400/1000 ( 40.0%)
...
^C
⚠️ Ctrl+C detected! Waiting for threads to finish current poem...
⏳ Giving threads 5 seconds to complete current work...
⏸️ Similarity calculation interrupted by user
💡 Run again to resume - already-completed files will be skipped
TUI Menu Integration (2025-12-21)
Enhancement: Replaced simple text menu with full TUI framebuffer menu
The main menu now uses the /home/ritz/programming/ai-stuff/scripts/libs/menu.lua TUI library for:
- Interactive vim-style navigation (j/k, arrows)
- Checkbox toggles with keyboard shortcuts
- Flag/numeric input fields
- Dependency-based item enabling/disabling
- Graceful Ctrl+C handling built into TUI (returns "CTRL_C" key event)
Features:
- Action selection: Calculate vs Check Status (radio-button style)
- Options section: Force regenerate, Sleep duration, Model name
- Dependencies: Force and Sleep only enabled when Calculate is selected
- Fallback: If TUI unavailable, uses `main_text_mode()` text-based menu
Usage: Run with -I flag as before - TUI will launch automatically if available.
Fail-Fast Error Handling (2025-12-23)
Change: Replaced graceful failure (error counting + continue) with fail-fast error handling.
Rationale: Per project CLAUDE.md guidelines:
"prefer error messages and breaking functionality over fallbacks. Be sure to notify the user every time a fallback is used, and create a new issue file to resolve any fallbacks"
Silent failures lead to corrupt data. If a poem fails to process, continuing to the next one means the final dataset will be incomplete without the user realizing it. Better to fail hard with actionable diagnostics so the root cause can be identified and fixed.
Implementation:
- calculate_poem_similarities(): Now validates all inputs before processing:
  - Checks poem_data is not nil
  - Checks embedding exists and is a valid vector
  - Checks all comparison targets have embeddings
  - Error messages include poem ID, context, and remediation steps
- process_poem_batch(): Wraps calculation in pcall to add thread context, then re-raises:
  - Adds thread ID and batch progress to error messages
  - Returns (processed, 0): the error count is always 0 because any error aborts immediately
- Inline thread worker: Same validation and fail-fast pattern:
  - Validates embedding exists for each poem
  - Validates each comparison target
  - Validates JSON encoding, file write, and rename operations
  - All failures include specific remediation steps
Error Message Format:
Thread X FAILED: [specific failure type]
Context: [what was being attempted]
[relevant data: poem ID, file paths, etc.]
Remedy: [specific action to fix the issue]
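The wrap-and-re-raise pattern behind this format might look like the sketch below. The names `format_failure`, `check_embedding`, and `process_batch` are hypothetical stand-ins and are not the project's actual functions; the point is that pcall is used only to attach thread context before the error is raised again, never to swallow it.

```lua
-- Build the four-line failure message.
local function format_failure(thread_id, failure, context, data, remedy)
  return table.concat({
    ("Thread %s FAILED: %s"):format(tostring(thread_id), failure),
    "Context: " .. context,
    data,
    "Remedy: " .. remedy,
  }, "\n")
end

-- Validation raises immediately when an embedding is missing.
local function check_embedding(poem_id, embeddings)
  if embeddings[poem_id] == nil then
    error("no embedding for poem " .. tostring(poem_id), 0)
  end
end

-- Catch only to add thread and batch context, then re-raise (fail fast).
local function process_batch(thread_id, poem_ids, embeddings)
  for n, poem_id in ipairs(poem_ids) do
    local ok, err = pcall(check_embedding, poem_id, embeddings)
    if not ok then
      error(format_failure(thread_id, tostring(err),
        ("poem %d of %d in batch"):format(n, #poem_ids),
        "poem ID: " .. tostring(poem_id),
        "regenerate the embedding for this poem, then rerun"), 0)
    end
  end
  return #poem_ids
end

local ok, err = pcall(process_batch, 3, { 7, 42 }, { [7] = { 0.1 } })
print(ok)
print(err)
```

Passing level 0 to error() keeps the message free of an extra file:line prefix, so the first line stays in the documented "Thread X FAILED" form.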
Related: Updated docs/data-flow-architecture.md to reflect fail-fast philosophy.