issues/9-003-optimize-centroid-calculation-and-parallelization.md
Issue 9-003: Optimize Centroid Calculation and Parallelization
Current Behavior
The diversity sequence computation in scripts/precompute-diversity-sequences and scripts/generate-html-parallel recalculates the centroid from scratch at every iteration:
-- Current: O(N × 768) per iteration where N grows from 1 to 7,792
local function calculate_centroid(embeddings_list)
for _, emb in ipairs(embeddings_list) do -- O(N) loop
for i = 1, dim do -- O(768) per embedding
centroid[i] = centroid[i] + emb[i]
end
end
for i = 1, dim do
centroid[i] = centroid[i] / #embeddings_list
end
end
Total centroid calculation cost per sequence:
- Sum of (1 + 2 + 3 + ... + 7,792) × 768 = ~23.3 billion floating-point operations
Current parallelization model:
- Each thread computes ONE complete sequence independently
- Threads don't share work within a sequence
- Results written to temp files on disk (SSD wear, I/O overhead)
Intended Behavior
Optimization 1: Incremental Running Sum (No Division)
Maintain a running sum instead of recalculating from scratch:
-- Optimized: O(768) per iteration
local running_sum = {}
for i = 1, 768 do running_sum[i] = starting_embedding[i] end
-- When finding most distant poem, use running_sum directly
-- (cosine distance is scale-invariant, so no division needed!)
local dist = cosine_distance(running_sum, candidate_embedding)
-- When adding a new poem:
for i = 1, 768 do
running_sum[i] = running_sum[i] + selected_embedding[i]
end
Why no division is needed:
Cosine similarity: cos(A,B) = (A·B) / (||A|| × ||B||)
Scaling A by constant k: cos(kA,B) = k(A·B) / (k||A|| × ||B||) = cos(A,B)
The constant cancels out, so sum gives same ranking as sum/count.
New cost per sequence:
- 7,792 iterations × 768 additions = ~6 million operations
- Speedup: ~4,000× for centroid maintenance
Optimization 2: Parallel Distance Comparisons Within Sequence
Instead of each thread handling a separate starting poem, parallelize the distance comparisons within a single sequence:
-- Main thread orchestrates, workers split comparison work
for iteration = 1, 7792 do
local current_sum = running_sum -- Shared read
-- Split remaining poems across K workers
local local_maxes = parallel_map(workers, function(worker_id)
local my_start = (worker_id - 1) * chunk_size + 1
local my_end = math.min(worker_id * chunk_size, #remaining)
local max_dist, max_idx = -1, -1
for i = my_start, my_end do
local dist = cosine_distance(current_sum, remaining[i].embedding)
if dist > max_dist then
max_dist, max_idx = dist, i
end
end
return {dist = max_dist, idx = max_idx}
end)
-- Main thread: find global max from local maxes
local global_max = reduce(local_maxes, function(a, b)
return a.dist > b.dist and a or b
end)
-- Update running sum (only main thread writes)
local selected = remaining[global_max.idx]
for i = 1, 768 do
running_sum[i] = running_sum[i] + selected.embedding[i]
end
table.remove(remaining, global_max.idx)
end
Benefits:
- K threads = ~K× speedup on the O(remaining × 768) bottleneck
- Main thread is O(768) for updates, O(K) for max reduction
- With 8 threads: ~8× faster distance comparisons
Optimization 3: Eliminate Temp Files (RAM-Only)
Store results in RAM instead of writing to disk:
- Avoid SSD write wear
- Eliminate filesystem overhead
- Use effil.channel or shared tables for result collection
Suggested Implementation Steps
Phase A: Incremental Running Sum (Standalone) ✅ COMPLETED
- [x] Modify
calculate_centroid()to maintain running sum - [x] Remove division (use sum directly for cosine distance)
- [ ] Verify correctness: same sequence output as before (pending full test run)
- [ ] Benchmark: measure time per sequence before/after
Implementation Notes (2025-12-23):
- Worker function now maintains
running_sumtable initialized from starting poem - Distance calculation uses raw sum:
cosine_distance(running_sum, candidate_embedding) - After selecting max-distance poem, running sum updated with O(768) operation:
```lua
for i = 1, embedding_dim do
running_sum[i] = running_sum[i] + selected_embedding[i]
end
```
Phase B: Parallel Distance Comparisons
- [ ] Create worker pool for distance calculations
- [ ] Implement chunk-based distribution of remaining poems
- [ ] Add synchronization for global max finding
- [ ] Benchmark: measure speedup with different thread counts
Note: This optimization is deferred pending GPU acceleration work (Issue 9-001).
The current model of one thread per starting poem remains effective.
Phase C: RAM-Only Storage ✅ COMPLETED
- [x] Replace temp file writes with effil thread
:get()return values - [x] Collect all sequences in main thread memory
- [x] Write single output file at end
- [x] Verify no temp files created during execution
Implementation Notes (2025-12-23):
- Worker function returns sequence directly via
return sequence - Main thread collects results via
thread:get()which blocks until thread completes - All sequences accumulated in
all_resultstable in RAM - Single JSON write to
diversity_sequences.jsonat end
Phase D: Per-Thread File I/O ✅ COMPLETED (New Optimization)
- [x] Eliminate effil.table conversion bottleneck (was 30-60 seconds)
- [x] Each worker thread loads embeddings.json independently
- [x] OS file cache ensures subsequent reads are fast
Implementation Notes (2025-12-23):
- Worker signature changed: receives file path instead of effil.table
- Each thread parses 62MB JSON independently (OS caches after first read)
- Eliminated
effil.table()conversion which was copying 5M+ values
Complexity Analysis
| Operation | Current | Optimized |
|---|---|---|
| Centroid update | O(N × 768) | O(768) |
| Distance comparisons | O(remaining × 768) | O(remaining × 768 / K) |
| File I/O per sequence | 1 write + 1 read | 0 |
| Total per sequence | O(N² × 768) | O(N × 768 / K) |
With K=8 threads and incremental centroid:
- Current: ~42 hours for 7,793 sequences
- Optimized: ~1-2 hours estimated
Mathematical Proof: Scale Invariance of Cosine Distance
For vectors A and B, cosine similarity is:
cos(A, B) = (A · B) / (||A|| × ||B||)
For scaled vector kA (where k is any positive scalar):
cos(kA, B) = (kA · B) / (||kA|| × ||B||)
= k(A · B) / (k||A|| × ||B||)
= (A · B) / (||A|| × ||B||)
= cos(A, B)
Therefore, dividing the running sum by the count has no effect on which poem is selected as "most distant." We can use the raw sum.
Related Issues
- Issue 9-001: Implement Vulkan compute infrastructure (GPU acceleration)
- Issue 9-002: Port similarity matrix to Vulkan
- Issue 9-003a: Remove unnecessary division from source files (sub-issue)
- Issue 8-002: Implement multi-threaded HTML generation (current threading model)
Sub-Issues
- 9-003a: The original
calculate_embedding_centroid()insrc/flat-html-generator.luaandcalculate_ultra_centroid()insrc/centroid-generator.luastill contain the unnecessary division. Created 2025-12-25 to address code consistency.
Files to Modify
scripts/precompute-diversity-sequences(lines 70-141)scripts/generate-html-parallel(lines 320-410)
Dependencies
- effil library for Lua multithreading
- No new dependencies required
Phase: 9 (GPU Acceleration) - though Phase A can be implemented independently
Priority: High (significant performance improvement with modest code changes)
Created: 2025-12-23
Status: In Progress (Phase A, C, D complete; Phase B deferred to GPU work)
Last Updated: 2025-12-23