issues/9-003-optimize-centroid-calculation-and-parallelization.md

Issue 9-003: Optimize Centroid Calculation and Parallelization

Current Behavior

The diversity sequence computation in scripts/precompute-diversity-sequences and scripts/generate-html-parallel recalculates the centroid from scratch at every iteration:

-- Current: O(N × 768) per iteration where N grows from 1 to 7,792
local function calculate_centroid(embeddings_list)
    for _, emb in ipairs(embeddings_list) do      -- O(N) loop
        for i = 1, dim do                          -- O(768) per embedding
            centroid[i] = centroid[i] + emb[i]
        end
    end
    for i = 1, dim do
        centroid[i] = centroid[i] / #embeddings_list
    end
end

Total centroid calculation cost per sequence:

  • Sum of (1 + 2 + 3 + ... + 7,792) × 768 = ~23.3 billion floating-point operations

Current parallelization model:

  • Each thread computes ONE complete sequence independently
  • Threads don't share work within a sequence
  • Results written to temp files on disk (SSD wear, I/O overhead)

Intended Behavior

Optimization 1: Incremental Running Sum (No Division)

Maintain a running sum instead of recalculating from scratch:

-- Optimized: O(768) per iteration
local running_sum = {}
for i = 1, 768 do running_sum[i] = starting_embedding[i] end

-- When finding most distant poem, use running_sum directly
-- (cosine distance is scale-invariant, so no division needed!)
local dist = cosine_distance(running_sum, candidate_embedding)

-- When adding a new poem:
for i = 1, 768 do
    running_sum[i] = running_sum[i] + selected_embedding[i]
end

Why no division is needed:
Cosine similarity: cos(A,B) = (A·B) / (||A|| × ||B||)
Scaling A by constant k: cos(kA,B) = k(A·B) / (k||A|| × ||B||) = cos(A,B)
The constant cancels out, so sum gives same ranking as sum/count.

New cost per sequence:

  • 7,792 iterations × 768 additions = ~6 million operations
  • Speedup: ~4,000× for centroid maintenance

Optimization 2: Parallel Distance Comparisons Within Sequence

Instead of each thread handling a separate starting poem, parallelize the distance comparisons within a single sequence:

-- Main thread orchestrates, workers split comparison work
for iteration = 1, 7792 do
    local current_sum = running_sum  -- Shared read

    -- Split remaining poems across K workers
    local local_maxes = parallel_map(workers, function(worker_id)
        local my_start = (worker_id - 1) * chunk_size + 1
        local my_end = math.min(worker_id * chunk_size, #remaining)

        local max_dist, max_idx = -1, -1
        for i = my_start, my_end do
            local dist = cosine_distance(current_sum, remaining[i].embedding)
            if dist > max_dist then
                max_dist, max_idx = dist, i
            end
        end
        return {dist = max_dist, idx = max_idx}
    end)

    -- Main thread: find global max from local maxes
    local global_max = reduce(local_maxes, function(a, b)
        return a.dist > b.dist and a or b
    end)

    -- Update running sum (only main thread writes)
    local selected = remaining[global_max.idx]
    for i = 1, 768 do
        running_sum[i] = running_sum[i] + selected.embedding[i]
    end
    table.remove(remaining, global_max.idx)
end

Benefits:

  • K threads = ~K× speedup on the O(remaining × 768) bottleneck
  • Main thread is O(768) for updates, O(K) for max reduction
  • With 8 threads: ~8× faster distance comparisons

Optimization 3: Eliminate Temp Files (RAM-Only)

Store results in RAM instead of writing to disk:

  • Avoid SSD write wear
  • Eliminate filesystem overhead
  • Use effil.channel or shared tables for result collection

Suggested Implementation Steps

Phase A: Incremental Running Sum (Standalone) ✅ COMPLETED

  • [x] Modify calculate_centroid() to maintain running sum
  • [x] Remove division (use sum directly for cosine distance)
  • [ ] Verify correctness: same sequence output as before (pending full test run)
  • [ ] Benchmark: measure time per sequence before/after

Implementation Notes (2025-12-23):

  • Worker function now maintains running_sum table initialized from starting poem
  • Distance calculation uses raw sum: cosine_distance(running_sum, candidate_embedding)
  • After selecting max-distance poem, running sum updated with O(768) operation:

```lua
for i = 1, embedding_dim do
running_sum[i] = running_sum[i] + selected_embedding[i]
end
```

Phase B: Parallel Distance Comparisons

  • [ ] Create worker pool for distance calculations
  • [ ] Implement chunk-based distribution of remaining poems
  • [ ] Add synchronization for global max finding
  • [ ] Benchmark: measure speedup with different thread counts

Note: This optimization is deferred pending GPU acceleration work (Issue 9-001).
The current model of one thread per starting poem remains effective.

Phase C: RAM-Only Storage ✅ COMPLETED

  • [x] Replace temp file writes with effil thread :get() return values
  • [x] Collect all sequences in main thread memory
  • [x] Write single output file at end
  • [x] Verify no temp files created during execution

Implementation Notes (2025-12-23):

  • Worker function returns sequence directly via return sequence
  • Main thread collects results via thread:get() which blocks until thread completes
  • All sequences accumulated in all_results table in RAM
  • Single JSON write to diversity_sequences.json at end

Phase D: Per-Thread File I/O ✅ COMPLETED (New Optimization)

  • [x] Eliminate effil.table conversion bottleneck (was 30-60 seconds)
  • [x] Each worker thread loads embeddings.json independently
  • [x] OS file cache ensures subsequent reads are fast

Implementation Notes (2025-12-23):

  • Worker signature changed: receives file path instead of effil.table
  • Each thread parses 62MB JSON independently (OS caches after first read)
  • Eliminated effil.table() conversion which was copying 5M+ values

Complexity Analysis

OperationCurrentOptimized
Centroid updateO(N × 768)O(768)
Distance comparisonsO(remaining × 768)O(remaining × 768 / K)
File I/O per sequence1 write + 1 read0
Total per sequenceO(N² × 768)O(N × 768 / K)

With K=8 threads and incremental centroid:

  • Current: ~42 hours for 7,793 sequences
  • Optimized: ~1-2 hours estimated

Mathematical Proof: Scale Invariance of Cosine Distance

For vectors A and B, cosine similarity is:

cos(A, B) = (A · B) / (||A|| × ||B||)

For scaled vector kA (where k is any positive scalar):

cos(kA, B) = (kA · B) / (||kA|| × ||B||)
           = k(A · B) / (k||A|| × ||B||)
           = (A · B) / (||A|| × ||B||)
           = cos(A, B)

Therefore, dividing the running sum by the count has no effect on which poem is selected as "most distant." We can use the raw sum.

Related Issues

  • Issue 9-001: Implement Vulkan compute infrastructure (GPU acceleration)
  • Issue 9-002: Port similarity matrix to Vulkan
  • Issue 9-003a: Remove unnecessary division from source files (sub-issue)
  • Issue 8-002: Implement multi-threaded HTML generation (current threading model)

Sub-Issues

  • 9-003a: The original calculate_embedding_centroid() in src/flat-html-generator.lua and calculate_ultra_centroid() in src/centroid-generator.lua still contain the unnecessary division. Created 2025-12-25 to address code consistency.

Files to Modify

  • scripts/precompute-diversity-sequences (lines 70-141)
  • scripts/generate-html-parallel (lines 320-410)

Dependencies

  • effil library for Lua multithreading
  • No new dependencies required

Phase: 9 (GPU Acceleration) - though Phase A can be implemented independently

Priority: High (significant performance improvement with modest code changes)

Created: 2025-12-23

Status: In Progress (Phase A, C, D complete; Phase B deferred to GPU work)

Last Updated: 2025-12-23