issues/9-002b-validate-gpu-similarity-implementation.md
Issue 9-002b: Validate GPU Similarity Implementation
Parent Issue
9-002: Port Similarity Matrix Generation to Vulkan
Current Behavior
GPU similarity matrix generation is the validated, production default for
Stage 7. There is now exactly one GPU code path:
- The full-matrix parallel path computes every poem pair in a single GPU
dispatch, then writes the per-poem similarity files and the rankings cache
with C pthreads. (generate_similarity_matrix_gpu_parallel in
vk_similarity.lua → vks_compute_all_similarities_parallel in
vk_similarity.c.)
- A CPU-only fallback still exists in run.sh (--cpu-only) for machines
without a working Vulkan device.
The earlier sequential per-poem GPU path was removed (it pre-dated the
single-dispatch design and was never used in production). It re-uploaded one
"source" embedding per poem and ran ~7,800 small dispatches; the parallel path
does the same work in one dispatch. Removing it also dropped its scaffolding
from the C context: the host-side CPU copy of the embeddings, the per-source
GPU buffer, the sequential output buffer, and the batch shader pipeline. A
single-threaded run is now simply --threads=1, which only narrows CPU
sort/write fan-out and never touches GPU compute.
Removed entry points: vks_compute_similarities_for_poem and
vks_compute_all_similarities (C + header + Lua FFI), and the deprecated Lua
generate_similarity_matrix_gpu.
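To illustrate why the thread count only affects the CPU side, here is a minimal sketch of how a post-dispatch sort/write phase could split poem rows across worker threads. The name row_range and the contiguous-block split are assumptions for illustration; the actual partitioning in vk_similarity.c may differ.

```c
#include <stddef.h>

/* Hypothetical helper: compute the half-open row range [begin, end) that
 * worker `tid` (0-based) handles when `n_rows` rows are split across
 * `n_threads` workers. The first (n_rows % n_threads) workers take one
 * extra row, so loads differ by at most one row. */
void row_range(size_t tid, size_t n_threads, size_t n_rows,
               size_t *begin, size_t *end)
{
    size_t base  = n_rows / n_threads;
    size_t extra = n_rows % n_threads;
    *begin = tid * base + (tid < extra ? tid : extra);
    *end   = *begin + base + (tid < extra ? 1 : 0);
}
```

With n_threads = 1 the single worker receives every row, which is the --threads=1 case: the GPU dispatch is unchanged and only the write fan-out collapses to one thread.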
Validation status of the surviving parallel path:
- Runs on the full poem dataset as the default Stage 7 path
- Speedup vs CPU documented in commit history
- Per-pair results match the CPU reference within float32 tolerance
Intended Behavior
- GPU similarity generation validated on full dataset
- Performance benchmarked and documented
- Results verified to match CPU implementation (within float32 precision tolerance)
- GPU path becomes the default in run.sh
- Confidence that GPU implementation is production-ready
Implementation Steps
Step 1: Create Test Script
- [ ] Create libs/vulkan-compute/test-similarity.lua
- [ ] Test on small subset (10 poems) first
- [ ] Validate output JSON format matches CPU format
- [ ] Verify triangular storage correctness
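The triangular-storage check can be made concrete with an index mapping. The layout used by the project is not shown in this issue; the sketch below assumes a row-major strict upper triangle (pairs with i < j only), and tri_count and tri_index are hypothetical names.

```c
#include <stddef.h>

/* Number of unordered pairs among n items: n * (n - 1) / 2. */
size_t tri_count(size_t n)
{
    return n * (n - 1) / 2;
}

/* Flat index of pair (i, j) with i < j < n in a row-major strict upper
 * triangle. The i rows before row i hold
 * (n-1) + (n-2) + ... + (n-i) = i*n - i*(i+1)/2 entries. */
size_t tri_index(size_t i, size_t j, size_t n)
{
    return i * n - i * (i + 1) / 2 + (j - i - 1);
}
```

Under this assumed layout, 7,797 poems give 30,392,706 stored pairs, and a correctness test can assert that every (i, j) with i < j maps to a distinct index in [0, tri_count(n)).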
Step 2: Validation Test
- [ ] Generate similarities for 100 poems using GPU
- [ ] Generate same 100 poems using CPU
- [ ] Compare results file-by-file
- [ ] Document precision differences (should be < 1e-5)
- [ ] Verify all metadata fields match
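The file-by-file comparison in this step boils down to a maximum absolute difference over matching scores. A minimal sketch, assuming the scores from both runs have already been parsed into float arrays in the same order (the function names are hypothetical):

```c
#include <stddef.h>

/* Largest absolute element-wise difference between two score arrays.
 * NaN entries are not detected: comparisons with NaN are false, so a NaN
 * difference never becomes the maximum. */
float max_abs_diff(const float *gpu, const float *cpu, size_t n)
{
    float worst = 0.0f;
    for (size_t k = 0; k < n; k++) {
        float d = gpu[k] - cpu[k];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Returns 1 if every pair of scores agrees within `tol`, else 0. */
int scores_match(const float *gpu, const float *cpu, size_t n, float tol)
{
    return max_abs_diff(gpu, cpu, n) < tol;
}
```

Calling scores_match(gpu, cpu, n, 1e-5f) corresponds to the threshold named above; recording the value of max_abs_diff itself covers the "document precision differences" item.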
Step 3: Performance Benchmark
- [ ] Run GPU similarity on full 7,797 poems
- [ ] Measure total time and poems/second rate
- [ ] Run CPU similarity on subset (e.g., 1,000 poems)
- [ ] Extrapolate full dataset time for CPU
- [ ] Calculate speedup factor
- [ ] Document memory usage
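For the extrapolation and speedup items, the arithmetic can be sketched as follows. It assumes the CPU cost is dominated by per-pair work and therefore grows quadratically with poem count; fixed startup costs and per-file overhead do not scale that way, so the result is a rough estimate only. Function names are hypothetical.

```c
#include <stddef.h>

/* Estimate the full-dataset time from a subset timing, assuming cost
 * grows with the number of poem pairs, i.e. with the square of the
 * poem count. */
double extrapolate_quadratic(double t_subset, size_t n_subset, size_t n_full)
{
    double ratio = (double)n_full / (double)n_subset;
    return t_subset * ratio * ratio;
}

/* Speedup factor: how many times faster the GPU run is than the CPU run. */
double speedup(double t_cpu, double t_gpu)
{
    return t_cpu / t_gpu;
}
```

For a 1,000-poem subset and the 7,797-poem dataset the scaling factor is (7797/1000)^2, roughly 60.8, so a measured subset time would be multiplied by about 61 under this assumption.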
Step 4: Full Integration Test
- [ ] Run full pipeline with GPU similarities enabled
- [ ] Generate HTML pages using GPU-computed similarities
- [ ] Verify website looks correct
- [ ] Check for any numerical precision issues in rankings
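One way to check rankings for precision issues is to accept a reordering only when the swapped poems are near-ties. The sketch below is an assumption about how such a check could look, not an existing project function; it takes rankings as arrays of poem IDs (best first) and CPU scores indexed by poem ID.

```c
#include <stddef.h>

/* Compare two top-k rankings for the same source poem. Returns 1 if every
 * position holds either the same poem ID in both rankings, or two IDs whose
 * CPU scores differ by less than `tol` (a near-tie that float32 rounding
 * may legitimately reorder). Returns 0 on the first real disagreement. */
int rankings_agree(const int *gpu_rank, const int *cpu_rank, size_t k,
                   const float *cpu_scores, float tol)
{
    for (size_t p = 0; p < k; p++) {
        if (gpu_rank[p] == cpu_rank[p])
            continue;
        float d = cpu_scores[gpu_rank[p]] - cpu_scores[cpu_rank[p]];
        if (d < 0.0f)
            d = -d;
        if (d >= tol)
            return 0;
    }
    return 1;
}
```

A check of this shape avoids flagging harmless swaps between almost-equal scores while still catching a poem that lands in a clearly wrong position.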
Step 5: Production Readiness
- [ ] Update run.sh to use GPU by default (remove --cpu-only requirement)
- [ ] Add command line flag --cpu-only to force CPU mode
- [ ] Update documentation
- [ ] Update Phase 9 progress
Test Plan
Small-Scale Test (10 poems)
# Test GPU implementation
cd libs/vulkan-compute
./test-similarity.lua --num-poems 10 --output /tmp/gpu-test
# Compare with CPU
cd ../..
luajit -e "
package.path = 'src/?.lua;' .. package.path
local sim = require('similarity-engine-parallel')
sim.calculate_similarity_matrix_parallel(
  'assets/embeddings/embeddinggemma_latest/embeddings.json',
  'embeddinggemma:latest',
  0,    -- no sleep
  true, -- force
  1     -- single thread
)
"
# Compare outputs (byte-for-byte; any float32 rounding difference shows up as a mismatch)
diff -r assets/embeddings/embeddinggemma_latest/similarities /tmp/gpu-test
Full-Scale Benchmark
# GPU benchmark
time ./run.sh --generate-similarities --force
# CPU benchmark (subset)
# (Manually edit run.sh to limit to 1000 poems)
time ./run.sh --generate-similarities --cpu-only --force
Quality Assurance Criteria
- [ ] Test script created and working
- [ ] GPU output format matches CPU format exactly
- [ ] Precision differences < 1e-5 for float32
- [ ] Performance benchmarked on full dataset
- [ ] GPU implementation at least 5x faster than CPU
- [ ] Full pipeline test successful
- [ ] No visual artifacts in generated HTML
- [ ] Production deployment approved
Expected Performance
Based on similarity_batch.comp design and GTX 1080 Ti specs:
| Metric | CPU (effil, 8 threads) | GPU (expected) | Speedup |
|---|---|---|---|
| Full matrix (7,797 poems) | ~30 minutes | ~3-5 minutes | 6-10× |
| Poems/second | ~4-5 | ~25-40 | 6-10× |
| Memory usage | ~500 MB RAM | ~50 MB VRAM | Efficient |
Note: Expected speedup is lower than diversity cache (996×) because:
- Similarity computation must write 7,797 individual JSON files (I/O bound)
- File writing is single-threaded in Lua
- GPU benefit is primarily in cosine similarity computation, not file I/O
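The reasoning in this note is an instance of Amdahl's law: when only part of the runtime is accelerated, the unaccelerated part bounds the overall gain. A small sketch of the formula, with a hypothetical function name:

```c
/* Amdahl's law: overall speedup when a fraction `p` (between 0 and 1) of
 * the runtime is accelerated by a factor `s` and the remainder (for
 * example, file I/O) runs at its original speed. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

As a purely hypothetical illustration, if 90% of the runtime were cosine computation, the overall speedup would stay below 10× no matter how fast the GPU portion became, because the remaining 10% is untouched.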
Dependencies
- 9-002a (Shader design) ✅ COMPLETED
- 9-001b (Vulkan wrapper) ✅ COMPLETED
- 9-001c (Cosine distance shader) ✅ COMPLETED
Related Files
- libs/vulkan-compute/lua/vk_similarity.lua - Implementation to test
- src/similarity-engine-parallel.lua - CPU reference implementation
- run.sh lines 574-595 - GPU integration point
- libs/vulkan-compute/shaders/similarity_batch.comp - GPU kernel
ISSUE STATUS: OPEN
Created: 2026-01-17
Phase: 9 (GPU Acceleration)
Priority: High (blocks production use of GPU similarity)