issues/9-004-gpu-accelerate-maze-algorithm.md
Issue 9-004: GPU-Accelerate Maze Algorithm
Overview
Accelerate the dimension-extreme + similarity-filter maze algorithm (Issue 11-002) using the Vulkan compute infrastructure (Issue 9-001). This issue is blocked by both Phase 9 infrastructure AND Phase 11 algorithm design.
Current Behavior (CPU Implementation - Issue 11-002)
The maze algorithm has two stages:
Stage 1: Dimension-Extreme Computation
-- For each poem P (7,793):
-- For each dimension D (768):
-- Find poem Q where |Q.embedding[D] - P.embedding[D]| is maximum
--
-- Naive: O(7,793 × 768 × 7,793) = 46 billion comparisons
-- With pre-sorting: O(768 × n log n) + O(n × 768) = ~81 million ops
Stage 2: Similarity Filtering
-- For each poem P (7,793):
-- For each of ~500 unique dimension-extreme candidates:
-- Compute full cosine similarity (768 dimensions)
-- Select top 6 by similarity
--
-- Total: O(7,793 × 500 × 768) = ~3 billion ops
CPU Estimated Time: ~3-5 minutes total
Intended Behavior (GPU Implementation)
Use Vulkan compute shaders to parallelize both stages:
GPU Stage 1: Parallel Dimension-Extreme Finding
Approach A: Naive Parallel (Simple, Memory-Heavy)
For each dimension D (768 dispatches):
Dispatch shader that computes |embedding[D] - current_val| for all poems
Parallel reduction to find argmax
Store result
Memory: 7,793 floats per dispatch = 31KB (trivial)
Dispatches: 768 × 7,793 = 5.9 million (too many!)
Approach B: Per-Poem Parallel (Better)
For each poem P (7,793 dispatches):
Upload P's embedding values to uniform buffer
Dispatch shader with 768 workgroups (one per dimension)
Each workgroup finds the extreme for its dimension
Write 768 poem IDs to output buffer
Dispatches: 7,793 (manageable)
Per-dispatch work: 768 workgroups × 7,793 comparisons = 6M ops (fast)
Approach C: Fully Batched (Optimal)
Single dispatch with 2D workgroups:
Dimension (X): 768 workgroups
Poem batch (Y): 128 poems per dispatch
Pre-upload all embeddings (~20MB) once
For each batch of 128 poems:
Dispatch 768 × 128 workgroups
Each finds dimension-extreme for (poem, dimension) pair
Write to structured output buffer
Dispatches: ceil(7,793 / 128) = 61 dispatches
Total GPU time: ~100ms estimated
GPU Stage 2: Parallel Cosine Similarity
Reuse the cosine distance shader from Issue 9-001c with modifications:
// Instead of comparing to a single centroid, compare to N candidates
// Output: similarity scores for top-K selection
layout(set = 0, binding = 0) readonly buffer Embeddings {
float embeddings[]; // All 7,793 × 768 floats
};
layout(set = 0, binding = 1) readonly buffer CandidateIndices {
uint candidates[]; // 768 candidate poem IDs per source poem
};
layout(set = 0, binding = 2) writeonly buffer Similarities {
float similarities[]; // 768 scores per source poem
};
// Dispatch: one workgroup per (source_poem, candidate) pair
// Or: one workgroup per source_poem, loop over candidates
After GPU computes similarities, CPU performs simple top-6 selection (negligible time).
Compute Shader Designs
Shader 1: dimension_extreme.comp
#version 450
layout(local_size_x = 256) in;
layout(set = 0, binding = 0) readonly buffer Embeddings {
float embeddings[]; // 7,793 × 768 floats
};
layout(set = 0, binding = 1) readonly buffer SourceValues {
float source_values[768]; // Current poem's embedding
};
layout(set = 0, binding = 2) writeonly buffer Extremes {
uint extreme_ids[]; // 768 poem IDs (output)
};
layout(push_constant) uniform Constants {
uint num_poems;
uint source_poem_id; // To exclude from search
};
shared float s_max_diff[256];
shared uint s_max_idx[256];
void main() {
uint dim = gl_WorkGroupID.x; // Which dimension (0-767)
uint tid = gl_LocalInvocationID.x;
uint poems_per_thread = (num_poems + 255) / 256;
float source_val = source_values[dim];
float local_max_diff = 0.0;
uint local_max_idx = 0;
// Each thread scans a portion of poems
for (uint i = 0; i < poems_per_thread; i++) {
uint poem_id = tid * poems_per_thread + i;
if (poem_id >= num_poems || poem_id == source_poem_id) continue;
float val = embeddings[poem_id * 768 + dim];
float diff = abs(val - source_val);
if (diff > local_max_diff) {
local_max_diff = diff;
local_max_idx = poem_id;
}
}
// Store in shared memory
s_max_diff[tid] = local_max_diff;
s_max_idx[tid] = local_max_idx;
barrier();
// Parallel reduction to find global max
for (uint stride = 128; stride > 0; stride >>= 1) {
if (tid < stride) {
if (s_max_diff[tid + stride] > s_max_diff[tid]) {
s_max_diff[tid] = s_max_diff[tid + stride];
s_max_idx[tid] = s_max_idx[tid + stride];
}
}
barrier();
}
// Thread 0 writes result for this dimension
if (tid == 0) {
extreme_ids[dim] = s_max_idx[0];
}
}
Shader 2: batch_cosine_similarity.comp
Reuse/extend 9-001c for batch candidate comparison:
#version 450
layout(local_size_x = 64) in;
layout(set = 0, binding = 0) readonly buffer Embeddings {
float embeddings[];
};
layout(set = 0, binding = 1) readonly buffer SourceEmbedding {
float source[768];
};
layout(set = 0, binding = 2) readonly buffer CandidateIDs {
uint candidate_ids[]; // Up to 768 candidate poem IDs
};
layout(set = 0, binding = 3) writeonly buffer Similarities {
float similarities[];
};
layout(push_constant) uniform Constants {
uint num_candidates;
};
void main() {
uint cand_idx = gl_GlobalInvocationID.x;
if (cand_idx >= num_candidates) return;
uint poem_id = candidate_ids[cand_idx];
uint base = poem_id * 768;
float dot = 0.0, norm1 = 0.0, norm2 = 0.0;
for (uint i = 0; i < 768; i++) {
float a = source[i];
float b = embeddings[base + i];
dot += a * b;
norm1 += a * a;
norm2 += b * b;
}
similarities[cand_idx] = dot / (sqrt(norm1) * sqrt(norm2));
}
Host-Side Algorithm
void compute_maze_exits_gpu(VkComputeContext* ctx,
float* embeddings, // 7,793 × 768 floats
uint32_t num_poems,
MazeCache* output) {
// Upload embeddings once (20MB → GPU)
upload_embeddings(ctx, embeddings, num_poems);
for (uint32_t poem_id = 0; poem_id < num_poems; poem_id++) {
// Stage 1: Find 768 dimension extremes
upload_source_embedding(ctx, &embeddings[poem_id * 768]);
dispatch_dimension_extreme_shader(ctx, num_poems, poem_id);
uint32_t extremes[768];
download_extreme_ids(ctx, extremes);
// Deduplicate extremes (CPU - fast)
uint32_t unique[768];
uint32_t num_unique = deduplicate(extremes, 768, unique);
// Stage 2: Compute similarities to unique candidates
upload_candidate_ids(ctx, unique, num_unique);
dispatch_batch_similarity_shader(ctx, num_unique);
float similarities[768];
download_similarities(ctx, similarities, num_unique);
// Select top 6 (CPU - fast)
select_top_k(unique, similarities, num_unique, 6, output->exits[poem_id]);
}
}
Performance Analysis
CPU Baseline (Issue 11-002)
| Stage | Time |
|---|---|
| Dimension-extreme (all poems) | ~60 seconds |
| Similarity filtering (all poems) | ~120 seconds |
| Total | ~3 minutes |
GPU Accelerated
| Stage | Time | Speedup |
|---|---|---|
| Upload embeddings | 50ms | N/A |
| Dimension-extreme (per poem) | 1-2ms | 10x |
| Similarity filtering (per poem) | 0.5ms | 20x |
| Download + CPU postprocess | 1ms | N/A |
| Per-poem total | ~4ms | 15x |
| All 7,793 poems | ~31 seconds | 6x |
Note: The CPU baseline is already fast (~3 min). GPU acceleration provides 6x speedup but may not be worth the complexity unless combined with other GPU workloads in the same session.
Combined Benefit
If running maze + diversity + similarity matrix in same session:
- Single embeddings upload (~50ms)
- Reuse shaders across all algorithms
- Total session time: ~30 minutes (vs. hours without GPU)
Suggested Implementation Steps
Step 1: Implement dimension_extreme.comp Shader
- [ ] Write GLSL compute shader
- [ ] Compile to SPIR-V
- [ ] Validate with test data
Step 2: Extend Cosine Distance Shader for Batching
- [ ] Modify 9-001c to accept candidate ID list
- [ ] Add batch processing support
Step 3: Host-Side Integration
- [ ] Add maze-specific functions to Vulkan wrapper
- [ ] Implement upload/download for maze data structures
Step 4: Lua/C Interface
- [ ] Expose GPU maze computation to Lua
- [ ] Fallback to CPU if GPU unavailable
Step 5: Integrate with Maze Pipeline
- [ ] Modify 11-002a to use GPU when available
- [ ] Add GPU/CPU mode flag
Dependencies
Blocked By:
- 9-001b: Vulkan compute wrapper (infrastructure)
- 9-001c: Cosine distance shader (reused for Stage 2)
- 11-002: Maze algorithm design (algorithm definition)
- 11-002a: Dimension-extreme index (CPU reference implementation)
- 11-002b: Similarity-filtered selection (CPU reference implementation)
Enables:
- Faster maze generation in combined GPU sessions
- Pattern for future embedding-space algorithms
Files to Create
libs/vulkan-compute/shaders/dimension_extreme.complibs/vulkan-compute/shaders/batch_cosine_similarity.complibs/vulkan-compute/maze_compute.c(host-side functions)src/maze-gpu-accelerator.lua(Lua interface)
Quality Assurance Criteria
- [ ] GPU results match CPU reference implementation exactly
- [ ] No validation layer errors
- [ ] Handles full 7,793 poem dataset
- [ ] Graceful fallback when GPU unavailable
- [ ] Memory usage stays within 11GB VRAM
Related Issues
- 9-001: Vulkan compute infrastructure (parent)
- 9-001c: Cosine distance shader (reused)
- 9-001d: Diversity sequence GPU (similar pattern)
- 11-002: Maze algorithm design
- 11-002a: Dimension-extreme index (CPU version)
- 11-002b: Similarity-filtered selection (CPU version)
Phase: 9 (GPU Acceleration)
Priority: Low (optimization, not required for MVP)
Created: 2025-12-25
Status: Open
Blocked By: 9-001b, 9-001c, 11-002, 11-002a, 11-002b
Appendix: Dependency Graph
Phase 9 (GPU Infrastructure) Phase 11 (Maze Algorithm)
│ │
▼ ▼
┌───────────────┐ ┌────────────────┐
│ 9-001a: Setup │ │ 11-002: Design │
└───────┬───────┘ └───────┬────────┘
│ │
▼ ▼
┌───────────────┐ ┌─────────────────┐
│ 9-001b: Wrap │ │ 11-002a: Dim-Ex │
└───────┬───────┘ └────────┬────────┘
│ │
▼ ▼
┌───────────────┐ ┌─────────────────┐
│ 9-001c: Cos │ │ 11-002b: Filter │
└───────┬───────┘ └────────┬────────┘
│ │
└──────────────┬───────────────────┘
│
▼
┌─────────────────┐
│ 9-004: GPU Maze │ ← THIS ISSUE
└─────────────────┘