docs/effil-usage-patterns.md
effil Library: Usage Patterns and Antipatterns
Overview
effil is a multithreading library for Lua (including LuaJIT) that enables parallel computation through thread spawning and shared data structures. Used incorrectly, it can degrade performance catastrophically.
The Critical Pattern: Copy effil.table to Local
RULE: Always copy effil.table() to a local Lua table at worker thread start.
```lua
-- CORRECT: Copy at worker start (used once)
local function worker_thread(shared_data)
    -- Copy effil.table to local table (ONE TIME)
    local local_data = {}
    for k, v in pairs(shared_data) do
        local_data[k] = v
    end
    -- Use local_data for all subsequent access (NO SYNCHRONIZATION)
    for i = 1, 1000 do
        local value = local_data[i] -- Fast local access
        -- ... do work ...
    end
end

-- WRONG: Access effil.table directly (synchronization on every access)
local function bad_worker_thread(shared_data)
    for i = 1, 1000 do
        local value = shared_data[i] -- Synchronization overhead! SLOW!
    end
end
```
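The copy step can be factored into a small helper so every worker performs it the same way. The following is a minimal sketch, not part of effil: `copy_to_local` is a hypothetical name, and the iterator is a parameter because whether the built-in `pairs` can walk an effil.table depends on the Lua runtime (on runtimes where it cannot, passing `effil.pairs` is the assumed alternative, if your effil version provides it). The copy is shallow, so nested shared tables would remain shared. A plain table stands in for the shared data so the sketch runs without effil installed.

```lua
-- Shallow-copy any table-like value into a plain Lua table.
-- `iter` is the iterator factory to use; it defaults to the built-in pairs.
-- Assumption: pass effil.pairs here if pairs() cannot iterate an effil.table
-- on your runtime.
local function copy_to_local(shared, iter)
    iter = iter or pairs
    local copy = {}
    for k, v in iter(shared) do
        copy[k] = v
    end
    return copy
end

-- Stand-in for data received from the parent thread (plain table for the demo).
local received = { 10, 20, 30, name = "example" }
local local_data = copy_to_local(received)

local total = 0
for i = 1, #local_data do
    total = total + local_data[i] -- plain table access, no locking
end
print(total) --> 60
```

Inside a worker, the call would sit on the first line of the thread function, before any loop touches the data.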
Use Cases
✅ GOOD: Immutable Data Passed to Independent Workers
Pattern: Producer creates data → pass to worker → worker processes independently
Examples:
- File Writing (Issue 9-002c)
```lua
for i = 1, num_files do
    local data = generate_data(i)
    local shared = effil.table(data) -- Shared copy, never mutated afterwards
    spawn_writer(shared) -- Worker writes file independently
end
```
- GPU computes similarities → pass to writer → writer writes JSON
- Each file is independent (no shared state between workers)
- Proven in generate-html-parallel
- HTML Generation (scripts/generate-html-parallel)
```lua
local shared_poems = effil.table(all_poems)
local shared_colors = effil.table(poem_colors)
for _, poem in ipairs(all_poems) do
    spawn_worker(shared_poems, shared_colors, poem)
end
```
- Worker copies effil.table to local at start (lines 213, 342, 556)
- Worker generates HTML using local copies (fast)
- No synchronization during HTML generation
Why it works:
- Data passed once, copied once
- Workers operate independently
- No contention, no synchronization overhead
❌ BAD: Constantly Mutating Shared State
Pattern: Multiple threads update shared data structure continuously
Example: Diversity cache pre-computation (Issue 9-001f)
```lua
-- BAD: Shared centroid updated 7,796 times per sequence
local shared_centroid = effil.table()
local shared_mask = effil.table()
for i = 1, 7797 do -- Per sequence
    for iteration = 1, 7796 do -- Per iteration
        -- Update shared centroid (SYNCHRONIZATION!)
        shared_centroid[dim] = new_value
        -- Update shared mask (SYNCHRONIZATION!)
        shared_mask[poem_idx] = 0
    end
end
-- Result: 60 million updates × synchronization overhead = ~17 billion ops
-- Time: 42+ hours (vs GPU: 58 seconds)
```
Why it fails:
- effil.table access requires inter-thread synchronization
- Each read/write locks the table
- 60 million updates = catastrophic overhead
- CPU-bound synchronization defeats parallelism
Solution: Move state to GPU (Issue 9-001g)
```lua
-- GPU maintains centroids, masks internally (no CPU synchronization)
vkd_batch_compute(embeddings, num_poems) -- All state on GPU
-- Result: 58 seconds (996× speedup)
```
Performance Characteristics
effil.table() Access Cost
| Operation | Cost | When to Use |
|---|---|---|
| Create effil.table(data) | ~100 μs | Once per worker spawn |
| Copy to local table | ~50 μs per 1,000 elements | Once at worker start |
| Access effil.table directly | ~1-10 μs per access | NEVER (use local copy) |
| Access local table | ~10 ns per access | Always (after copy) |
Speedup from copying: 100-1,000× faster access after one-time copy.
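To see where copying pays off, the figures in the table can be plugged into a rough cost model. This is a back-of-the-envelope sketch that takes the table's estimates at face value; `estimate_ns` and the constant names are hypothetical, and real costs will vary with hardware and contention.

```lua
-- Figures from the table above, in nanoseconds (rough estimates only).
local COPY_NS_PER_ELEMENT = 50    -- ~50 us per 1,000 elements
local LOCAL_ACCESS_NS     = 10    -- ~10 ns per local table access
local DIRECT_ACCESS_NS_LO = 1000  -- ~1 us per effil.table access
local DIRECT_ACCESS_NS_HI = 10000 -- ~10 us per effil.table access

-- Estimate total access cost for a worker that holds `num_elements`
-- and performs `num_accesses` reads.
local function estimate_ns(num_elements, num_accesses)
    return {
        copy_then_local = num_elements * COPY_NS_PER_ELEMENT
                        + num_accesses * LOCAL_ACCESS_NS,
        direct_lo = num_accesses * DIRECT_ACCESS_NS_LO,
        direct_hi = num_accesses * DIRECT_ACCESS_NS_HI,
    }
end

local est = estimate_ns(1000, 1000)
print(string.format("copy + local: %d ns", est.copy_then_local))
print(string.format("direct:       %d to %d ns", est.direct_lo, est.direct_hi))
```

For 1,000 elements read once each, the model gives about 60 μs for copy-then-local against 1,000 to 10,000 μs for direct access, and the gap widens with every additional pass over the data.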
Threading Overhead
| Operation | Cost | Notes |
|---|---|---|
| Spawn effil.thread() | ~500 μs | Use thread pool to amortize |
| thread:get() blocking | ~10 μs | Minimal if worker is done |
| thread:get(0) non-blocking | ~1 μs | Use for polling |
| effil.sleep(1, "ms") | 1 ms | Use when polling for completed threads |
Decision Tree: Should I Use effil?
```
Is this a multi-threaded task?
├─ NO → Use single-threaded code
└─ YES → Continue
Do threads need to share mutable state?
├─ YES → ❌ DON'T use effil
│   → Move computation to GPU (if vector ops)
│   → Use process-based parallelism (if I/O bound)
│   → Use message passing (if small updates)
└─ NO → Continue
Are threads doing independent work on immutable data?
├─ YES → ✅ USE effil
│   → Pass data via effil.table()
│   → Copy to local at worker start
│   → Workers operate independently
└─ NO → Reconsider architecture
```
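The tree above can also be written as a small function, which is handy for design reviews or checklists. This is an illustrative sketch only: `recommend` and the field names on `task` are invented here and are not part of effil. The tree does not say what to do when state is shared but none of the three alternatives fits, so the sketch assumes "reconsider-architecture" in that case.

```lua
-- Hypothetical encoding of the decision tree; the field names on `task`
-- are invented for this sketch and are not part of effil.
local function recommend(task)
    if not task.multi_threaded then
        return "single-threaded"
    end
    if task.shares_mutable_state then
        -- effil is ruled out; pick the alternative that matches the workload.
        if task.vector_ops then return "gpu" end
        if task.io_bound then return "processes" end
        if task.small_updates then return "message-passing" end
        return "reconsider-architecture" -- assumption: not covered by the tree
    end
    if task.independent_immutable_work then
        return "effil"
    end
    return "reconsider-architecture"
end

print(recommend({ multi_threaded = true, independent_immutable_work = true })) --> effil
```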
Common Patterns
Pattern 1: Thread Pool for File I/O
```lua
local active_threads = {}
local max_concurrent = 8
for i = 1, num_tasks do
    local task_data = prepare_task(i)
    local shared_data = effil.table(task_data)
    -- Wait for slot if pool is full
    while #active_threads >= max_concurrent do
        reap_completed_threads(active_threads)
        if #active_threads >= max_concurrent then
            effil.sleep(1, "ms") -- 1 ms, with the unit stated explicitly
        end
    end
    -- Spawn worker
    table.insert(active_threads, effil.thread(worker)(shared_data))
end
-- Wait for all
for _, thread in ipairs(active_threads) do
    thread:get()
end
```
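Pattern 1 calls `reap_completed_threads` without defining it. One possible shape is sketched below, assuming each handle exposes a `status()` method that returns a string such as "running" or "completed", as effil thread handles do. Anything that is neither "running" nor "paused" is treated as finished, which means failed threads are dropped silently; production code would likely want to inspect their errors first. Fake handles are used so the sketch runs without effil installed.

```lua
-- Hypothetical implementation of reap_completed_threads.
-- Removes finished handles in place and returns how many were removed.
local function reap_completed_threads(active_threads)
    local reaped = 0
    -- Walk backwards so table.remove does not skip entries.
    for i = #active_threads, 1, -1 do
        local status = active_threads[i]:status()
        if status ~= "running" and status ~= "paused" then
            table.remove(active_threads, i)
            reaped = reaped + 1
        end
    end
    return reaped
end

-- Fake handles standing in for effil threads (demo only).
local function fake_thread(status)
    return { status = function() return status end }
end

local pool = {
    fake_thread("running"), fake_thread("completed"),
    fake_thread("failed"), fake_thread("running"),
}
local removed = reap_completed_threads(pool)
print(removed, #pool)
```

Removing entries in place keeps the list a dense array, so the `#active_threads` check in the spawn loop stays accurate.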
Pattern 2: Batch Processing with Workers
```lua
local BATCH_SIZE = 100
for batch_start = 1, num_items, BATCH_SIZE do
    local batch_data = prepare_batch(batch_start, BATCH_SIZE)
    local shared_batch = effil.table(batch_data)
    -- Spawn workers for batch
    local workers = {}
    for i = 1, 8 do
        workers[i] = effil.thread(batch_worker)(shared_batch, i, 8)
    end
    -- Wait for batch completion
    for _, worker in ipairs(workers) do
        worker:get()
    end
end
```
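The `batch_worker(shared_batch, i, 8)` call suggests that each worker receives its index and the worker count and picks its own share of the batch. A minimal sketch of one such scheme (strided assignment) follows. The body of `batch_worker` and the `process_item` placeholder are assumptions, not the original implementation, and a plain table stands in for the shared batch.

```lua
-- Hypothetical batch_worker: worker `worker_id` of `num_workers` handles
-- items worker_id, worker_id + num_workers, worker_id + 2 * num_workers, ...
local function batch_worker(batch, worker_id, num_workers)
    -- Placeholder for the real per-item work.
    local function process_item(item)
        return item * 2
    end

    -- One-time copy to a plain table, as recommended for effil.table input.
    local local_batch = {}
    for k, v in pairs(batch) do
        local_batch[k] = v
    end

    local results = {}
    for i = worker_id, #local_batch, num_workers do
        results[#results + 1] = process_item(local_batch[i])
    end
    return results
end

-- Demo with a plain table: 10 items split across 3 workers.
local batch = {}
for i = 1, 10 do batch[i] = i end
local r1 = batch_worker(batch, 1, 3) -- items 1, 4, 7, 10
local r2 = batch_worker(batch, 2, 3) -- items 2, 5, 8
local r3 = batch_worker(batch, 3, 3) -- items 3, 6, 9
print(#r1, #r2, #r3)
```

Because every worker reads a disjoint set of indices and writes only to its own result table, no two workers touch the same state.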
Antipatterns
❌ Antipattern 1: Direct effil.table Access in Loop
```lua
-- BAD: Synchronization on every iteration
local shared_data = effil.table(large_array)
local function bad_worker(shared)
    for i = 1, 10000 do
        local val = shared[i] -- SLOW! (10,000 synchronizations)
    end
end

-- GOOD: Copy once, access locally
local function good_worker(shared)
    local local_data = {}
    for k, v in pairs(shared) do local_data[k] = v end
    for i = 1, 10000 do
        local val = local_data[i] -- FAST! (no synchronization)
    end
end
```
❌ Antipattern 2: Shared Accumulator
```lua
-- BAD: Multiple threads updating shared counter
local shared_sum = effil.table({total = 0})
local function bad_worker(shared, start, stop)
    for i = start, stop do
        shared.total = shared.total + compute(i) -- RACE CONDITION!
    end
end

-- GOOD: Each worker computes partial, main thread sums
local function good_worker(start, stop)
    local partial_sum = 0
    for i = start, stop do
        partial_sum = partial_sum + compute(i)
    end
    return partial_sum -- Return via thread:get()
end

-- Main thread
local total = 0
for _, thread in ipairs(threads) do
    local partial = thread:get()
    total = total + partial -- Single-threaded aggregation
end
```
❌ Antipattern 3: Shared Work Queue
```lua
-- BAD: Workers pulling from shared queue (contention)
local queue_items = {}
for i = 1, 1000 do queue_items[i] = i end
local shared_queue = effil.table(queue_items)
local function bad_worker(queue)
    -- The length check and the removal are separate steps, so two workers
    -- can both see a non-empty queue and then compete for the same item.
    while #queue > 0 do
        local item = table.remove(queue, 1) -- Lock contention!
        process(item)
    end
end

-- GOOD: Pre-assign work to workers (no sharing)
local function good_worker(items)
    local local_items = {}
    for k, v in pairs(items) do local_items[k] = v end
    for _, item in ipairs(local_items) do
        process(item)
    end
end

-- Partition work upfront
local threads = {}
for i = 1, 8 do
    local worker_items = get_partition(all_items, i, 8)
    local shared_items = effil.table(worker_items)
    threads[i] = effil.thread(good_worker)(shared_items)
end
```
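The partitioning step relies on a `get_partition` helper that is not shown. A minimal sketch is given below, assuming contiguous slices whose sizes differ by at most one item; the actual helper in the codebase may split the work differently.

```lua
-- Hypothetical get_partition: returns the index-th of `count` contiguous
-- slices of `items`, with slice sizes differing by at most one.
local function get_partition(items, index, count)
    local n = #items
    local first = math.floor((index - 1) * n / count) + 1
    local last = math.floor(index * n / count)
    local part = {}
    for i = first, last do
        part[#part + 1] = items[i]
    end
    return part
end

-- Demo: 10 items across 3 workers gives slices of 3, 3 and 4 items.
local all_items = {}
for i = 1, 10 do all_items[i] = i end
local p1 = get_partition(all_items, 1, 3)
local p2 = get_partition(all_items, 2, 3)
local p3 = get_partition(all_items, 3, 3)
print(#p1, #p2, #p3)
```

Each item lands in exactly one slice, so the workers never need to coordinate over who processes what.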
Related Issues
- Issue 9-001f: Remove effil dependency (may be obsolete)
- Issue 9-001g: Batch parallel diversity (why effil failed, GPU succeeded)
- Issue 9-002c: Parallelize file writing (correct effil use case)
- Issue 8-002: Multi-threaded HTML generation (effil success story)
Conclusion
effil is a useful tool when used correctly:
- ✅ Immutable data passed to independent workers
- ✅ One-time copy to local at worker start
- ✅ No shared mutable state
effil is catastrophically slow when misused:
- ❌ Constantly mutating shared state
- ❌ Direct access to effil.table in loops
- ❌ Lock contention between threads
General principle: If threads need to share mutable state, effil is the wrong tool. Use GPU computation, process-based parallelism, or message passing instead.