docs/research-tuple-embeddings.md
Research Report: Tuple-Based Embeddings (768 × 2D Structure)
Date: 2026-01-09
Project: Neocities Modernization - Poetry Similarity System
Research Question: What would it mean to use 768-dimensional embeddings where each dimension is a 2D tuple instead of a scalar?
Clarification: What We're Actually Asking
This report addresses a novel embedding structure where:
- Instead of 768 scalar values: [0.42, -0.13, 0.89, ...]
- We have 768 two-dimensional tuples: [(0.42, -0.13), (0.89, 0.34), (-0.21, 0.67), ...]
This is NOT about dimensionality reduction (768D → 2D). This is about changing the fundamental data type of each dimension from a scalar to a coordinate pair.
Current System vs Tuple-Based System
Current Scalar Embeddings
{
  "id": "poem-0042",
  "embedding": [
    -0.13334751,  // Dimension 0: scalar intensity
    0.017504867,  // Dimension 1: scalar intensity
    0.00754194,   // Dimension 2: scalar intensity
    ...           // 765 more scalar values
  ]
}
Properties:
- 768 numbers per poem
- Each dimension = single intensity value
- Total storage: 768 × 4 bytes = 3 KB per poem
Proposed Tuple Embeddings
{
  "id": "poem-0042",
  "embedding": [
    [-0.13334751, 0.017504867],  // Dimension 0: (x, y) coordinate
    [0.00754194, -0.023819013],  // Dimension 1: (x, y) coordinate
    [0.11234567, 0.98765432],    // Dimension 2: (x, y) coordinate
    ...                          // 765 more (x, y) pairs
  ]
}
Properties:
- 768 coordinate pairs per poem
- Each dimension = 2D point in its own feature space
- Total storage: 768 × 2 × 4 bytes = 6 KB per poem (2× current)
What Does Each Tuple Dimension Represent?
Conceptual Interpretations
Each (x, y) tuple in a dimension could represent:
1. Magnitude + Direction (Polar-Like Representation)
Dimension 0: "Sadness Feature"
x = intensity of sadness (how sad?)
y = directionality of sadness (melancholy vs grief vs nostalgia?)
Dimension 1: "Urban Theme"
x = urban vs rural scale
y = modern vs historical scale
Example:
Poem A: [(0.9, 0.1), ...] → Very sad, leaning melancholy
Poem B: [(0.9, 0.9), ...] → Very sad, leaning grief
Poem C: [(0.1, 0.5), ...] → Barely sad, neutral direction
2. Positive + Negative Axes (Bipolar Features)
Dimension 0: "Formality Spectrum"
x = formality (casual ← → formal)
y = technicality (simple ← → complex)
Dimension 1: "Emotional Valence"
x = positive emotions
y = negative emotions
Example:
Poem A: [(0.8, 0.2), ...] → Formal language, simple concepts
Poem B: [(-0.5, 0.9), ...] → Casual language, complex ideas
3. Independent Sub-Features
Dimension 0: "Nature Theme"
x = flora presence (plants, trees, flowers)
y = fauna presence (animals, insects, birds)
Dimension 1: "Time References"
x = past-oriented language
y = future-oriented language
4. Spatial Encoding in Feature Space
Each dimension represents a position in a learned 2D "concept space" where semantic relationships are encoded spatially. Similar concepts cluster together in the 2D subspace.
Mathematical Properties
Distance Metrics
With tuple-based embeddings, we need to define how to measure distance:
Option 1: Per-Dimension 2D Dot Products, Summed (Flat Cosine Similarity)
function tuple_cosine_similarity(embedding_a, embedding_b)
  local dot_product = 0
  local norm_a = 0
  local norm_b = 0
  for i = 1, 768 do
    -- Treat each tuple as a 2D vector in dimension i
    local ax, ay = embedding_a[i][1], embedding_a[i][2]
    local bx, by = embedding_b[i][1], embedding_b[i][2]
    -- 2D dot product for this dimension
    local dim_dot = ax * bx + ay * by
    dot_product = dot_product + dim_dot
    -- Accumulate the squared 2D magnitude for this dimension
    norm_a = norm_a + (ax * ax + ay * ay)
    norm_b = norm_b + (bx * bx + by * by)
  end
  -- Small epsilon avoids division by zero for all-zero embeddings
  return dot_product / (math.sqrt(norm_a) * math.sqrt(norm_b) + 1e-8)
end
This treats the embedding as 1536 flat dimensions (768 × 2).
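The equivalence with a flat 1536-dimensional cosine can be checked directly. The following is a minimal Python sketch of that check, not the project's implementation; the function names and the three-tuple sample values are illustrative assumptions.

```python
import math

def tuple_cosine_similarity(a, b):
    """Option 1: sum 2D dot products and squared norms over all tuples."""
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    norm_a = math.sqrt(sum(ax * ax + ay * ay for ax, ay in a))
    norm_b = math.sqrt(sum(bx * bx + by * by for bx, by in b))
    return dot / (norm_a * norm_b)

def flat_cosine_similarity(u, v):
    """Standard cosine similarity on flat lists of scalars."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def flatten(tuples):
    """[(x1, y1), (x2, y2), ...] -> [x1, y1, x2, y2, ...]"""
    return [value for pair in tuples for value in pair]

# Illustrative sample data (three tuples instead of 768).
a = [(0.9, 0.1), (0.2, -0.4), (-0.3, 0.7)]
b = [(0.8, 0.3), (-0.1, -0.5), (0.0, 0.6)]

tuple_score = tuple_cosine_similarity(a, b)
flat_score = flat_cosine_similarity(flatten(a), flatten(b))
print(math.isclose(tuple_score, flat_score))
```

Because the two scores agree, Option 1 adds no tuple-specific behavior: it is ordinary cosine similarity on a vector twice as long.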
Option 2: Per-Dimension 2D Similarity, Then Average
function tuple_dimension_aware_similarity(embedding_a, embedding_b)
  local total_similarity = 0
  for i = 1, 768 do
    local ax, ay = embedding_a[i][1], embedding_a[i][2]
    local bx, by = embedding_b[i][1], embedding_b[i][2]
    -- Cosine similarity within this 2D dimension
    local dot = ax * bx + ay * by
    local norm_a = math.sqrt(ax * ax + ay * ay)
    local norm_b = math.sqrt(bx * bx + by * by)
    local dim_similarity = dot / (norm_a * norm_b + 1e-8)
    total_similarity = total_similarity + dim_similarity
  end
  return total_similarity / 768
end
This treats each dimension as its own 2D similarity calculation.
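One consequence worth seeing in numbers: because every tuple is normalized on its own, a tuple with a tiny magnitude counts as much as a large one. The sketch below, in Python with made-up two-tuple embeddings and assumed function names, compares Option 1 and Option 2 on such a case.

```python
import math

def flat_cosine(a, b):
    """Option 1: cosine similarity over all tuple components at once."""
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    norm_a = math.sqrt(sum(ax * ax + ay * ay for ax, ay in a))
    norm_b = math.sqrt(sum(bx * bx + by * by for bx, by in b))
    return dot / (norm_a * norm_b)

def per_dimension_cosine(a, b, eps=1e-8):
    """Option 2: cosine similarity inside each tuple, then the mean."""
    total = 0.0
    for (ax, ay), (bx, by) in zip(a, b):
        dot = ax * bx + ay * by
        total += dot / (math.hypot(ax, ay) * math.hypot(bx, by) + eps)
    return total / len(a)

# Illustrative values: tuple 1 is large and identical in both poems,
# tuple 2 is tiny and points in opposite directions.
poem_a = [(1.0, 0.0), (0.001, 0.0)]
poem_b = [(1.0, 0.0), (-0.001, 0.0)]

option_1 = flat_cosine(poem_a, poem_b)           # close to 1
option_2 = per_dimension_cosine(poem_a, poem_b)  # close to 0
print(option_1, option_2)
```

Under Option 1 the tiny disagreement is negligible; under Option 2 it cancels the large agreement. Which behavior is preferable depends on whether small-magnitude tuples carry signal or noise.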
Option 3: Euclidean Distance Per Dimension
function tuple_euclidean_distance(embedding_a, embedding_b)
  local total_distance = 0
  for i = 1, 768 do
    local ax, ay = embedding_a[i][1], embedding_a[i][2]
    local bx, by = embedding_b[i][1], embedding_b[i][2]
    -- 2D Euclidean distance in this dimension
    local dx = ax - bx
    local dy = ay - by
    local dim_distance = math.sqrt(dx * dx + dy * dy)
    total_distance = total_distance + dim_distance
  end
  return total_distance / 768 -- Average distance per dimension
end
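Option 3 is not the same as Euclidean distance on the flattened vector, because it averages per-tuple distances rather than taking one square root over everything. A small Python sketch with invented two-tuple embeddings (names are assumptions) shows two poems that are equally far from a reference under flat Euclidean distance but not under Option 3.

```python
import math

def per_dimension_distance(a, b):
    """Option 3: mean of the 2D Euclidean distances, one per tuple."""
    distances = [math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b)]
    return sum(distances) / len(distances)

def flat_euclidean_distance(a, b):
    """Euclidean distance over all tuple components at once."""
    squared = sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))
    return math.sqrt(squared)

# Illustrative values: the same total displacement, placed differently.
origin = [(0.0, 0.0), (0.0, 0.0)]
concentrated = [(3.0, 4.0), (0.0, 0.0)]  # all difference in one tuple
spread = [(3.0, 0.0), (0.0, 4.0)]        # difference split across tuples

print(flat_euclidean_distance(origin, concentrated), flat_euclidean_distance(origin, spread))
print(per_dimension_distance(origin, concentrated), per_dimension_distance(origin, spread))
```

The per-dimension metric penalizes differences spread over many tuples more than the same difference concentrated in one, so it can rank neighbors differently from the flat metric.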
Advantages of Tuple-Based Embeddings
1. Richer Feature Representation
- Each semantic feature can encode TWO aspects instead of one
- Example: "sadness" can have both intensity AND type
- More nuanced similarity: poems can be "similarly sad but differently sad"
2. Natural Multi-Aspect Encoding
- Separates complementary properties within each feature
- Could capture contradictions: high joy + high sadness simultaneously
- Allows for "orthogonal" semantic properties in same dimension
3. Polar/Angular Relationships
If dimensions encode magnitude + direction:
- Can find poems with "same energy, different direction"
- Can find poems with "same theme, different approach"
- Angular distance captures stylistic variation
4. Dimension-Specific Visualization
- Each of 768 dimensions can be plotted as scatter plot
- See how poems cluster in "Sadness Space" (dimension 42)
- Identify outliers in specific semantic dimensions
5. Flexible Similarity Metrics
- Can weight magnitude vs direction differently
- Can emphasize certain dimensions over others
- Can compute "structural similarity" (angles) separate from "intensity similarity" (magnitudes)
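As one possible reading of "weighting magnitude versus direction", each tuple can be converted to polar form and the two aspects scored separately. The Python sketch below is an assumption-laden illustration: the scoring formulas, the weights, and the function names are hypothetical choices, not an established metric.

```python
import math

def polar(x, y):
    """Return (magnitude, angle in radians) for a 2D tuple."""
    return math.hypot(x, y), math.atan2(y, x)

def weighted_tuple_similarity(a, b, w_magnitude=0.5, w_angle=0.5):
    """Mean over tuples of a weighted magnitude score and angle score."""
    total = 0.0
    for (ax, ay), (bx, by) in zip(a, b):
        ra, ta = polar(ax, ay)
        rb, tb = polar(bx, by)
        largest = max(ra, rb)
        # Magnitude score: 1 when equal, falling toward 0 as they diverge.
        magnitude_sim = 1.0 if largest == 0.0 else 1.0 - abs(ra - rb) / largest
        # Angle score: cosine of the angular gap; undefined angles score 0.
        angle_sim = 0.0 if ra == 0.0 or rb == 0.0 else math.cos(ta - tb)
        total += w_magnitude * magnitude_sim + w_angle * angle_sim
    return total / len(a)

# "Same energy, different direction": equal magnitude, 90 degrees apart.
poem_a = [(1.0, 0.0)]
poem_b = [(0.0, 1.0)]

intensity_only = weighted_tuple_similarity(poem_a, poem_b, w_magnitude=1.0, w_angle=0.0)
structure_only = weighted_tuple_similarity(poem_a, poem_b, w_magnitude=0.0, w_angle=1.0)
print(intensity_only, structure_only)
```

With all weight on magnitude the two poems look identical; with all weight on angle they look unrelated. Any real use would need the weights validated against human similarity judgments.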
Disadvantages and Challenges
1. Double Storage Requirements
Current: 7,797 poems × 768 floats × 4 bytes = 24 MB
Tuple: 7,797 poems × 1,536 floats × 4 bytes = 48 MB
Increase: 24 MB (100% growth)
For this project, storage is not an issue (48 MB is trivial). But at scale this matters.
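The figures above follow from simple arithmetic, reproduced here as a Python sketch. It assumes raw 32-bit floats; a JSON text encoding of the same numbers would be larger.

```python
POEMS = 7_797
DIMENSIONS = 768
BYTES_PER_FLOAT32 = 4  # assumption: binary float32, not JSON text

scalar_bytes = POEMS * DIMENSIONS * BYTES_PER_FLOAT32
tuple_bytes = POEMS * DIMENSIONS * 2 * BYTES_PER_FLOAT32

print(f"scalar:   {scalar_bytes / 1e6:.1f} MB")
print(f"tuple:    {tuple_bytes / 1e6:.1f} MB")
print(f"increase: {(tuple_bytes - scalar_bytes) / 1e6:.1f} MB")
```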
2. Ambiguous Semantic Interpretation
- What does the x-axis vs y-axis of dimension 347 mean?
- Neural networks would learn these automatically, but interpretation is harder
- May not align with human intuitions
3. No Existing Pre-Trained Models
- Current embedding models (Gemma, BERT, etc.) output scalar dimensions
- Would need to train custom model from scratch
- Or: artificially split existing dimensions into pairs (loses meaning)
4. Similarity Metric Uncertainty
- Which distance formula is "correct"?
- Different metrics give different results
- Need empirical testing to validate
5. Computational Cost
- 2× floating-point operations for distance calculations
- More complex similarity logic
- Potentially slower indexing/search
6. Harder to Integrate with Existing Tools
- Most vector databases expect flat scalar vectors
- Visualization tools assume one scalar value per dimension
- Would need custom infrastructure
How Would You Create Tuple Embeddings?
Option 1: Train Custom Model
Architecture: Modify transformer output layer to produce 2D vectors per dimension
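import torch.nn as nn

class TupleEmbeddingModel(nn.Module):
    def __init__(self, transformer, hidden_size, num_dims=768):
        super().__init__()
        # Assumption: `transformer` is any module that returns one pooled
        # vector of shape (batch, hidden_size) per input text
        self.transformer = transformer
        self.num_dims = num_dims
        # Output: 768 dimensions × 2 values each
        self.output_layer = nn.Linear(hidden_size, num_dims * 2)

    def forward(self, text):
        hidden = self.transformer(text)
        flat_output = self.output_layer(hidden)  # Shape: (batch, 1536)
        # Reshape to (batch, 768, 2)
        return flat_output.reshape(-1, self.num_dims, 2)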
Training: Contrastive learning with tuple-aware loss function
Pros: Learns meaningful 2D structure
Cons: Requires large dataset and compute
Option 2: Split Existing Embeddings
Naive Approach: Take 768D embedding, reshape to (384, 2)
embedding_768d = model.encode(poem) # [768] scalars
embedding_384tuples = embedding_768d.reshape(384, 2) # [(x,y)] × 384
Pros: Uses existing models
Cons:
- Reduces from 768 to 384 dimensions
- No semantic meaning to the pairing
- Arbitrary split (dimension 0 + 1 may not be related)
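The arbitrariness of the pairing can be made concrete. In the Python sketch below (toy 4-dimensional vectors, assumed function names), reordering the scalar dimensions leaves flat cosine similarity unchanged but changes the per-dimension score from Option 2, because different scalars end up sharing a tuple.

```python
import math

def flat_cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def per_dimension_cosine(u, v, eps=1e-8):
    """Option 2 on a naive reshape: scalars 2i and 2i+1 form tuple i."""
    pairs = len(u) // 2
    total = 0.0
    for i in range(pairs):
        ax, ay = u[2 * i], u[2 * i + 1]
        bx, by = v[2 * i], v[2 * i + 1]
        total += (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by) + eps)
    return total / pairs

def reorder(vector, order):
    """Apply the same permutation of scalar dimensions to a vector."""
    return [vector[i] for i in order]

# Toy embeddings and one alternative pairing (dims 0+2 and 1+3).
u = [1.0, 0.0, 0.0, 1.0]
v = [1.0, 0.0, 1.0, 0.0]
order = [0, 2, 1, 3]

flat_before = flat_cosine(u, v)
flat_after = flat_cosine(reorder(u, order), reorder(v, order))
paired_before = per_dimension_cosine(u, v)
paired_after = per_dimension_cosine(reorder(u, order), reorder(v, order))
print(flat_before, flat_after, paired_before, paired_after)
```

Any tuple-aware result obtained from a reshaped scalar embedding therefore reflects the chosen pairing as much as the poems themselves.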
Option 3: Learned Projection from Scalar to Tuple
Approach: Train adapter layer that converts scalar to tuple
# Start with existing 768D scalar embeddings
scalar_embedding = gemma_model.encode(poem)  # [768] scalars
# Train small network to project each scalar to (x, y)
tuple_embedding = []
for i, scalar in enumerate(scalar_embedding):
    x, y = projection_network(scalar, dimension_id=i)
    tuple_embedding.append([x, y])
Training: Use similarity-preserving loss (maintain relative distances)
Pros: Preserves all 768 dimensions, builds on existing models
Cons: Still requires training; unclear if useful
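To illustrate why "unclear if useful" is a real concern, consider a fixed, untrained projection that maps each scalar s in dimension i to (s·cos θ_i, s·sin θ_i). This hypothetical projection is not the learned adapter described above, but it preserves flat cosine similarity exactly, which shows that a similarity-preserving loss alone does not force the second coordinate to carry new information. A Python sketch with random toy data:

```python
import math
import random

def flat_cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def project_to_tuples(scalars, angles):
    """Hypothetical fixed projection: scalar s -> (s*cos(t), s*sin(t))."""
    return [(s * math.cos(t), s * math.sin(t)) for s, t in zip(scalars, angles)]

def flatten(tuples):
    return [value for pair in tuples for value in pair]

rng = random.Random(0)  # seeded toy data, 8 dimensions instead of 768
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(8)]
scalar_a = [rng.uniform(-1.0, 1.0) for _ in range(8)]
scalar_b = [rng.uniform(-1.0, 1.0) for _ in range(8)]

scalar_score = flat_cosine(scalar_a, scalar_b)
tuple_score = flat_cosine(
    flatten(project_to_tuples(scalar_a, angles)),
    flatten(project_to_tuples(scalar_b, angles)),
)
print(math.isclose(scalar_score, tuple_score, abs_tol=1e-12))
```

A learned projection would need an additional training signal, beyond preserving distances, for the tuple structure to encode anything the scalars do not.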
Option 4: Engineered Semantic Tuples
Manual Design: Create 768 hand-crafted 2D feature spaces
# Dimension 0: Emotional Valence
x = positive_emotion_score(poem) # 0-1
y = negative_emotion_score(poem) # 0-1
# Dimension 1: Formality
x = formal_language_ratio(poem) # 0-1
y = vocabulary_complexity(poem) # 0-1
# ... 766 more hand-crafted features
Pros: Fully interpretable
Cons: Requires enormous manual effort, likely worse than learned features
Use Cases Where Tuple Embeddings Excel
1. Multi-Aspect Similarity Search
Query: "Find poems that are similarly sad but expressed differently"
- Compare magnitude (intensity of sadness): must be similar
- Compare angles (expression of sadness): must be different
2. Feature Space Exploration
Visualize all poems in "Love Dimension Space":
x-axis: romantic love
y-axis: familial love
See which poems occupy which quadrants
3. Contradiction Detection
Dimension 42: (joy, sadness)
Poems with high values in BOTH coordinates are emotionally complex
Find poems with (0.9, 0.9) → "bittersweet" or "nostalgic"
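A minimal Python sketch of this lookup, assuming the tuple at the chosen dimension really does encode (joy, sadness); the poem ids, values, and threshold are invented for illustration:

```python
def find_contradictions(poems, dimension, threshold=0.8):
    """Return ids of poems whose tuple at `dimension` is high on both axes."""
    matches = []
    for poem in poems:
        x, y = poem["embedding"][dimension]
        if x >= threshold and y >= threshold:
            matches.append(poem["id"])
    return matches

# Hypothetical data: one tuple per poem, read as (joy, sadness).
poems = [
    {"id": "poem-a", "embedding": [(0.9, 0.9)]},
    {"id": "poem-b", "embedding": [(0.9, 0.1)]},
    {"id": "poem-c", "embedding": [(0.1, 0.5)]},
]

bittersweet = find_contradictions(poems, dimension=0)
print(bittersweet)
```

This only works when the axes have a known meaning, which in practice points to engineered tuples or a model trained with that structure in mind.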
4. Directional Similarity
Find poems "moving in the same semantic direction":
Calculate angular similarity between vectors
Group poems by trajectory in feature space
Use Cases Where Scalar Embeddings Are Better
1. Standard Similarity Ranking
- Scalar embeddings are simpler and well-understood
- Proven to work for semantic similarity
- No ambiguity in distance calculation
2. Integration with Existing Tools
- Vector databases (FAISS, Annoy, etc.) expect flat vectors
- Pre-trained models output scalars
- Established best practices
3. Interpretability of Results
- "These poems have cosine similarity 0.87" is clear
- With tuples: "Is that 0.87 in magnitude? Angle? Combined?"
4. Computational Efficiency
- Half the storage, half the computation
- Faster indexing and retrieval
Hybrid Approach: Structured Tuple Interpretation
Idea: Keep scalar embeddings computationally, but interpret them as tuples conceptually
-- Store as flat 1536-dimensional vector for computation
-- Layout: embedding = {x1, y1, x2, y2, x3, y3, ..., x768, y768}

-- Organize into 768 tuples for visualization/analysis
function get_dimension_tuple(embedding, dim_index)
  local i = (dim_index - 1) * 2
  return embedding[i + 1], embedding[i + 2]
end

-- Compute similarity on flat vector (standard cosine)
similarity = cosine_similarity_flat(embedding_a, embedding_b)

-- Visualize specific dimensions as 2D spaces
plot_dimension(poems, 42) -- Shows all poems in this 2D feature
Benefits:
- Compatible with existing infrastructure
- Can train as 1536D model
- Interpret as 768 tuples for analysis
- Best of both worlds
Practical Implications for This Project
Current System
- Model: embeddinggemma:latest (768D scalar)
- Size: 7,797 poems × 3 KB = 24 MB
- Similarity: Standard cosine similarity
If Switching to Tuple Embeddings
Storage Impact
Current: 24 MB embeddings
Tuple: 48 MB embeddings
Increase: +24 MB (still trivial for this project)
Generation Time Impact
Must re-embed all 7,797 poems with new model
Current embedding: ~1-2 seconds per poem
Total: ~2-4 hours for full re-generation
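The total follows from the per-poem estimate. A short Python sketch of the arithmetic, assuming poems are embedded one at a time with no batching or parallelism:

```python
POEMS = 7_797
SECONDS_PER_POEM = (1, 2)  # low and high estimates from above

# Sequential processing assumed: total time scales linearly with poem count.
low_hours, high_hours = (POEMS * seconds / 3600 for seconds in SECONDS_PER_POEM)
print(f"{low_hours:.1f} to {high_hours:.1f} hours")
```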
Similarity Calculation Changes
-- Current (flat-html-generator.lua:1156)
function calculate_similarity(poem_a, poem_b)
  return cosine_similarity(poem_a.embedding, poem_b.embedding)
end

-- With tuples (new version)
function calculate_similarity_tuple(poem_a, poem_b)
  -- Need to decide on metric (see distance options above)
  return tuple_aware_similarity(poem_a.embedding, poem_b.embedding)
end
Website Generation Impact
No change to HTML generation
Changes only affect:
1. Embedding generation (src/ollama-embedder.lua)
2. Similarity calculation (src/flat-html-generator.lua)
3. Possibly similarity matrix format
Experimental Path Forward
If you want to explore tuple embeddings:
Phase 1: Synthetic Test (1-2 hours)
- Take 100 poems from existing dataset
- Artificially create tuple embeddings (reshape 768 → 384×2)
- Implement multiple distance metrics
- Compare similarity rankings against scalar baseline
- See if any interpretation emerges
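One way to run the ranking comparison in Phase 1 is top-k overlap: for each query poem, take the k nearest neighbors under the scalar baseline and under a tuple metric, and measure how many they share. The Python sketch below uses seeded random vectors as stand-ins for real embeddings; the function names, the value of k, and the data are all assumptions.

```python
import math
import random

def flat_cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def per_dimension_cosine(u, v, eps=1e-8):
    """Option 2 on a naive reshape: scalars 2i and 2i+1 form tuple i."""
    pairs = len(u) // 2
    total = 0.0
    for i in range(pairs):
        ax, ay, bx, by = u[2 * i], u[2 * i + 1], v[2 * i], v[2 * i + 1]
        total += (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by) + eps)
    return total / pairs

def top_k(query_id, embeddings, similarity, k):
    """Ids of the k most similar poems to `query_id`, best first."""
    scores = [
        (similarity(embeddings[query_id], emb), pid)
        for pid, emb in embeddings.items()
        if pid != query_id
    ]
    scores.sort(reverse=True)
    return [pid for _, pid in scores[:k]]

def overlap_at_k(query_id, embeddings, metric_a, metric_b, k):
    """Fraction of top-k neighbors shared by two similarity metrics."""
    neighbors_a = set(top_k(query_id, embeddings, metric_a, k))
    neighbors_b = set(top_k(query_id, embeddings, metric_b, k))
    return len(neighbors_a & neighbors_b) / k

# Stand-in data: 20 random 8-dimensional "poems".
rng = random.Random(42)
embeddings = {f"poem-{i:04d}": [rng.uniform(-1, 1) for _ in range(8)] for i in range(20)}

baseline_check = overlap_at_k("poem-0000", embeddings, flat_cosine, flat_cosine, k=5)
tuple_overlap = overlap_at_k("poem-0000", embeddings, flat_cosine, per_dimension_cosine, k=5)
print(baseline_check, tuple_overlap)
```

Averaging `overlap_at_k` over all 100 sample poems gives a single number for how far a tuple metric moves the rankings; low overlap signals a different ranking, not necessarily a better one.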
Phase 2: Learned Projection (1-2 days)
- Train small network to project scalar → tuple
- Use similarity-preserving loss
- Re-compute similarities
- Measure quality (precision/recall on human similarity judgments)
Phase 3: Custom Model (1-2 weeks)
- Fine-tune transformer to output tuples directly
- Train on poetry similarity task
- Evaluate against scalar baseline
- Test interpretability of learned 2D spaces
Phase 4: Integration (2-3 days)
- Update embedding generator
- Update similarity calculator
- Regenerate all embeddings
- Rebuild website
Recommendations
For This Project (Neocities Poetry)
Recommendation: Keep scalar embeddings for now, but consider tuple structure for future experiments
Why:
- Proven Quality: Scalar embeddings work well for semantic similarity
- No Existing Models: Would need to train custom tuple model from scratch
- Unclear Benefit: No evidence tuple structure improves poetry similarity
- Integration Cost: Would require significant refactoring
- Interpretability: Scalar dimensions are already hard to interpret; tuples are harder
When Tuple Embeddings Make Sense:
- After gathering user feedback on similarity quality
- If you identify specific multi-aspect features (e.g., "sad but hopeful")
- If you want to build interactive dimension-specific exploration tools
- If you're willing to train a custom model
For Research/Exploration
Recommendation: Try synthetic tuple experiment (Phase 1 above)
Why:
- Low cost (1-2 hours)
- No model training required
- Can validate if tuple structure offers anything
- If promising → proceed to Phase 2
- If not → abandon with minimal cost
For Future Work
Potential Application: Dimension-specific visualization
Even with scalar embeddings, you could:
- Take pairs of dimensions (e.g., dim 42 & 43)
- Plot poems in 2D space defined by those dimensions
- Create 384 such plots (768 dimensions = 384 pairs)
- Find which dimension-pairs show interesting clusters
- Interpret what those dimensions represent
This gives you tuple-like visualization WITHOUT changing the embedding structure.
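Extracting the points for one such plot is a few lines of code. The Python sketch below assumes 0-based dimension indices and adjacent pairing, so dims 42 and 43 form pair 21; the helper names and sample data are illustrative, and the actual plotting is left to whatever tool is at hand.

```python
def pair_count(dimensions):
    """Number of disjoint adjacent pairs in an embedding."""
    return dimensions // 2

def dimension_pair_points(embeddings, pair_index):
    """One (x, y) point per poem, taken from dims 2k and 2k+1."""
    first = 2 * pair_index
    return [(emb[first], emb[first + 1]) for emb in embeddings]

# Toy 4-dimensional embeddings for two poems.
sample = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]

points = dimension_pair_points(sample, pair_index=1)
print(pair_count(768), points)
```

Adjacent pairing is only one choice; any two dimensions can be plotted against each other, which gives far more than 384 candidate plots to screen.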
Conclusion
Tuple-based embeddings (768 dimensions × 2D coordinates) represent a novel and largely unexplored embedding architecture. They offer:
Potential Advantages:
- Richer feature representation (2 aspects per dimension)
- Natural encoding of multi-faceted properties
- Dimension-specific 2D exploration
- Flexible similarity metrics
Significant Challenges:
- No existing pre-trained models
- Ambiguous semantic interpretation
- Unclear similarity metrics
- 2× storage requirements
- Limited proven benefits
For this project: The scalar 768D embeddings are sufficient for accurate poetry similarity. Tuple embeddings are an interesting research direction but not necessary for the current goals.
Recommendation: Keep current scalar embeddings. Consider exploring tuple structure if:
- You want to experiment with novel embedding architectures
- You identify specific multi-aspect features that would benefit from 2D representation
- You're willing to train custom models and validate results
The cost/benefit ratio currently favors scalar embeddings, but tuple embeddings remain an intriguing avenue for future research.
End of Report
This report argues that while tuple-based embeddings offer theoretical advantages for multi-aspect semantic representation, their practical benefits remain uncertain without empirical validation. The current scalar embedding system is well-suited for this project's needs.