docs/research-2d-embeddings.md
Research Report: Two-Dimensional Embeddings for Poetry Similarity
Date: 2026-01-09
Project: Neocities Modernization - Poetry Similarity System
Research Question: What would it mean to use 2D embeddings instead of high-dimensional embeddings?
Current System Analysis
Existing Configuration
The project currently uses:
- Model: embeddinggemma:latest (Google's Gemma embedding model)
- Dimensionality: 768 dimensions
- Total Poems: 7,797 poems embedded
- File Size: 62 MB for all embeddings
- Endpoint: Ollama local server
How Current Embeddings Work
Each poem is represented as a 768-dimensional vector of floating-point numbers:
[-0.13334751, 0.017504867, 0.00754194, -0.023819013, ...]
Semantic Meaning: Each dimension is a latent feature learned by the neural network. Individual dimensions are not labeled and rarely map to a single human concept, but together they encode properties such as:
- Emotional tone (sadness, joy, anger)
- Thematic content (nature, technology, relationships)
- Stylistic features (formal, casual, poetic devices)
- Grammatical structures
- Semantic topics
The high dimensionality allows the model to capture rich, nuanced semantic relationships between texts.
What Are 2D Embeddings?
Definition
Two-dimensional embeddings would represent each poem as just two numbers: [x, y]
Example:
Poem 1: [0.42, -0.13]
Poem 2: [0.15, 0.89]
Poem 3: [-0.67, 0.22]
These would represent the poem's position on a 2D plane that could be directly visualized as:
                  y
                  │   ● (Poem 2)
                  │
   ● (Poem 3)     │
 ─────────────────┼───────────────── x
                  │       ● (Poem 1)
                  │
How They're Created
2D embeddings are typically created through dimensionality reduction techniques:
- PCA (Principal Component Analysis)
- Linear projection that preserves maximum variance
- Fast, deterministic
- Loses non-linear relationships
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Non-linear, preserves local neighborhoods
- Good for visualization
- Slow, non-deterministic, loses global structure
- UMAP (Uniform Manifold Approximation and Projection)
- Often preserves more global structure than t-SNE while still keeping local neighborhoods
- Typically faster than t-SNE
- Usually the better choice when the layout should both look good and roughly reflect similarity
- Direct Training
- Train a model to output 2D vectors directly
- Requires specific architecture and loss function
- Rarely done due to severe information loss
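To make the simplest of these techniques concrete, here is a minimal sketch of PCA written with NumPy's SVD. It is an illustration, not the project's pipeline: the function names `fit_pca_2d` and `project_2d` are hypothetical, and the input is random data standing in for real embeddings.

```python
import numpy as np


def fit_pca_2d(vectors):
    """Return (mean, components) for a 2-component PCA fitted on `vectors`."""
    mean = vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:2]


def project_2d(vectors, mean, components):
    """Project vectors onto the two fitted principal directions."""
    return (vectors - mean) @ components.T


# Synthetic stand-in for real embeddings: 200 random 768-dimensional vectors.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 768))

mean, components = fit_pca_2d(fake_embeddings)
points_2d = project_2d(fake_embeddings, mean, components)
print(points_2d.shape)  # (200, 2)
```

Because the fitted `mean` and `components` define a fixed linear map, the same `project_2d` call can place vectors that were not part of the fit. The non-linear methods do not offer this as directly.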
Comparison: 768D vs 2D
Information Capacity
| Aspect | 768D (Current) | 2D (Proposed) |
|---|---|---|
| Expressiveness | Can encode 768 independent features | Can only encode 2 features |
| Semantic Nuance | Captures subtle thematic distinctions | Very coarse groupings only |
| Similarity Granularity | Fine-grained: "slightly similar" vs "very similar" | Coarse: reliable only for "close" vs "far" |
| Information Loss | Minimal (from original text) | Extreme (99.74% fewer dimensions) |
What Gets Lost in 2D
When compressing 768 dimensions to 2, you lose:
- Multi-faceted similarity: A poem can be similar in theme but different in tone
- Hierarchical relationships: Poems about "urban nature" vs "wilderness nature"
- Subtle distinctions: "melancholy" vs "nostalgic sadness"
- Cluster overlap: Poems that belong to multiple semantic categories
- Distance precision: With thousands of points packed into a plane, many pairs end up at nearly the same distance, so rankings based on 2D distance are unreliable
Visualization Example
With 768D, similarity relationships might look like (conceptually):
Poem A: [0.8 in "sadness", 0.9 in "urban", 0.3 in "length", ... 765 more]
Poem B: [0.9 in "sadness", 0.1 in "urban", 0.8 in "length", ... 765 more]
Similarity: 0.42 (moderately similar - both sad, different settings/length)
With 2D, this collapses to:
Poem A: [0.52, 0.31]
Poem B: [0.48, 0.71]
Distance: 0.40 (Euclidean distance in the plane, the only signal left)
What 2D Would Be Good At
Strengths
- Visualization
- Can display ALL poems on a single scatter plot
- Users see spatial relationships immediately
- Clusters are visually obvious
- Speed
- Distance calculations need roughly 384× fewer arithmetic operations (2 vs 768 dimensions)
- Sorting 7,797 poems by 2D distance is effectively instantaneous
- Memory usage: 99.74% reduction (2 floats vs 768 floats per poem)
- Interpretability
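- Axes might be human-understandable (e.g., "formal ← → casual" and "sad ← → joyful"), though UMAP and t-SNE axes carry no guaranteed meaning and would need to be labeled after inspecting the layout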
- Users can see their position in "poem space"
- Navigation becomes intuitive spatial movement
- Exploration Interface
- Could build an interactive map where users click regions
- Zoom in/out to explore density
- Draw paths through thematic regions
- Clustering
- Visual clustering is obvious
- Could color-code regions
- Easy to identify "islands" of similar content
Use Cases Where 2D Excels
- Browse mode: "Show me the general landscape of all poems"
- Quick exploration: "What's in this thematic region?"
- Overview visualization: "Where am I in the collection?"
- Teaching: "Here's how poems relate spatially"
What 2D Would Be Bad At
Weaknesses
- Accuracy Loss
- Similar poems in 768D might be far apart in 2D (false negatives)
- Different poems in 768D might be close in 2D (false positives)
- Loss of subtle semantic relationships
- Forced Simplification
- Complex relationships are reduced to a single question: close or not close
- Multi-dimensional similarity (theme + style + tone) becomes single distance
- Loses ability to be "similar in one way, different in another"
- Projection Artifacts
- Dimensionality reduction creates artificial boundaries
- "Unfolding" of high-D space creates distortions
- Some distances preserved, others completely wrong
- Poor Granularity
- With 7,797 poems, 2D space gets very crowded
- Many poems will have nearly identical 2D positions
- Hard to distinguish between "5th most similar" and "50th most similar"
- Non-Transferable
- 2D projection is specific to this dataset
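- New poems can't be added freely: t-SNE requires a full re-projection, while a fitted PCA or UMAP model can place new points only approximately until the layout is refit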
- Can't compare to embeddings from other sources
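One way to quantify the accuracy loss described above is to check how many of each poem's nearest neighbors in the original space are still its nearest neighbors after projection. The sketch below is an assumption-laden illustration: `knn_indices` and `neighbor_overlap` are hypothetical helpers, the data is random, and the "projection" simply keeps the first two coordinates. Distances are Euclidean, which gives the same ranking as cosine similarity only when the vectors are unit-normalized.

```python
import numpy as np


def knn_indices(vectors, k):
    """Indices of the k nearest neighbours of each row (Euclidean), excluding itself."""
    sq = (vectors ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * vectors @ vectors.T
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]


def neighbor_overlap(high, low, k=10):
    """Mean fraction of each point's k high-D neighbours that remain neighbours in low-D."""
    high_nn = knn_indices(high, k)
    low_nn = knn_indices(low, k)
    shared = [len(set(h) & set(l)) for h, l in zip(high_nn, low_nn)]
    return float(np.mean(shared)) / k


# Synthetic stand-in: 100 random 50-dimensional vectors and a crude 2D "projection".
rng = np.random.default_rng(0)
high = rng.normal(size=(100, 50))
low = high[:, :2]

score = neighbor_overlap(high, low, k=10)
print(f"fraction of neighbours preserved: {score:.2f}")
```

A score of 1.0 means the 2D layout keeps every neighbor list intact. Running the same check on a real UMAP or t-SNE layout would show how far the map can be trusted for ranking.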
Use Cases Where 2D Fails
- Precise similarity ranking: "Show me the top 100 most similar poems"
- Subtle distinction: "Find poems with similar theme but different tone"
- Cross-modal tasks: "Find poems matching this image/music"
- Dynamic updates: Adding many new poems eventually requires re-projecting the whole collection
Hybrid Approach: Best of Both Worlds
Recommended Architecture
Use both simultaneously:
- 768D for computation
- Keep for all similarity calculations
- Use for generating recommendation lists
- Maintain accuracy and nuance
- 2D for visualization
- Pre-compute using UMAP
- Store as separate fields: visual_x, visual_y
- Use only for display, not computation
Implementation
-- Data structure (Lua-style table; the 768 numbers are elided)
local poem = {
  poem_id = 42,
  embedding_768d = { --[[ ...768 numbers... ]] },  -- For computation
  visual_2d = { x = 0.42, y = -0.13 },             -- For display only
  color = "blue",                                  -- Based on 768D cluster
  title = "Autumn Memory",
}
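The `color` field is described as being based on a 768D cluster, but the report does not say how clusters are found. One possible approach, shown here purely as a sketch, is plain k-means on the high-dimensional vectors. The palette, the function names, and the deterministic farthest-first initialization are all assumptions made for this illustration.

```python
import numpy as np

PALETTE = ["blue", "green", "orange", "purple", "red", "teal"]  # hypothetical colour names


def squared_distances(vectors, centers):
    """Pairwise squared Euclidean distances, shape (n_vectors, n_centers)."""
    return (
        (vectors ** 2).sum(axis=1)[:, None]
        - 2.0 * vectors @ centers.T
        + (centers ** 2).sum(axis=1)[None, :]
    )


def kmeans_labels(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centers = [vectors[0]]
    while len(centers) < k:
        nearest = squared_distances(vectors, np.array(centers)).min(axis=1)
        centers.append(vectors[int(nearest.argmax())])
    centers = np.array(centers, dtype=float)
    for _ in range(iterations):
        labels = squared_distances(vectors, centers).argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels


def cluster_colors(vectors, k):
    """Map each vector's cluster label to a colour name from the palette."""
    return [PALETTE[label % len(PALETTE)] for label in kmeans_labels(vectors, k)]


# Two well-separated synthetic blobs standing in for two thematic groups.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.1, size=(20, 8))
blob_b = rng.normal(10.0, 0.1, size=(20, 8))
colors = cluster_colors(np.vstack([blob_a, blob_b]), k=2)
print(colors[0], colors[-1])  # blue green
```

The clustering runs on the full vectors, so the colors carry information the 2D positions alone would not.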
Benefits
- Accurate similarity: Use 768D for all recommendations
- Visual exploration: Show 2D scatter plot for browsing
- Color coding: Use 768D to determine semantic clusters, color the 2D viz
- Dual navigation: Click on map (2D) → get precise neighbors (768D)
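The dual-navigation idea can be sketched in two steps: use the 2D positions only to resolve which poem was clicked, then rank neighbors with the full vectors. The function names and the tiny arrays below are hypothetical stand-ins, not the project's actual data or API.

```python
import numpy as np


def nearest_on_map(click_xy, positions_2d):
    """Index of the poem whose 2D map position is closest to the click."""
    deltas = positions_2d - np.asarray(click_xy, dtype=float)
    return int(np.argmin((deltas ** 2).sum(axis=1)))


def precise_neighbors(index, embeddings, k=5):
    """Indices of the k poems most similar to `index` by cosine similarity."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ unit[index]
    scores[index] = -np.inf  # never recommend the poem itself
    return np.argsort(-scores)[:k].tolist()


# Three poems: display positions (2D) and toy 4-dimensional "full" embeddings.
positions_2d = np.array([[0.1, 0.1], [0.9, 0.9], [0.5, 0.5]])
embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.9, 0.1, 0.0],
])

clicked = nearest_on_map((0.85, 0.8), positions_2d)
print(clicked)                                     # 1
print(precise_neighbors(clicked, embeddings, k=2))  # [2, 0]
```

In this toy data, poem 0 is the farthest from poem 1 on the map and is also the least similar in the full space. On real data the two orderings can disagree, which is why the recommendation step ignores the 2D coordinates.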
Practical Example: This Project
Current System (768D Only)
Similarity Calculation:
Poem A: [768 dimensions]
Poem B: [768 dimensions]
Cosine similarity = dot(A, B) / (||A|| * ||B||)
Result: 0.87 (very similar)
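The formula translates directly into a few lines of NumPy. This is a minimal sketch with made-up 3-dimensional vectors in place of real 768-dimensional ones. The function name is illustrative and is not taken from the project's code.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Tiny 3-dimensional stand-ins for real poem vectors.
poem_a = [1.0, 2.0, 2.0]
poem_b = [2.0, 4.0, 4.0]   # same direction as poem_a, twice the length
poem_c = [2.0, -1.0, 0.0]  # orthogonal to poem_a

print(cosine_similarity(poem_a, poem_b))  # 1.0
print(cosine_similarity(poem_a, poem_c))  # 0.0
```

Cosine similarity ignores vector length, so `poem_a` and `poem_b` count as identical even though their magnitudes differ.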
UI: Text-only navigation
- "Similar poems" link → list of poem IDs
- No spatial awareness
- No overview of collection structure
With 2D Visualization Added
Homepage: Interactive scatter plot
┌──────────────────────────────┐
│ ● ●│ (Each dot = poem)
│ ● ●● ● ● │
│ ●●● ● ● ●● │ Color = theme
│ ● ● ●● ●● ● ●│
│ ● ● ● ● ●●● ● │ Click = navigate
│ ● ●● ● ● ● ●│
└──────────────────────────────┘
sad ← → joyful
Features:
- Click any poem → see 768D-accurate neighbors
- Hover → poem preview
- Zoom → explore dense regions
- Filter by time → animated timeline
- Search → highlight matching poems in space
Storage Impact
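Current: 7,797 poems × 768 floats × 4 bytes ≈ 24.0 MB (raw float32; the 62 MB file is larger, presumably because JSON stores numbers as text)
With 2D: 7,797 poems × 770 floats × 4 bytes ≈ 24.0 MB
Additional storage: ~62 KB as raw floats (negligible!)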
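The storage arithmetic is easy to check by hand or in a few lines of Python. The sketch assumes 4-byte float32 values and counts raw numbers only, so it says nothing about the size of a JSON text encoding.

```python
POEMS = 7_797
BYTES_PER_FLOAT32 = 4


def raw_bytes(dimensions):
    """Bytes needed to store one float32 vector of `dimensions` per poem."""
    return POEMS * dimensions * BYTES_PER_FLOAT32


current = raw_bytes(768)
extra = raw_bytes(2)

print(current)                   # 23952384 bytes, about 24.0 MB
print(extra)                     # 62376 bytes, about 62 KB
print(f"{extra / current:.2%}")  # 0.26%
```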
Recommendations for This Project
Option 1: Keep 768D Only (Current)
Best for: Text-based exploration, maximum accuracy
Trade-off: No visual overview
Option 2: Add 2D Visualization (Hybrid)
Best for: Visual exploration + accurate similarity
Trade-off: Slight complexity, one-time computation cost
Option 3: 2D Only (Not Recommended)
Best for: Small datasets (<100 items), purely visual browsing
Trade-off: Massive accuracy loss, poor for this use case
My Recommendation: Option 2 (Hybrid)
Why:
- Best of both worlds: accurate computation + visual exploration
- Minimal storage overhead (about 0.26% increase)
- Enables new interaction paradigms (spatial browsing)
- Helps users understand collection structure
- Can be computed once offline (not part of main pipeline)
Implementation Steps:
- Compute 2D projections using UMAP
- Add visual_x, visual_y fields to embeddings.json
- Generate interactive HTML scatter plot (using pure HTML canvas)
- Link 2D positions to 768D similarity lists
- Add to homepage as exploration interface
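The second step, adding the two fields to each record, might look like the sketch below. The layout of embeddings.json is not specified in this report, so the code assumes a list of per-poem dictionaries in the same order as the projected positions. `attach_positions` is a hypothetical helper name.

```python
import json


def attach_positions(records, positions):
    """Return copies of `records` with visual_x / visual_y added, matched by order."""
    if len(records) != len(positions):
        raise ValueError("need exactly one 2D position per record")
    return [
        {**record, "visual_x": float(x), "visual_y": float(y)}
        for record, (x, y) in zip(records, positions)
    ]


# Hypothetical two-record stand-in for the contents of embeddings.json.
records = [
    {"poem_id": 1, "embedding": [0.1, 0.2]},
    {"poem_id": 2, "embedding": [0.3, 0.4]},
]
positions = [(0.42, 0.13), (0.15, 0.89)]

updated = attach_positions(records, positions)
print(json.dumps(updated[0]))
```

Returning new dictionaries rather than mutating the input leaves the original records untouched, which makes the step safe to re-run after recomputing the projection.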
Conclusion
768D embeddings are essential for accurate semantic similarity - they capture the rich, multi-faceted nature of poetry meaning.
2D embeddings sacrifice accuracy for intuition - they make relationships visible and explorable but discard 99.74% of the dimensions, and with them most of the fine-grained semantic information.
The ideal solution: Use 2D as a visualization layer on top of 768D computation. Think of 2D as a "map" that helps you navigate, while 768D is the actual "distance calculator" that tells you how far apart things really are.
The extra storage cost is negligible (about 62 KB of raw float values), but the UX benefit is enormous: users can finally SEE the structure of the collection, not just navigate it blindly through text links.
Technical Notes
Generating 2D Projections
Using UMAP (recommended):
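import json

import numpy as np
import umap  # from the umap-learn package


def load_embeddings(path):
    # Assumed file layout: a JSON list of records, each with an "embedding" list of 768 floats
    with open(path, encoding="utf-8") as f:
        return np.array([rec["embedding"] for rec in json.load(f)], dtype=np.float32)


def save_2d_positions(positions, path="positions_2d.json"):
    # Hypothetical output file: one {"x": ..., "y": ...} object per poem, in input order
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"x": float(x), "y": float(y)} for x, y in positions], f)


# Load 768D embeddings
embeddings_768d = load_embeddings("embeddings.json")  # Shape: (7797, 768)
# Create UMAP projection
reducer = umap.UMAP(
    n_components=2,
    metric='cosine',
    n_neighbors=15,
    min_dist=0.1,
    random_state=42,  # fixed seed so the layout is reproducible
)
# Compute 2D positions
embeddings_2d = reducer.fit_transform(embeddings_768d)  # Shape: (7797, 2)
# Normalize each axis to the [0, 1] range for display
mins = embeddings_2d.min(axis=0)
spans = embeddings_2d.max(axis=0) - mins
normalized = (embeddings_2d - mins) / spans
# Save to JSON
save_2d_positions(normalized)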
Visualization Code
Pure HTML/Canvas (no JavaScript frameworks):
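<canvas id="poemMap" width="800" height="600"></canvas>
<script>
  const canvas = document.getElementById('poemMap');
  const ctx = canvas.getContext('2d');
  // Load poem positions. loadPoemData() is a placeholder the page must provide;
  // it should return [{x, y, title, color, id}, ...] with x and y in [0, 1].
  const poems = loadPoemData();
  // Draw each poem as a circle
  poems.forEach(poem => {
    ctx.fillStyle = poem.color;
    ctx.beginPath();
    ctx.arc(poem.x * canvas.width, poem.y * canvas.height, 3, 0, 2 * Math.PI);
    ctx.fill();
  });
  // Return the poem drawn closest to a normalized (x, y) position, or null if there are none
  function findNearestPoem(x, y) {
    let best = null;
    let bestDist = Infinity;
    for (const poem of poems) {
      // Compare in pixel space so the 800x600 aspect ratio does not skew distances
      const dx = (poem.x - x) * canvas.width;
      const dy = (poem.y - y) * canvas.height;
      const dist = dx * dx + dy * dy;
      if (dist < bestDist) {
        best = poem;
        bestDist = dist;
      }
    }
    return best;
  }
  // Click handler: jump to the similarity page of the nearest poem
  canvas.onclick = (e) => {
    const x = e.offsetX / canvas.clientWidth;
    const y = e.offsetY / canvas.clientHeight;
    const clicked = findNearestPoem(x, y);
    if (clicked) {
      window.location = `/similar/${clicked.id}.html`;
    }
  };
</script>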
End of Report
This research demonstrates that while 2D embeddings offer powerful visualization capabilities, they should complement rather than replace high-dimensional embeddings for semantic similarity tasks.