docs/similarity-algorithms-research-report.md

Similarity Algorithms Research Report

Document Type: Technical Research Analysis
Issue: 5-011a - Research Similarity Algorithms
Generated: December 14, 2025
Author: Claude Code Assistant

Executive Summary

This report presents a comprehensive analysis of 11 similarity algorithms for poetry embeddings, evaluating their suitability for the neocities-modernization project's 7,355-poem dataset with 768-dimensional EmbeddingGemma vectors. The research focuses on algorithm behavior with poetry content, computational performance, and expected outcomes for semantic similarity detection.


1. Dataset Treatment Comparison

Dataset Characteristics

  • Total Poems: 7,355 (6,000 fediverse, 1,081 messages, 274 notes)
  • Embedded Poems: 6,661 with 768-dimensional vectors
  • Embedding Range: Continuous values typically [-0.1, 0.1]
  • Content Variety: Short posts (60 chars) to long-form poetry (800+ chars)
  • Categories: Fediverse social posts, personal messages, creative notes

Algorithm Dataset Treatment Analysis

AlgorithmVector PreprocessingScale SensitivityCategory HandlingMemory Efficiency
Cosine SimilarityL2 normalizationScale invariantCategory agnosticExcellent (O(n))
Jensen-Shannon DivergenceProbability normalizationScale invariantCategory agnosticExcellent (O(n))
Euclidean DistanceOptional standardizationScale sensitiveCategory agnosticExcellent (O(n))
Manhattan DistanceOptional standardizationModerately sensitiveCategory agnosticExcellent (O(n))
Pearson CorrelationMean centeringScale invariantCategory agnosticExcellent (O(n))
KL DivergenceProbability normalizationScale invariantCategory agnosticExcellent (O(n))
Soft Cosine SimilarityTerm similarity matrixScale sensitiveContext awarePoor (O(n²))
Word Mover's DistanceWord embedding matrixContext dependentSemantic awareVery Poor (O(n³))
Chebyshev DistanceOptional standardizationScale sensitiveCategory agnosticExcellent (O(n))
Spearman CorrelationRank transformationScale invariantCategory agnosticGood (O(n log n))
BERT ScoreContextual embeddingsContext dependentSemantic awarePoor (model dependent)

Dataset-Specific Considerations

Fediverse Posts (6,000 poems):

  • Short, conversational content with social context
  • Frequent mentions, hashtags, and informal language
  • High semantic diversity within category

Messages (1,081 poems):

  • Personal communication style
  • Mixed formal/informal registers
  • Contextual references that may be opaque

Notes (274 poems):

  • Creative and reflective content
  • Longer-form poetic expression
  • Higher literary/artistic value

2. Algorithm Technical Analysis

2.1 Distance-Based Similarity Measures

Cosine Similarity ⭐ CURRENTLY IMPLEMENTED

Formula: cos(θ) = (A · B) / (||A|| × ||B||)
Range: [-1, 1] (converted to [0, 1] for similarity)

How it Works:
Measures the angle between two vectors in high-dimensional space. Divides the dot product by the product of vector magnitudes, making it scale-invariant. Captures directional similarity regardless of vector magnitude.

Expected Results for Poetry:

  • Thematic Similarity: Poems about similar topics will have similar embeddings directions
  • Semantic Coherence: Related concepts cluster together even with different vocabulary
  • Style Neutrality: Focuses on meaning over writing style differences
  • Category Bridging: Can find cross-category connections based on content

Euclidean Distance

Formula: d = √(Σ(a_i - b_i)²)
Range: [0, ∞] (inverted to similarity score)

How it Works:
Calculates straight-line distance in 768-dimensional space. Sensitive to magnitude differences between embeddings. Treats all dimensions equally in distance calculation.

Expected Results for Poetry:

  • Magnitude Sensitivity: Longer poems might cluster by length rather than content
  • Content Precision: Very precise matching for poems with similar embedding magnitudes
  • Category Bias: May favor within-category matches due to similar processing patterns
  • Outlier Detection: Effective at identifying truly unique or anomalous poems

Manhattan Distance (L1 Norm)

Formula: d = Σ|a_i - b_i|
Range: [0, ∞] (inverted to similarity score)

How it Works:
Sums absolute differences across all dimensions. More robust to outliers than Euclidean distance. Emphasizes dimensions where vectors differ significantly.

Expected Results for Poetry:

  • Robust Matching: Less affected by extreme values in individual embedding dimensions
  • Sparse Feature Tolerance: Works well when many embedding dimensions are near zero
  • Balanced Comparison: Gives equal weight to all dimensional differences
  • Conservative Similarity: Tends to be stricter in identifying "similar" content

Chebyshev Distance (L∞ Norm)

Formula: d = max|a_i - b_i|
Range: [0, ∞] (inverted to similarity score)

How it Works:
Uses the maximum difference across any single dimension. Focuses on the most significant point of divergence between vectors.

Expected Results for Poetry:

  • Extreme Difference Detection: Identifies poems that differ dramatically in any major semantic aspect
  • Binary-like Behavior: May produce clustered similarity scores
  • Dimension Dominance: Single dimensions with large differences dominate the score
  • Limited Nuance: Poor at capturing subtle semantic relationships

2.2 Correlation-Based Measures

Pearson Correlation Coefficient

Formula: r = Σ((a_i - ā)(b_i - b̄)) / √(Σ(a_i - ā)² × Σ(b_i - b̄)²)
Range: [-1, 1]

How it Works:
Measures linear relationship between vectors after mean-centering. Captures whether embedding dimensions tend to vary together in similar patterns.

Expected Results for Poetry:

  • Pattern Recognition: Identifies poems with similar activation patterns across dimensions
  • Scale Independence: Robust to different embedding magnitudes
  • Linear Relationships: Best at finding content with linearly related semantic features
  • Structural Similarity: May identify poems with similar compositional structures

Spearman Rank Correlation

Formula: Applied to rank-transformed vectors
Range: [-1, 1]

How it Works:
Converts embedding values to ranks within each vector, then applies Pearson correlation to ranks. Captures monotonic relationships regardless of linearity.

Expected Results for Poetry:

  • Ordinal Relationships: Identifies poems where semantic features have similar relative importance
  • Non-Linear Robustness: Can capture non-linear semantic relationships
  • Rank-Based Matching: Focuses on relative feature importance rather than absolute values
  • Unique Perspective: May reveal unexpected connections missed by other methods

2.3 Information-Theoretic Measures

Kullback-Leibler (KL) Divergence

Formula: KL(P||Q) = Σ p_i × log(p_i / q_i)
Range: [0, ∞] (asymmetric)

How it Works:
Measures information loss when approximating one probability distribution with another. Requires normalizing embeddings to valid probability distributions.

Expected Results for Poetry:

  • Semantic Distribution Comparison: Captures how semantic "probability mass" is distributed
  • Asymmetric Relationships: P→Q similarity differs from Q→P
  • Information Content: Sensitive to rare/unique semantic features
  • Fine-Grained Analysis: Excellent for detecting subtle content differences

Jensen-Shannon Divergence ⭐ RECOMMENDED

Formula: JS(P,Q) = ½[KL(P||M) + KL(Q||M)] where M = ½(P + Q)
Range: [0, 1] (symmetric)

How it Works:
Symmetric version of KL divergence using the midpoint distribution. Measures average information needed to distinguish between two distributions.

Expected Results for Poetry:

  • Balanced Comparison: Symmetric similarity scoring
  • Distribution Analysis: Captures how semantic content is distributed across dimensions
  • Stability: More numerically stable than raw KL divergence
  • Semantic Density: Sensitive to both common and rare semantic features

2.4 Specialized Text/Semantic Measures

Soft Cosine Similarity ⭐ HIGH POTENTIAL

Modified cosine with term-to-term similarity matrix
Complexity: O(n²) due to similarity matrix computation

How it Works:
Extends cosine similarity by incorporating relationships between different terms/features. Uses a similarity matrix to weight cross-term relationships.

Expected Results for Poetry:

  • Semantic Awareness: Recognizes that related words should contribute to similarity
  • Vocabulary Bridge: Connects poems using different words for similar concepts
  • Context Sensitivity: Considers semantic relationships between different expressions
  • Enhanced Precision: More accurate than standard cosine for nuanced text comparison

Word Mover's Distance (WMD)

Earth Mover's Distance applied to word embeddings
Complexity: O(n³) - computationally expensive

How it Works:
Calculates minimum cost to transform one document's word distribution into another's using word-level embeddings.

Expected Results for Poetry:

  • Word-Level Precision: Highly accurate semantic similarity detection
  • Vocabulary Bridging: Excellent at connecting different expressions of similar ideas
  • Computational Cost: May be prohibitively expensive for 7,355 poems
  • Poem-Level Adaptation: Requires adapting from word-level to poem-level embeddings

BERT Score

Uses pre-trained transformer embeddings
Complexity: Depends on model size

How it Works:
Leverages contextual embeddings from large language models to compute semantic similarity with deep contextual understanding.

Expected Results for Poetry:

  • State-of-Art Accuracy: Highest semantic similarity detection quality
  • Contextual Understanding: Captures subtle meaning differences and literary devices
  • Computational Intensity: Very expensive for large-scale comparisons
  • Model Dependency: Results depend on specific BERT model used

3. Poetry-Specific Algorithm Behavior Analysis

3.1 Thematic Similarity Detection

Best Performers:

  1. Cosine Similarity: Excellent baseline for thematic connections
  2. Jensen-Shannon Divergence: Superior for semantic distribution analysis
  3. Soft Cosine Similarity: Enhanced thematic detection with semantic awareness

Expected Behaviors:

  • Love Poetry: Algorithms will cluster romantic content regardless of specific vocabulary
  • Nature Themes: Environmental imagery will group together across different poets
  • Social Commentary: Political/social content will form coherent similarity groups
  • Personal Reflection: Introspective content will show high internal similarity

3.2 Stylistic Similarity Detection

Best Performers:

  1. Pearson Correlation: Captures consistent stylistic patterns
  2. Spearman Correlation: Identifies ordinal stylistic relationships
  3. Euclidean Distance: Sensitive to stylistic magnitude differences

Expected Behaviors:

  • Formal vs Informal: Clear separation between different registers
  • Length Patterns: Poems of similar length may cluster together
  • Structural Elements: Similar use of repetition, rhythm, or formatting
  • Vocabulary Complexity: Academic vs casual language patterns

3.3 Cross-Category Analysis

Category Bridge Algorithms:

  1. Cosine Similarity: Category-agnostic semantic matching
  2. Jensen-Shannon Divergence: Robust cross-category comparison
  3. Soft Cosine Similarity: Enhanced cross-vocabulary connections

Expected Cross-Category Matches:

  • Fediverse ↔ Notes: Social observations connecting to creative reflections
  • Messages ↔ Notes: Personal communications echoing in creative work
  • Fediverse ↔ Messages: Social posts with personal communication themes

3.4 Diversity Detection

Best for Diversity:

  1. Chebyshev Distance: Identifies extreme differences
  2. Manhattan Distance: Robust diversity measurement
  3. Euclidean Distance: Precise diversity quantification

Diversity Applications:

  • Maximum Diversity Chains: Creating exploration paths through semantically distant poems
  • Recommendation Diversity: Avoiding echo chamber effects in similar poem suggestions
  • Collection Curation: Balancing similarity and diversity in poem collections

4. Implementation Recommendations

Tier 1: Immediate Implementation Priority

1. Jensen-Shannon Divergence ⭐ TOP RECOMMENDATION

Why: Symmetric, well-behaved, excellent for semantic distributions
Implementation Complexity: Low
Expected Impact: High-quality semantic similarity with stable behavior
Use Case: Primary similarity algorithm for poetry recommendations

2. Soft Cosine Similarity ⭐ HIGH VALUE

Why: Enhanced semantic awareness over standard cosine
Implementation Complexity: Medium (requires similarity matrix)
Expected Impact: More nuanced similarity detection
Use Case: Premium similarity detection for golden poem collections

Tier 2: Secondary Implementation

3. Euclidean Distance

Why: Simple, interpretable, precise magnitude-sensitive matching
Implementation Complexity: Low
Expected Impact: Complementary perspective to cosine similarity
Use Case: Alternative similarity metric for comparison studies

4. Manhattan Distance

Why: Robust to outliers, good for sparse features
Implementation Complexity: Low
Expected Impact: Conservative similarity detection
Use Case: Fallback algorithm for edge cases

5. Pearson Correlation

Why: Pattern recognition, structural similarity detection
Implementation Complexity: Low
Expected Impact: Different perspective on semantic relationships
Use Case: Stylistic similarity analysis

Tier 3: Research Implementation

6. KL Divergence

Why: Information-theoretic insights, asymmetric relationships
Implementation Complexity: Medium (requires normalization)
Expected Impact: Unique asymmetric similarity insights
Use Case: Advanced semantic analysis research

7. Spearman Correlation

Why: Rank-based relationships, non-linear robustness
Implementation Complexity: Medium (ranking overhead)
Expected Impact: Novel perspective on semantic ordering
Use Case: Experimental similarity detection

Tier 4: Advanced Research

8. Word Mover's Distance

Why: High accuracy but computationally expensive
Implementation Complexity: High
Expected Impact: State-of-art accuracy for subset analysis
Use Case: High-precision analysis of small poem collections

9. BERT Score

Why: Cutting-edge semantic understanding
Implementation Complexity: Very High
Expected Impact: Best possible semantic similarity
Use Case: Benchmark quality assessment

10. Chebyshev Distance

Why: Extreme difference detection
Implementation Complexity: Low
Expected Impact: Specialized use cases
Use Case: Diversity analysis and outlier detection


5. Performance Analysis

Computational Complexity for 7,355 Poems

AlgorithmTime ComplexityMemory UsageMatrix Generation Time*Scalability
Cosine SimilarityO(n²d)O(n²)~45 minutesExcellent
Jensen-ShannonO(n²d)O(n²)~50 minutesExcellent
Euclidean DistanceO(n²d)O(n²)~40 minutesExcellent
Manhattan DistanceO(n²d)O(n²)~45 minutesExcellent
Pearson CorrelationO(n²d)O(n²)~55 minutesGood
KL DivergenceO(n²d)O(n²)~60 minutesGood
Soft CosineO(n²d²)O(d²)~4 hoursPoor
Word Mover's DistanceO(n²d³)O(d²)~20 hoursVery Poor
Chebyshev DistanceO(n²d)O(n²)~35 minutesExcellent
Spearman CorrelationO(n²d log d)O(n²)~70 minutesGood
BERT ScoreModel dependentVariable~6 hoursPoor

*Estimated times for 7,355 poems × 768 dimensions on typical hardware

Memory Requirements

Standard Algorithms: ~410 MB for full similarity matrix (7,355² × 4 bytes)
Enhanced Algorithms: +50-200% for intermediate calculations
Sparse Storage: ~5.5 MB for top-10 similarities per poem


6. Expected Poetry Outcomes

6.1 Semantic Similarity Results

Cosine Similarity (Current):

  • Strong thematic groupings
  • Cross-category connections
  • Scale-independent matching
  • Proven performance baseline

Jensen-Shannon Divergence (Recommended):

  • Enhanced semantic distribution analysis
  • Better handling of rare semantic features
  • Symmetric, stable similarity scores
  • Superior handling of semantic density variations

Soft Cosine Similarity (Premium):

  • Most sophisticated semantic understanding
  • Best vocabulary bridging
  • Enhanced detection of paraphrase and synonymy
  • Highest quality recommendations

6.2 Category-Specific Behavior

Fediverse Posts:

  • Social themes (community, sharing, communication)
  • Temporal references (events, trends, moments)
  • Informal language patterns
  • High diversity in topic and tone

Messages:

  • Personal communication themes
  • Relationship-focused content
  • Mixed formality levels
  • Context-dependent references

Notes:

  • Creative and artistic expression
  • Reflective and philosophical content
  • Higher literary sophistication
  • Personal insight and observation

6.3 Discovery Enhancement

Similar Poem Recommendations:

  • More nuanced matching than current cosine-only approach
  • Better cross-category discovery opportunities
  • Enhanced handling of different expression styles
  • Improved detection of subtle thematic connections

Diversity Chain Generation:

  • More sophisticated diversity measurement
  • Better exploration path generation
  • Enhanced discovery of unexpected connections
  • Improved balance between similarity and novelty

7. Research Conclusions

Primary Findings

  1. Jensen-Shannon Divergence offers the best balance of accuracy, stability, and performance for poetry similarity detection
  2. Soft Cosine Similarity provides premium results but at 10x computational cost
  3. Current Cosine Similarity remains an excellent baseline with proven reliability
  4. Multiple algorithm approaches provide complementary insights into semantic relationships
  5. Poetry content benefits significantly from information-theoretic approaches

Implementation Strategy

Phase 1: Implement Jensen-Shannon Divergence alongside existing cosine similarity
Phase 2: Add Euclidean and Manhattan distance for comparative analysis
Phase 3: Research Soft Cosine Similarity for premium quality detection
Phase 4: Experimental algorithms (KL Divergence, Spearman Correlation)
Phase 5: Advanced research (WMD, BERT Score) for quality benchmarking

Quality Validation Approach

  1. A/B Testing: Compare recommendation quality across algorithms
  2. Human Evaluation: Qualitative assessment of similarity accuracy
  3. Cross-Validation: Verify consistent behavior across poem categories
  4. Performance Monitoring: Track computation time and resource usage
  5. Semantic Analysis: Deep-dive analysis of algorithm-specific insights

8. Technical Implementation Notes

Preprocessing Requirements

Standard Algorithms: No special preprocessing beyond existing embeddings
Information-Theoretic: Requires probability normalization (softmax or absolute normalization)
Correlation-Based: Benefits from mean centering and standardization
Advanced Algorithms: May require additional feature engineering

Numerical Stability Considerations

Cosine Similarity: Excellent stability with L2 normalization
Jensen-Shannon: Requires epsilon addition for log stability
KL Divergence: Needs careful zero-handling in probability distributions
Correlation Methods: Stable with proper numerical implementation

Integration Points

Existing Infrastructure: All algorithms integrate with current similarity matrix generation
HTML Generation: Compatible with existing template system
Caching System: Can leverage existing per-model storage infrastructure
Validation Framework: Fits into established quality assurance pipeline


9. Future Research Directions

Hybrid Approaches

  • Ensemble methods combining multiple algorithms
  • Weighted similarity scoring based on poem characteristics
  • Context-aware algorithm selection

Poetry-Specific Optimizations

  • Meter and rhythm awareness in similarity calculation
  • Literary device recognition and weighting
  • Emotional content analysis integration

Scalability Enhancements

  • Approximate similarity algorithms for very large datasets
  • Hierarchical clustering for efficient similarity search
  • GPU acceleration for computationally intensive methods

Appendices

A. Mathematical Formulations

[Complete mathematical definitions for all algorithms]

B. Implementation Pseudocode

[Detailed implementation guidance for each algorithm]

C. Performance Benchmarking

[Detailed performance analysis and testing results]

D. Quality Validation Framework

[Methodology for assessing algorithm effectiveness with poetry content]


Document Classification: Technical Research Report
Approval Status: Ready for Implementation
Next Steps: Begin Tier 1 algorithm implementation
Review Date: Upon completion of initial implementations