docs/similarity-algorithms-research-report.md

Similarity Algorithms Research Report

Document Type: Technical Research Analysis
Issue: 5-011a - Research Similarity Algorithms
Generated: December 14, 2025
Author: Claude Code Assistant

Executive Summary

This report presents a comprehensive analysis of 11 similarity algorithms for poetry embeddings, evaluating their suitability for the neocities-modernization project's 7,355-poem dataset with 768-dimensional EmbeddingGemma vectors. The research focuses on algorithm behavior with poetry content, computational performance, and expected outcomes for semantic similarity detection.

1. Dataset Treatment Comparison

Dataset Characteristics

Total Poems: 7,355 (6,000 fediverse, 1,081 messages, 274 notes)
Embedded Poems: 6,661 with 768-dimensional vectors
Embedding Range: Continuous values, typically within [-0.1, 0.1]
Content Variety: Short posts (60 chars) to long-form poetry (800+ chars)
Categories: Fediverse social posts, personal messages, creative notes
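
As a rough illustration of what these figures imply, the arithmetic below estimates embedding coverage and storage size. It is a sketch under one assumption the report does not state: that each vector component is stored as a 32-bit float. The function names are hypothetical and chosen only for this example.

```python
# Figures taken from the dataset characteristics above.
TOTAL_POEMS = 7_355
EMBEDDED_POEMS = 6_661
DIMENSIONS = 768

# Assumption (not stated in the report): components are 32-bit floats.
BYTES_PER_VALUE = 4


def coverage_ratio(embedded, total):
    """Fraction of poems that have an embedding vector."""
    return embedded / total


def embedding_matrix_bytes(count, dims, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to hold every embedding vector in one dense matrix."""
    return count * dims * bytes_per_value


def pairwise_matrix_bytes(count, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed for a dense count x count similarity matrix."""
    return count * count * bytes_per_value


def to_mib(num_bytes):
    """Convert bytes to mebibytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    print(f"coverage: {coverage_ratio(EMBEDDED_POEMS, TOTAL_POEMS):.2%}")
    print(f"embeddings: {to_mib(embedding_matrix_bytes(EMBEDDED_POEMS, DIMENSIONS)):.1f} MiB")
    print(f"pairwise matrix: {to_mib(pairwise_matrix_bytes(EMBEDDED_POEMS)):.1f} MiB")
```

Under that assumption, the embeddings occupy roughly 19.5 MiB and a full pairwise similarity matrix roughly 169 MiB, so both fit comfortably in memory. 694 poems (about 9% of the dataset) have no embedding.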

Algorithm Dataset Treatment Analysis

Algorithm	Vector Preprocessing	Scale Sensitivity	Category Handling	Memory Efficiency
Cosine Similarity	L2 normalization	Scale invariant	Category agnostic	Excellent (O(n))
Jensen-Shannon Divergence	Probability normalization	Scale invariant	Category agnostic	Excellent (O(n))
Euclidean Distance	Optional standardization	Scale sensitive	Category agnostic	Excellent (O(n))
Manhattan Distance	Optional standardization	Moderately sensitive	Category agnostic	Excellent (O(n))
Pearson Correlation	Mean centering	Scale invariant	Category agnostic	Excellent (O(n))
KL Divergence	Probability normalization	Scale invariant	Category agnostic	Excellent (O(n))
Soft Cosine Similarity	Term similarity matrix	Scale invariant	Context aware	Poor (O(n²))
Word Mover's Distance	Word embedding matrix	Context dependent	Semantic aware	Very Poor (O(n³))
Chebyshev Distance	Optional standardization	Scale sensitive	Category agnostic	Excellent (O(n))
Spearman Correlation	Rank transformation	Scale invariant	Category agnostic	Good (O(n log n))
BERT Score	Contextual embeddings	Context dependent	Semantic aware	Poor (model dependent)
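
The Scale Sensitivity column can be checked empirically for the simpler measures. The sketch below uses only the standard library and small made-up vectors (not real poem embeddings). It scales one vector by 10 and compares each measure before and after. The vectors and helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


def euclidean(a, b):
    """L2 distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """L1 distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """L-infinity distance: the largest per-component difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# Illustrative vectors only; values chosen to resemble the stated range.
a = [0.02, -0.05, 0.08, 0.01]
b = [0.03, -0.04, 0.06, -0.02]
scaled_b = [10 * y for y in b]

if __name__ == "__main__":
    for name, fn in [("cosine", cosine), ("pearson", pearson),
                     ("euclidean", euclidean), ("manhattan", manhattan),
                     ("chebyshev", chebyshev)]:
        print(f"{name:10s} original={fn(a, b):.4f} scaled={fn(a, scaled_b):.4f}")
```

Cosine and Pearson give the same value for `b` and `scaled_b`, while the three distance measures grow with the scale factor. This is why the table lists standardization as an optional preprocessing step for those distances.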

Dataset-Specific Considerations

Fediverse Posts (6,000 poems):

Short, conversational content with social context
Frequent mentions, hashtags, and informal language
High semantic diversity within category

Messages (1,081 poems):

Personal communication style
Mixed formal/informal registers
Contextual references that may be opaque

Notes (274 poems):

Creative and reflective content
Longer-form poetic expression
Higher literary/artistic value

2. Algorithm Technical Analysis

2.1 Distance-Based Similarity Measures

Cosine Similarity ⭐ CURRENTLY IMPLEMENTED

Formula: cos(θ) = (A · B) / (||A|| × ||B||)
Range: [-1, 1] (converted to [0, 1] for similarity)

How it Works:
Cosine similarity measures the cosine of the angle between two vectors in high-dimensional space. The dot product is divided by the product of the vector magnitudes, which makes the measure scale-invariant: it captures directional similarity regardless of vector magnitude.
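
The formula can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual implementation. In particular, the report does not say how the [-1, 1] range is converted to [0, 1], so the linear rescaling in `to_unit_interval` is an assumption, and all function names here are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (the L2 normalization step)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||), computed via unit vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    unit_a = l2_normalize(a)
    unit_b = l2_normalize(b)
    # For unit vectors the dot product equals the cosine of the angle.
    return sum(x * y for x, y in zip(unit_a, unit_b))


def to_unit_interval(cos_value):
    """Map [-1, 1] to [0, 1].

    Assumption: a linear rescaling; the report does not specify the mapping.
    """
    return (cos_value + 1.0) / 2.0


if __name__ == "__main__":
    same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
    orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
    opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
    print(same_direction, orthogonal, opposite)
    print(to_unit_interval(opposite), to_unit_interval(same_direction))
```

Normalizing each vector once up front means every later comparison reduces to a plain dot product, which is convenient when the same poem is compared against many others.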