How the similarity actually works.
EMBEDDINGS
Every poem is run through a sentence-embedding model
(embeddinggemma-300m) that turns its text into a fixed-length list
of numbers -- a point in a high-dimensional space. Poems that mean
similar things land near each other. Vectors are stored at half
precision (FP16) to save space but computed at single precision (FP32).
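The storage half of that trade-off can be sketched with the standard library alone. This is a minimal illustration, assuming the vectors are packed as IEEE 754 half-precision values; the helper names and the sample vector are hypothetical and not taken from the build.

```python
import struct

def pack_fp16(vector):
    """Pack floats as little-endian half-precision values (2 bytes each)."""
    return struct.pack(f"<{len(vector)}e", *vector)

def unpack_fp16(blob):
    """Unpack half-precision bytes back into Python floats for arithmetic.

    Python floats are double precision, so this only illustrates the
    FP16 storage step, not FP32 arithmetic specifically.
    """
    count = len(blob) // 2
    return list(struct.unpack(f"<{count}e", blob))

vector = [0.12345678, -0.5, 0.25, 0.9991]  # made-up values, not a real embedding
blob = pack_fp16(vector)
restored = unpack_fp16(blob)

print(len(blob))   # 2 bytes per value, versus 4 per value at FP32
print(restored)    # close to the originals, with a small rounding error
```

Values that FP16 can represent exactly (such as -0.5 and 0.25) survive the round trip unchanged; the others come back slightly rounded.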
COSINE SIMILARITY
To compare two poems we measure the ANGLE between their vectors, not
the distance, and score it by the cosine of that angle. Two poems point
the same way (angle ~ 0, cosine ~ 1) when they are alike; they point
apart when they differ. 'similar' sorts by smallest angle; 'different'
sorts by largest -- maximum contrast, which is not the same as
'unrelated noise'.
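As a rough sketch of the comparison, assuming non-zero vectors held as plain Python lists (the function names and the toy two-dimensional vectors are illustrative, not the site's code):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes neither vector is all zeros

def rank(query, others, mode="similar"):
    """Return indices into `others`, ordered by angle to `query`."""
    scores = [cosine_similarity(query, v) for v in others]
    # Highest cosine means smallest angle; 'different' reverses the order.
    return sorted(range(len(others)),
                  key=lambda i: scores[i],
                  reverse=(mode == "similar"))

query = [1.0, 0.0]
others = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]  # toy vectors

print(rank(query, others, "similar"))    # nearest direction first
print(rank(query, others, "different"))  # widest angle first
```

Because only direction matters, scaling a vector up or down does not change its score.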
THE MOOD COLOR MAP
Poems are clustered into semantic 'moods' around centroids (average
points), and each mood gets a color. That is why the progress bars and
word colors carry meaning instead of being decorative.
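One way to picture the lookup is below: a minimal sketch that assumes each poem takes the color of its nearest centroid by cosine similarity. The mood names, colors, and vectors are invented for illustration, and the clustering step that produces the groups is not shown.

```python
import math

def centroid(vectors):
    """Average the vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical moods and members; real moods come out of the clustering.
mood_members = {
    "calm": [[0.9, 0.1], [0.8, 0.3]],
    "restless": [[0.1, 0.9], [0.2, 0.7]],
}
mood_colors = {"calm": "#4a90d9", "restless": "#d9534f"}  # made-up colors

centroids = {mood: centroid(vs) for mood, vs in mood_members.items()}

def mood_of(vector):
    """Pick the mood whose centroid points most nearly the same way."""
    return max(centroids, key=lambda mood: cosine(vector, centroids[mood]))

print(mood_colors[mood_of([0.7, 0.2])])
```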
DIVERSITY SEQUENCES
The 'different' ordering is precomputed so that CONSECUTIVE poems stay
maximally spread out -- a walk that keeps surprising you rather than
drifting into one corner of the space.
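The page does not say how the ordering is built. One plausible approach is a greedy walk that always jumps to the unvisited poem with the lowest cosine similarity to the current one. The sketch below assumes that strategy; the function name and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def diversity_walk(vectors, start=0):
    """Greedy ordering: each step picks the remaining vector least like the current one."""
    order = [start]
    remaining = set(range(len(vectors))) - {start}
    while remaining:
        current = vectors[order[-1]]
        # sorted() makes ties resolve to the lowest index, so the walk is repeatable.
        nxt = min(sorted(remaining), key=lambda i: cosine(current, vectors[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

vectors = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]]  # toy data
print(diversity_walk(vectors))
```

A greedy walk like this compares the current poem with every remaining one at each step, which is one reason to compute such an ordering ahead of time rather than on each page view.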
THE TRIANGULAR SIMILARITY MATRIX
Similarity is symmetric (A to B equals B to A), so only the upper
triangle of the matrix is stored -- about half the memory of a full
square -- and each pair is located with a little index arithmetic.
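The index arithmetic can look like the sketch below. It assumes the strict upper triangle (no diagonal, since a poem compared with itself always scores 1) laid out row by row in a flat list; the actual layout in the build may differ, and the function names are illustrative.

```python
def tri_size(n):
    """Number of stored pairs for n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def tri_index(i, j, n):
    """Flat position of the pair (i, j) in a row-major strict upper triangle."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i  # symmetry: (j, i) lives in the same slot as (i, j)
    # Rows 0..i-1 hold (n-1) + (n-2) + ... entries; then step along row i.
    return i * n - i * (i + 1) // 2 + (j - i - 1)

n = 4
flat = [0.0] * tri_size(n)
flat[tri_index(1, 3, n)] = 0.82      # store once (made-up score) ...
print(flat[tri_index(3, 1, n)])      # ... and read it back from either direction
```

For n items this stores n(n-1)/2 values instead of n squared.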
THE WORD CLOUD ON THE MENU
The menu's word cloud sizes each word by how often it shows up, on a
LOGARITHMIC scale. Word counts follow Zipf's law -- a handful of words
dominate and a long tail barely appears -- so a plain linear size would
flatten almost everything to the smallest tier; the log stretches the
small counts back out. Frequency decides SIZE, and nothing else.
WHERE each word lands is not computed at all. The words go out as one
plain run of links and the browser flows them left to right, wrapping
the way a paragraph does -- no coordinates, no spiral, no overlap test.
The 'cloud' impression comes from the mix of sizes, not from packing
shapes together.
Left in size order, the big words would all bunch at the front, so the
list is shuffled first with a Fisher-Yates pass: walk it from the last
word back to the first, and swap each word with a randomly chosen one at
or before it. A single sweep, and every possible ordering is equally
likely. The shuffle draws from the build's master seed (issue 10-058),
recorded in output/generation-metadata.json -- so the SAME seed always
produces the SAME arrangement, and the order only changes when the seed
does. The sizes and the colors stay fixed regardless of the seed.
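The shuffle described above can be sketched as follows. The seed value and word list are placeholders, and the random generator is Python's own, so this will not reproduce the build's actual arrangement.

```python
import random

def seeded_shuffle(words, seed):
    """Fisher-Yates shuffle driven by a seeded generator; returns a new list."""
    rng = random.Random(seed)
    shuffled = list(words)
    # Walk from the last position down to the second.
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive: the word may stay where it is
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled

words = ["moon", "river", "salt", "window", "ash"]  # placeholder words
print(seeded_shuffle(words, seed=12345))            # hypothetical seed
print(seeded_shuffle(words, seed=12345))            # same seed, same order
```

Allowing `j` to equal `i` is what keeps every ordering equally likely; excluding that case would give a different, biased algorithm.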
(Placing related words NEAR each other -- a layout driven by the same
embeddings the rest of this page describes -- is something the cloud
does not do today.)
THE SHAPE OF THIS CORPUS (counted at build time)
Poems per source (of 7904 total):
fediverse | ████████████████████████████████████████ 5977
messages  | ██████████······························ 1558
notes     | ██······································ 322
bluesky   | ········································ 47
Poem length (characters):
0-99      | ████████████████████████████████████████ 2621
100-249   | ████████████████████████████████········ 2119
250-499   | ██████████████████······················ 1185
500-999   | ███████████████························· 988
1000-1999 | ████████████···························· 808
2000+     | ███····································· 183
Poems per year:
2021 | █······································· 102
2022 | █······································· 90
2023 | █████··································· 528
2024 | ████████████████████████████████████████ 4625
2025 | ██████████████████······················ 2054
2026 | ████···································· 505
(Similarity-score distributions live in the precomputed matrix that
this page does not load; they are a planned addition -- see issue
11-004.)