How the Similarity Works

EMBEDDINGS
  Every poem is run through a sentence-embedding model
  (embeddinggemma-300m) that turns its text into a fixed-length list
  of numbers -- a point in a high-dimensional space. Poems that mean
  similar things land near each other. Vectors are stored at half
  precision (FP16) to save space and computed at full precision (FP32).

COSINE SIMILARITY
  To compare two poems we measure the ANGLE between their vectors, not
  the distance. Two poems point the same way (angle ~ 0) when they are
  alike; they point apart when they differ. 'similar' sorts smallest
  angle first; 'different' sorts largest angle first -- maximum contrast,
  which is not the same as 'unrelated noise'.
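
  A minimal sketch of that comparison, in plain Python. The function
  names (cosine_similarity, rank) are made up for illustration and are
  not taken from the build; the only thing assumed is that each poem is
  a list of floats of the same length.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query, others, mode="similar"):
    """Return indices into `others`, ordered by angle to `query`.

    A high cosine means a small angle, so 'similar' sorts descending
    and 'different' sorts ascending.
    """
    scores = [cosine_similarity(query, v) for v in others]
    descending = (mode == "similar")
    return sorted(range(len(others)), key=lambda i: scores[i], reverse=descending)
```

  The cosine runs from 1 (same direction) through 0 (right angle) to -1
  (opposite), so sorting on it is the same as sorting on the angle.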

THE MOOD COLOR MAP
  Poems are clustered into semantic 'moods' around centroids (average
  points), and each mood gets a color. That is why the progress bars and
  word colors carry meaning instead of being decorative.
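
  One way this could look, as a sketch only. It assumes a poem takes
  the mood of whichever centroid it points closest to; the page does not
  say how poems are assigned, and the names (centroid, nearest_mood,
  MOOD_COLORS) and the two hex colors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Average point of a group of vectors, one coordinate at a time."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest_mood(vector, centroids):
    """Index of the centroid with the highest cosine to `vector` (an assumption)."""
    return max(range(len(centroids)), key=lambda i: cosine(vector, centroids[i]))

# Hypothetical palette: one color per mood index.
MOOD_COLORS = ["#c0392b", "#2980b9"]
```

  With a lookup like this, anything drawn for a poem can take its color
  from MOOD_COLORS[nearest_mood(vector, centroids)].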

DIVERSITY SEQUENCES
  The 'different' ordering is precomputed so that CONSECUTIVE poems stay
  maximally spread out -- a walk that keeps surprising you rather than
  drifting into one corner of the space.
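
  The page does not say which algorithm builds that ordering. A greedy
  walk is one plausible sketch: from the current poem, always step to
  the unvisited poem with the LOWEST similarity. The function name
  diversity_walk and the full square `sim` table are assumptions made
  to keep the example short.

```python
def diversity_walk(sim, start=0):
    """Greedy ordering: each step goes to the least similar unvisited poem.

    `sim` is a square list of lists where sim[i][j] is the similarity
    between poem i and poem j. Ties are broken by the lower index so the
    result is deterministic.
    """
    order = [start]
    remaining = set(range(len(sim))) - {start}
    while remaining:
        current = order[-1]
        nxt = min(remaining, key=lambda j: (sim[current][j], j))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

  A greedy walk only looks one step ahead, so it is a heuristic rather
  than a guaranteed optimum over the whole sequence.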

THE TRIANGULAR SIMILARITY MATRIX
  Similarity is symmetric (A to B equals B to A), so only the upper
  triangle is stored -- about half the memory -- and addressed with a
  little index arithmetic instead of a full square.
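
  The index arithmetic can be sketched as follows. This version assumes
  the diagonal is left out (a poem compared with itself is always 1) and
  that rows are packed one after another into a flat list; the real
  layout may differ, and tri_index / tri_size are illustrative names.

```python
def tri_size(n):
    """Number of stored pairs for n poems: the strict upper triangle."""
    return n * (n - 1) // 2

def tri_index(i, j, n):
    """Flat position of the pair (i, j) in a row-packed upper triangle.

    Because similarity is symmetric, (i, j) and (j, i) map to the same
    slot. The diagonal is not stored in this sketch.
    """
    if i == j:
        raise ValueError("diagonal is not stored; self-similarity is 1")
    if i > j:
        i, j = j, i
    # Rows 0..i-1 hold (n-1) + (n-2) + ... + (n-i) entries in total.
    return i * n - i * (i + 1) // 2 + (j - i - 1)
```

  For the 7904 poems counted further down, that is 31,232,656 stored
  values instead of 62,473,216 for a full square.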

THE WORD CLOUD ON THE MENU
  The menu's word cloud sizes each word by how often it shows up, on a
  LOGARITHMIC scale. Word counts follow Zipf's law -- a handful of words
  dominate and a long tail barely appears -- so a plain linear size would
  flatten almost everything to the smallest tier; the log stretches the
  small counts back out. Frequency decides SIZE, and nothing else.
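
  To see why the log helps, here is a sketch that maps a count to a
  size tier both ways. The number of tiers and the exact formula are
  assumptions, as are the names size_tier and linear_tier; counts are
  assumed to be at least 1.

```python
import math

def size_tier(count, max_count, tiers=4):
    """Map a word count to a tier 1..tiers on a logarithmic scale."""
    if max_count <= 1:
        return 1
    # 0.0 for a count of 1, 1.0 for the most frequent word.
    frac = math.log(count) / math.log(max_count)
    return 1 + round(frac * (tiers - 1))

def linear_tier(count, max_count, tiers=4):
    """Same mapping on a plain linear scale, for comparison."""
    return 1 + round((count / max_count) * (tiers - 1))
```

  With a top count of 1000, the counts 1, 10, 100 and 1000 land on four
  different tiers under size_tier, while linear_tier puts the first
  three on the smallest tier.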

  WHERE each word lands is not computed at all. The words go out as one
  plain run of links and the browser flows them left to right, wrapping
  the way a paragraph does -- no coordinates, no spiral, no overlap test.
  The 'cloud' impression comes from the mix of sizes, not from packing
  shapes together.

  Left in size order the big words would all bunch at the front, so the
  list is shuffled first with a Fisher-Yates pass: walk it from the last
  word back to the first, and swap each word with a randomly chosen one at
  or before it. A single sweep, and every possible ordering is equally
  likely. The shuffle draws from the build's master seed (issue 10-058),
  recorded in output/generation-metadata.json -- so the SAME seed always
  produces the SAME arrangement, and the order only changes when the seed
  does. The sizes and the colors stay fixed regardless of the seed.
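
  The pass described above, sketched in Python. How the build turns its
  master seed into random numbers is not stated here, so the use of
  random.Random(seed) and the name seeded_shuffle are assumptions.

```python
import random

def seeded_shuffle(words, seed):
    """Fisher-Yates shuffle driven by a seed; returns a new list."""
    rng = random.Random(seed)   # same seed -> same stream of numbers
    items = list(words)         # leave the caller's list untouched
    # Walk from the last word back to the second.
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)   # inclusive: a position at or before i
        items[i], items[j] = items[j], items[i]
    return items
```

  Calling seeded_shuffle twice with the same words and the same seed
  gives the same order both times; only a different seed can change it.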

  (Placing related words NEAR each other -- a layout driven by the same
  embeddings the rest of this page describes -- is something the cloud
  does not do today.)

THE SHAPE OF THIS CORPUS (counted at build time)

  Poems per source (of 7904 total):
    fediverse            | ████████████████████████████████████████  5977
    messages             | ██████████······························  1558
    notes                | ██······································  322
    bluesky              | ········································  47

  Poem length (characters):
    0-99                 | ████████████████████████████████████████  2621
    100-249              | ████████████████████████████████········  2119
    250-499              | ██████████████████······················  1185
    500-999              | ███████████████·························  988
    1000-1999            | ████████████····························  808
    2000+                | ███·····································  183

  Poems per year:
    2021                 | █·······································  102
    2022                 | █·······································  90
    2023                 | █████···································  528
    2024                 | ████████████████████████████████████████  4625
    2025                 | ██████████████████······················  2054
    2026                 | ████····································  505

  (Similarity-score distributions live in the precomputed matrix that
   this page does not load; they are a planned addition -- see issue
   11-004.)
