libs/model-evaluator.info.md
model-evaluator.lua
Pure comparison + statistics for the embedding-model evaluation framework
(Issue 10-031). No IO, no model/server knowledge — it takes plain Lua tables
(vectors and text) and returns plain Lua tables, so the math is testable in
isolation and the generation/orchestration layer owns the messy parts.
Used by src/model-comparison.lua (the report step). See also
scripts/evaluate-embedding-models (the orchestrator) and Issue 10-031.
Functions
M.cosine(a, b) -> number
Cosine similarity of two equal-length vectors. Only ever compared within a
single model's space (dimensions differ across models), so absolute values are
per-model and only rankings are compared between models.
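A minimal sketch of the computation, for illustration only and not the module's actual source. The equal-length assertion and the zero-vector fallback to 0 are assumptions; the real function may handle those cases differently.

```lua
-- Hypothetical stand-in for M.cosine: dot(a, b) / (|a| * |b|).
local function cosine(a, b)
  assert(#a == #b, "vectors must have equal length")
  local dot, norm_a, norm_b = 0, 0, 0
  for i = 1, #a do
    dot = dot + a[i] * b[i]
    norm_a = norm_a + a[i] * a[i]
    norm_b = norm_b + b[i] * b[i]
  end
  -- Assumption: a zero vector scores 0 instead of dividing by zero.
  if norm_a == 0 or norm_b == 0 then return 0 end
  return dot / (math.sqrt(norm_a) * math.sqrt(norm_b))
end

print(cosine({ 1, 0 }, { 1, 0 }))  -- same direction
print(cosine({ 1, 0 }, { 0, 1 }))  -- orthogonal
```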
M.rank_anchor(anchor_vec, pool, exclude_index, top_k) -> list
Rank every vector in pool (map of poem_index -> vector) by cosine to
anchor_vec, nearest first, dropping exclude_index (the anchor itself).
Returns an array of { poem_index, score }, length min(top_k, #pool-1).
Deterministic tiebreak (poem_index ascending).
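The ranking described above could be sketched roughly as follows. This is an illustration, not the module's code: the named fields poem_index and score on each result entry are an assumption about the entry shape, and the local cosine helper is a stand-in for M.cosine.

```lua
-- Stand-in for M.cosine (assumes non-zero, equal-length vectors).
local function cosine(a, b)
  local dot, na, nb = 0, 0, 0
  for i = 1, #a do
    dot, na, nb = dot + a[i] * b[i], na + a[i] * a[i], nb + b[i] * b[i]
  end
  return dot / (math.sqrt(na) * math.sqrt(nb))
end

-- Hypothetical sketch of M.rank_anchor.
local function rank_anchor(anchor_vec, pool, exclude_index, top_k)
  local scored = {}
  for poem_index, vec in pairs(pool) do
    if poem_index ~= exclude_index then
      scored[#scored + 1] = { poem_index = poem_index, score = cosine(anchor_vec, vec) }
    end
  end
  -- Highest score first; equal scores fall back to ascending poem_index,
  -- so the result does not depend on pairs() iteration order.
  table.sort(scored, function(x, y)
    if x.score ~= y.score then return x.score > y.score end
    return x.poem_index < y.poem_index
  end)
  local out = {}
  for i = 1, math.min(top_k, #scored) do out[i] = scored[i] end
  return out
end

local pool = { [1] = { 1, 0 }, [2] = { 1, 0.1 }, [3] = { 0, 1 }, [4] = { 1, 0.1 } }
for _, entry in ipairs(rank_anchor(pool[1], pool, 1, 2)) do
  print(entry.poem_index, entry.score)
end
```

Poems 2 and 4 have identical vectors here, so the tiebreak is what puts 2 ahead of 4.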
M.topk_agreement(rank_a, rank_b, k) -> number
How many poems appear in BOTH rankings' top-k (set overlap, order ignored). The
headline "do the models even pick the same poems" number.
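As a rough illustration of the set overlap, here is a sketch that returns a raw count. It assumes each ranking entry carries a poem_index field and that no poem appears twice in one ranking; neither detail is confirmed by this document.

```lua
-- Hypothetical sketch of M.topk_agreement: size of the top-k intersection.
local function topk_agreement(rank_a, rank_b, k)
  local in_a = {}
  for i = 1, math.min(k, #rank_a) do
    in_a[rank_a[i].poem_index] = true
  end
  local shared = 0
  for i = 1, math.min(k, #rank_b) do
    if in_a[rank_b[i].poem_index] then shared = shared + 1 end
  end
  return shared
end

local a = { { poem_index = 1 }, { poem_index = 2 }, { poem_index = 3 } }
local b = { { poem_index = 3 }, { poem_index = 1 }, { poem_index = 9 } }
print(topk_agreement(a, b, 3))  -- poems 1 and 3 are shared
```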
M.kendall_tau(rank_a, rank_b) -> (number|nil, n)
Kendall's tau-b over the poems the two rankings share: +1 identical order, 0
unrelated, -1 reversed. Returns nil, n when fewer than two items are shared
(correlation undefined). O(n²) in the shared set — fine for top-N slices.
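A sketch of the pairwise count behind this, under stated assumptions: entries carry a poem_index field, rankings contain no duplicates, and list position is used as the rank. Positions cannot tie, so in this sketch tau-b coincides with the plain (concordant - discordant) / total pairs form; the real function may compare scores and handle ties explicitly.

```lua
-- Hypothetical sketch of M.kendall_tau over the shared poems.
local function kendall_tau(rank_a, rank_b)
  local pos_b = {}
  for i, entry in ipairs(rank_b) do pos_b[entry.poem_index] = i end

  -- For each shared poem, record (position in a, position in b).
  local shared = {}
  for i, entry in ipairs(rank_a) do
    local j = pos_b[entry.poem_index]
    if j then shared[#shared + 1] = { i, j } end
  end

  local n = #shared
  if n < 2 then return nil, n end  -- correlation undefined

  local concordant, discordant = 0, 0
  for i = 1, n - 1 do
    for j = i + 1, n do
      local da = shared[i][1] - shared[j][1]
      local db = shared[i][2] - shared[j][2]
      if da * db > 0 then
        concordant = concordant + 1
      else
        discordant = discordant + 1
      end
    end
  end
  return (concordant - discordant) / (n * (n - 1) / 2), n
end

local function ranking(ids)  -- helper for the example only
  local t = {}
  for i, id in ipairs(ids) do t[i] = { poem_index = id } end
  return t
end

print(kendall_tau(ranking({ 1, 2, 3 }), ranking({ 3, 2, 1 })))  -- reversed order
```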
M.lexical_jaccard(text_a, text_b) -> number
Jaccard overlap of the two poems' lowercased word sets, in [0,1]. High = they
literally share words (surface/structural kinship); low = a model called them
similar despite little shared vocabulary (so it rewards meaning/theme/tone, not
words). The crude "surface similarity" contrast to the neural models.
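A small sketch of the word-set overlap, not the module's implementation. The tokenizer is an assumption: a word is taken to be a run of alphanumerics or apostrophes, and %w only covers what the current C locale treats as alphanumeric, so accented characters may split words. Returning 0 for two empty texts is also an assumption.

```lua
-- Build the set of distinct lowercased words in a text.
local function word_set(text)
  local set, count = {}, 0
  for word in text:lower():gmatch("[%w']+") do
    if not set[word] then
      set[word] = true
      count = count + 1
    end
  end
  return set, count
end

-- Hypothetical sketch of M.lexical_jaccard: |A ∩ B| / |A ∪ B|.
local function lexical_jaccard(text_a, text_b)
  local set_a, n_a = word_set(text_a)
  local set_b, n_b = word_set(text_b)
  local inter = 0
  for word in pairs(set_a) do
    if set_b[word] then inter = inter + 1 end
  end
  local union = n_a + n_b - inter
  if union == 0 then return 0 end  -- assumption: two empty texts score 0
  return inter / union
end

-- Shares "the" and "cat" out of four distinct words overall.
print(lexical_jaccard("The cat sat", "the cat ran"))
```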
M.personality(anchor_text, anchor_len, ranked, texts, lengths, k) -> table
Turn a model's top-k matches for one anchor into interpretable signals:
- mean_jaccard: avg word-overlap of anchor with its top matches (surface lean)
- mean_len_ratio: avg min/max of word counts (near 1 = length bias)
- mean_score: avg cosine of the top matches (per-model neighbourhood tightness)
- n: how many matches contributed
Descriptive, not a verdict — the report shows them for a human to interpret.
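One way these signals might be aggregated, as a sketch under several assumptions that this document does not confirm: texts and lengths are maps keyed by poem_index, ranked entries carry poem_index and score fields, matches with missing text or length are skipped, and the local jaccard helper stands in for M.lexical_jaccard.

```lua
-- Stand-in for M.lexical_jaccard (ASCII-oriented word split).
local function jaccard(text_a, text_b)
  local set_a, set_b, inter, union = {}, {}, 0, 0
  for w in text_a:lower():gmatch("[%w']+") do set_a[w] = true end
  for w in text_b:lower():gmatch("[%w']+") do set_b[w] = true end
  for w in pairs(set_a) do
    union = union + 1
    if set_b[w] then inter = inter + 1 end
  end
  for w in pairs(set_b) do
    if not set_a[w] then union = union + 1 end
  end
  return union == 0 and 0 or inter / union
end

-- Hypothetical sketch of M.personality.
local function personality(anchor_text, anchor_len, ranked, texts, lengths, k)
  local sum_j, sum_ratio, sum_score, n = 0, 0, 0, 0
  for i = 1, math.min(k, #ranked) do
    local entry = ranked[i]
    local text, len = texts[entry.poem_index], lengths[entry.poem_index]
    if text and len and math.max(anchor_len, len) > 0 then
      sum_j = sum_j + jaccard(anchor_text, text)
      sum_ratio = sum_ratio + math.min(anchor_len, len) / math.max(anchor_len, len)
      sum_score = sum_score + entry.score
      n = n + 1
    end
  end
  if n == 0 then return { n = 0 } end  -- assumption: means omitted when empty
  return {
    mean_jaccard = sum_j / n,
    mean_len_ratio = sum_ratio / n,
    mean_score = sum_score / n,
    n = n,
  }
end

local ranked = { { poem_index = 2, score = 0.8 }, { poem_index = 3, score = 0.4 } }
local texts = { [2] = "the cat ran", [3] = "a dog barked loudly today now" }
local lengths = { [2] = 3, [3] = 6 }
local p = personality("the cat sat", 3, ranked, texts, lengths, 2)
print(p.mean_jaccard, p.mean_len_ratio, p.mean_score, p.n)
```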
M.mean(list) -> number|nil
Arithmetic mean of a numeric array, or nil if empty (so callers render "n/a"
rather than divide by zero).
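A short sketch of the nil-on-empty contract and how a caller might render it. The render helper and its "%.3f" format are hypothetical and only illustrate the "n/a" fallback.

```lua
-- Hypothetical sketch of M.mean.
local function mean(list)
  if #list == 0 then return nil end
  local sum = 0
  for i = 1, #list do sum = sum + list[i] end
  return sum / #list
end

-- Hypothetical caller-side rendering: nil becomes "n/a".
local function render(value)
  if value == nil then return "n/a" end
  return string.format("%.3f", value)
end

print(render(mean({ 1, 2, 3 })))
print(render(mean({})))
```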