issues/6-033-enhance-embedding-content-preprocessing.md

Issue 6-033: Enhance Embedding Content Preprocessing

Priority

High

Current Behavior

The extract_pure_poem_content() function (Issue 6-029) removes reply syntax but still allows several types of metadata and formatting artifacts to leak into embedding content:

Problem 1: Dashed Content Warnings

Content warnings like cannabis-mentioned, food-recipe-mentioned, and unreasonable-violence are hyphenated compounds that are probably rare in embedding model training data:

CW: cannabis-mentioned     → Embedding receives "cannabis-mentioned" (rare token)
CW: food-recipe-mentioned  → Embedding receives "food-recipe-mentioned" (rare token)
CW: politics-mentioned-radical-revolution-minus-the-revolt-please → Extremely rare!

The embedding model likely handles "cannabis mentioned" (with space) much better than "cannabis-mentioned" (hyphenated compound).

Problem 2: File Metadata Leaking

File paths and metadata appear in poem content:

 -> file: fediverse/1678.txt
file: /home/ritz/words/fediverse/2564.txt
 -> file: fediverse/0234.txt

Problem 3: Separator Lines Leaking

Box-drawing and dash separators appear in content:

--------------------------------------------------------------------------------
---

Problem 4: Multiple Poems Concatenated

Some content entries contain multiple poems merged together, each with its own CW: prefix:

CW: cooking-food-mentioned

Paprika is the best spice, fite me

 -> file: fediverse/1678.txt
--------------------------------------------------------------------------------
CW: re: cooking-food-mentioned

@trdebunked mmmm, paprika for flavor...
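
To see how many poems an entry like this actually holds, the content can be split at separator lines. The following is a minimal sketch for inspection only, not part of the proposed change; `split_poems` is a hypothetical helper name, and it assumes a separator is a line made of four or more dashes.

```lua
-- Hypothetical helper: split one content entry into poems at separator lines.
-- Assumes a separator is a line consisting only of 4+ dashes.
local function split_poems(content)
    local poems = {}
    local current = {}
    -- Append a newline so the last line is also terminated and matched.
    for line in (content .. "\n"):gmatch("([^\n]*)\n") do
        if line:match("^%-%-%-%-+%s*$") then
            -- Separator reached: close the poem collected so far.
            poems[#poems + 1] = table.concat(current, "\n")
            current = {}
        else
            current[#current + 1] = line
        end
    end
    poems[#poems + 1] = table.concat(current, "\n")
    return poems
end

local entry = "CW: cooking-food-mentioned\n\nPaprika is the best spice, fite me\n\n"
    .. " -> file: fediverse/1678.txt\n"
    .. string.rep("-", 80) .. "\n"
    .. "CW: re: cooking-food-mentioned\n\n@trdebunked mmmm, paprika for flavor..."

local poems = split_poems(entry)
print(#poems)  -- number of poems found in this entry
```

Only the first element would be used for the embedding; the rest belong to other poems.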

Intended Behavior

For Embeddings Only

Content sent to the embedding model should be:

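Dashes converted to spaces: cannabis-mentioned → cannabis mentioned (the CW: label is already stripped at this point, so the conversion applies to the whole text, not only the CW line)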
File metadata stripped: All -> file: and file: lines removed
Separator lines stripped: All ---- lines removed
Single poem isolation: Only the first poem if multiple are concatenated

For Display (unchanged)

The original content with dashes is preserved for display purposes.

Technical Approach

Enhanced `extract_pure_poem_content()` for Embeddings

-- {{{ function M.extract_pure_poem_content_for_embedding
function M.extract_pure_poem_content_for_embedding(processed_content)
    local content = M.extract_pure_poem_content(processed_content)

    -- Drop a leading separator line so it is not mistaken for a poem boundary
    content = content:gsub("^%s*%-%-%-%-+%s*", "")

    -- Prefix a newline so the line-anchored patterns below also match line 1
    content = "\n" .. content

    -- Stop at the first separator if multiple poems are concatenated
    -- (everything after a ---- line is likely a different poem).
    -- This must run before dash conversion, which would erase the separator.
    local first_separator = content:find("\n%-%-%-%-")
    if first_separator then
        content = content:sub(1, first_separator - 1)
    end

    -- Remove file path metadata lines (" -> file: ..." and "file: /...").
    -- The newline is kept so the neighbouring lines are not fused together.
    content = content:gsub("\n[ \t]*%->[ \t]*file:[^\n]*", "\n")
    content = content:gsub("\n[ \t]*file:[ \t]*/[^\n]*", "\n")

    -- Convert remaining dashes to spaces for better embedding tokenization
    -- ("cannabis mentioned" instead of the rare "cannabis-mentioned").
    -- This also blanks short separators such as "---".
    content = content:gsub("%-", " ")

    -- Collapse whitespace runs (including newlines) and trim both ends
    content = content:gsub("%s+", " "):gsub("^%s+", ""):gsub("%s+$", "")

    return content
end
-- }}}
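
The order of the steps matters in a function like this: the separator and file metadata patterns both rely on literal dashes, so they only match if they run before any blanket dash-to-space conversion. The snippet below is a small standalone check of that point; the sample string is made up for illustration.

```lua
-- Made-up sample: one poem, a file metadata line, a separator, a second poem.
local sample = "poem one\n -> file: x.txt\n----\npoem two"

-- While the dashes are intact, both patterns match.
assert(sample:find("\n%-%-%-%-") ~= nil)
assert(sample:find("%->%s*file:") ~= nil)

-- After converting every dash to a space, neither pattern matches any more,
-- so the metadata line and the second poem would stay in the output.
local dashed_first = sample:gsub("%-", " ")
assert(dashed_first:find("\n%-%-%-%-") == nil)
assert(dashed_first:find("%->%s*file:") == nil)

print("ordering check passed")
```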

Usage in Embedding Generation

Update similarity-engine.lua to use the enhanced function:

-- Before:
local poem_text = poem_extractor.extract_pure_poem_content(poem.content)

-- After:
local poem_text = poem_extractor.extract_pure_poem_content_for_embedding(poem.content)

Examples

Before Enhancement

Input:  "CW: cannabis-mentioned\n\nfor me my identities are..."
Output: "cannabis-mentioned\nfor me my identities are..."

After Enhancement

Input:  "CW: cannabis-mentioned\n\nfor me my identities are..."
Output: "cannabis mentioned for me my identities are..."

Concatenated Poems - Before

Input:  "CW: food\n\nPaprika is great\n\n -> file: 1678.txt\n----\nCW: re: food\n\n@user reply"
Output: "food\nPaprika is great\n\n -> file: 1678.txt\n----\nre: food\nreply"

Concatenated Poems - After

Input:  "CW: food\n\nPaprika is great\n\n -> file: 1678.txt\n----\nCW: re: food\n\n@user reply"
Output: "food Paprika is great"

Suggested Implementation Steps

Create new function extract_pure_poem_content_for_embedding() in poem-extractor.lua
Add dash-to-space conversion for better tokenization
Add file metadata removal patterns
Add separator detection to isolate single poems
Update similarity-engine.lua to use new function for embeddings
Test with sample poems that have dashed CWs and metadata
Document that display uses original content, embeddings use preprocessed content

Impact Assessment

Embedding Quality Improvements

Better tokenization: Common words separated by spaces tokenize predictably
Cleaner content: No file paths or separators contaminating semantic analysis
Single poems: Each embedding represents one discrete piece of content
Content warning relevance: CW topics contribute meaningfully to similarity

Backward Compatibility

Display unchanged: Original content preserved for HTML generation
New function: Doesn't modify existing extract_pure_poem_content() behavior
Embedding regeneration: Existing embeddings must be regenerated once this is implemented

Quality Assurance

Test Cases

local test_cases = {
    -- Dashed CW
    {"CW: cannabis-mentioned\n\ntest", "cannabis mentioned test"},
    -- File metadata
    {"content\n -> file: test.txt\nmore", "content more"},
    -- Separator
    {"poem1\n----\npoem2", "poem1"},
    -- Combined
    {"CW: food-recipe\n\ntext\n -> file: x.txt\n----\nCW: other\n\nmore", "food recipe text"}
}
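
The cases above can be exercised with a small runner. This is a sketch under stated assumptions: `extract_pure_poem_content` here is a hypothetical stand-in that only strips `CW:` labels (the real function from Issue 6-029 is not shown in this issue), `for_embedding` is a compact copy of the preprocessing steps, and `run_cases` is an illustrative helper name.

```lua
-- Hypothetical stand-in for the real extract_pure_poem_content():
-- it only strips "CW:" labels at line starts so the cases run in isolation.
local function extract_pure_poem_content(text)
    local stripped = ("\n" .. text):gsub("\nCW:[ \t]*", "\n")
    return stripped
end

-- Compact copy of the embedding preprocessing steps under test.
local function for_embedding(text)
    local content = "\n" .. extract_pure_poem_content(text)
    local first_separator = content:find("\n%-%-%-%-")  -- keep first poem only
    if first_separator then
        content = content:sub(1, first_separator - 1)
    end
    content = content:gsub("\n[ \t]*%->[ \t]*file:[^\n]*", "\n")  -- " -> file: ..."
    content = content:gsub("\n[ \t]*file:[ \t]*/[^\n]*", "\n")    -- "file: /path"
    content = content:gsub("%-", " ")                             -- dashes to spaces
    content = content:gsub("%s+", " "):gsub("^%s+", ""):gsub("%s+$", "")
    return content
end

-- Run {input, expected} pairs through fn and collect failure messages.
local function run_cases(fn, cases)
    local failures = {}
    for i, case in ipairs(cases) do
        local actual = fn(case[1])
        if actual ~= case[2] then
            failures[#failures + 1] = string.format(
                "case %d: expected %q, got %q", i, case[2], actual)
        end
    end
    return failures
end

local test_cases = {
    {"CW: cannabis-mentioned\n\ntest", "cannabis mentioned test"},
    {"content\n -> file: test.txt\nmore", "content more"},
    {"poem1\n----\npoem2", "poem1"},
    {"CW: food-recipe\n\ntext\n -> file: x.txt\n----\nCW: other\n\nmore", "food recipe text"},
}

local failures = run_cases(for_embedding, test_cases)
for _, message in ipairs(failures) do
    print(message)
end
print(#failures == 0 and "all cases passed" or (#failures .. " case(s) failed"))
```

Against the real module, `for_embedding` would be replaced by `poem_extractor.extract_pure_poem_content_for_embedding`, and the expected strings may need adjusting to whatever the real `extract_pure_poem_content()` returns.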

Implementation Progress

2026-01-21: Implemented

Changes made:

src/poem-extractor.lua (lines 642-682):

Added M.extract_pure_poem_content_for_embedding() function
Converts dashes to spaces for better tokenization
Strips file path metadata (-> file:, file: /path)
Strips separator lines (----)
Isolates first poem if multiple concatenated

src/similarity-engine.lua (lines 484-486):

Updated to use extract_pure_poem_content_for_embedding() instead of extract_pure_poem_content()
Comment added explaining the change

Pending:

[ ] Regenerate all embeddings to apply the enhanced preprocessing
[ ] Verify improved similarity quality after regeneration

Metadata

Status: ✅ Implementation Complete - Pending Embedding Regeneration
Created: 2026-01-21
Last Updated: 2026-01-21
Phase: 6 (Embedding Quality)
Blocked By: None
Blocks: Embedding regeneration for improved similarity
Related: 6-029 (reply syntax removal)
Estimated Effort: Low (straightforward string processing)