issues/6-033-enhance-embedding-content-preprocessing.md
Issue 6-033: Enhance Embedding Content Preprocessing
Priority
High
Current Behavior
The extract_pure_poem_content() function (Issue 6-029) removes reply syntax but still allows several types of metadata and formatting artifacts to leak into embedding content:
Problem 1: Dashed Content Warnings
Content warnings such as cannabis-mentioned, food-recipe-mentioned, and unreasonable-violence are hyphenated compounds that are probably rare in the embedding model's training data:
CW: cannabis-mentioned → Embedding receives "cannabis-mentioned" (rare token)
CW: food-recipe-mentioned → Embedding receives "food-recipe-mentioned" (rare token)
CW: politics-mentioned-radical-revolution-minus-the-revolt-please → Extremely rare!
The embedding model likely handles "cannabis mentioned" (with space) much better than "cannabis-mentioned" (hyphenated compound).
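As a minimal sketch of the normalization this problem calls for, the snippet below applies a dash-to-space substitution to the labels listed above. The helper name normalize_cw is hypothetical and does not exist in poem-extractor.lua. Note that a dash must be escaped as %- in a Lua pattern, because a bare - is a quantifier.

```lua
-- Hypothetical helper (not part of poem-extractor.lua): replace every dash
-- in a CW label with a space so each word reaches the tokenizer on its own.
local function normalize_cw(label)
    -- "%-" matches a literal dash; the outer parentheses drop gsub's
    -- second return value (the substitution count).
    return (label:gsub("%-", " "))
end

print(normalize_cw("cannabis-mentioned"))     --> cannabis mentioned
print(normalize_cw("food-recipe-mentioned"))  --> food recipe mentioned
```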
Problem 2: File Metadata Leaking
File paths and metadata are appearing in poem content:
-> file: fediverse/1678.txt
file: /home/ritz/words/fediverse/2564.txt
-> file: fediverse/0234.txt
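The two shapes above (an arrow form and a bare absolute-path form) can be recognized per line. The following is only a sketch: is_file_metadata is a hypothetical name, and anchoring the patterns to the start of the line is an assumption made here so that poem text that merely mentions "file:" mid-sentence is left alone.

```lua
-- Hypothetical helper: true when a line is file metadata rather than poem text.
local function is_file_metadata(line)
    -- Arrow form: optional indentation, "->", optional spaces, "file:"
    if line:match("^%s*%->%s*file:") then return true end
    -- Bare form: "file:" followed by an absolute path
    if line:match("^%s*file:%s*/") then return true end
    return false
end

print(is_file_metadata("-> file: fediverse/1678.txt"))               --> true
print(is_file_metadata("file: /home/ritz/words/fediverse/2564.txt")) --> true
print(is_file_metadata("Paprika is the best spice, fite me"))        --> false
```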
Problem 3: Separator Lines Leaking
Box-drawing and dash separators appear in content:
--------------------------------------------------------------------------------
---
Problem 4: Multiple Poems Concatenated
Some content entries contain multiple poems merged together, with their own CW: prefixes:
CW: cooking-food-mentioned
Paprika is the best spice, fite me
-> file: fediverse/1678.txt
--------------------------------------------------------------------------------
CW: re: cooking-food-mentioned
@trdebunked mmmm, paprika for flavor...
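To illustrate the isolation step on the entry above, here is a small sketch that keeps only the text before the first separator line. The name first_poem is hypothetical, and treating a run of four or more dashes at the start of a line as the boundary is an assumption taken from the samples in this issue.

```lua
-- Hypothetical helper: keep only the text before the first separator line
-- (a line beginning with four or more dashes).
local function first_poem(content)
    local sep_start = content:find("\n%-%-%-%-")
    if sep_start then
        return content:sub(1, sep_start - 1)
    end
    return content
end

-- The concatenated entry shown above, rebuilt line by line.
local entry = table.concat({
    "CW: cooking-food-mentioned",
    "Paprika is the best spice, fite me",
    "-> file: fediverse/1678.txt",
    string.rep("-", 80),
    "CW: re: cooking-food-mentioned",
    "@trdebunked mmmm, paprika for flavor...",
}, "\n")

-- Prints the first three lines only; the reply after the separator is dropped.
print(first_poem(entry))
```

The file metadata line still survives this step, so it has to be removed separately.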
Intended Behavior
For Embeddings Only
Content sent to the embedding model should be:
- Dashes in CWs converted to spaces: cannabis-mentioned → cannabis mentioned
- File metadata stripped: All -> file: and file: lines removed
- Separator lines stripped: All ---- lines removed
- Single poem isolation: Only the first poem if multiple are concatenated
For Display (unchanged)
The original content with dashes is preserved for display purposes.
Technical Approach
Enhanced extract_pure_poem_content() for Embeddings
-- {{{ function M.extract_pure_poem_content_for_embedding
function M.extract_pure_poem_content_for_embedding(processed_content)
    local content = M.extract_pure_poem_content(processed_content)

    -- Every dash-based pattern below must run BEFORE the dash-to-space
    -- conversion; otherwise "->" and "----" no longer exist to be matched.

    -- Drop a leading separator line (4+ dashes) so it is not treated as a poem boundary
    content = content:gsub("^%s*%-%-%-%-+[ \t]*\n", "")

    -- Stop at the first separator if multiple poems are concatenated
    -- (everything after a ---- line is likely a different poem)
    local first_separator = content:find("\n[ \t]*%-%-%-%-")
    if first_separator then
        content = content:sub(1, first_separator - 1)
    end

    -- Remove file path metadata; the newline is kept so adjacent lines do not fuse
    content = content:gsub("[ \t]*%->%s*file:[^\n]*", "")
    content = content:gsub("file:%s*/[^\n]*", "")

    -- Convert remaining dashes to spaces for better embedding tokenization
    -- ("cannabis mentioned" instead of the rare compound "cannabis-mentioned")
    content = content:gsub("%-", " ")

    -- Collapse whitespace runs (including newlines) and trim both ends
    content = content:gsub("%s+", " "):gsub("^%s+", ""):gsub("%s+$", "")
    return content
end
-- }}}
Usage in Embedding Generation
Update similarity-engine.lua to use the enhanced function:
-- Before:
local poem_text = poem_extractor.extract_pure_poem_content(poem.content)
-- After:
local poem_text = poem_extractor.extract_pure_poem_content_for_embedding(poem.content)
Examples
Before Enhancement
Input: "CW: cannabis-mentioned\n\nfor me my identities are..."
Output: "cannabis-mentioned\nfor me my identities are..."
After Enhancement
Input: "CW: cannabis-mentioned\n\nfor me my identities are..."
Output: "cannabis mentioned for me my identities are..."
Concatenated Poems - Before
Input: "CW: food\n\nPaprika is great\n\n -> file: 1678.txt\n----\nCW: re: food\n\n@user reply"
Output: "food\nPaprika is great\n\n -> file: 1678.txt\n----\nre: food\nreply"
Concatenated Poems - After
Input: "CW: food\n\nPaprika is great\n\n -> file: 1678.txt\n----\nCW: re: food\n\n@user reply"
Output: "food Paprika is great"
Suggested Implementation Steps
- Create new function extract_pure_poem_content_for_embedding() in poem-extractor.lua
- Add dash-to-space conversion for better tokenization
- Add file metadata removal patterns
- Add separator detection to isolate single poems
- Update similarity-engine.lua to use new function for embeddings
- Test with sample poems that have dashed CWs and metadata
- Document that display uses original content, embeddings use preprocessed content
Impact Assessment
Embedding Quality Improvements
- Better tokenization: Common words separated by spaces tokenize predictably
- Cleaner content: No file paths or separators contaminating semantic analysis
- Single poems: Each embedding represents one discrete piece of content
- Content warning relevance: CW topics contribute meaningfully to similarity
Backward Compatibility
- Display unchanged: Original content preserved for HTML generation
- New function: Doesn't modify existing extract_pure_poem_content() behavior
- Embedding regeneration: Will need to regenerate embeddings after implementing
Related Documents
- /issues/completed/6-029-remove-reply-syntax-from-embedding-content.md - Prior work on reply removal
- /src/poem-extractor.lua - Implementation location
- /src/similarity-engine.lua - Consumer of embedding content
Quality Assurance
Test Cases
-- Each entry is {input, expected_output}
local test_cases = {
    -- Dashed CW
    {"CW: cannabis-mentioned\n\ntest", "cannabis mentioned test"},
    -- File metadata
    {"content\n -> file: test.txt\nmore", "content more"},
    -- Separator
    {"poem1\n----\npoem2", "poem1"},
    -- Combined
    {"CW: food-recipe\n\ntext\n -> file: x.txt\n----\nCW: other\n\nmore", "food recipe text"}
}
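The table above lists input and expected-output pairs but no harness. The sketch below shows one way such a table could be driven. The name run_cases is hypothetical, and the demo uses a trivial stand-in function (collapse) so that it runs on its own; in the project the function under test would presumably be extract_pure_poem_content_for_embedding loaded from poem-extractor.lua, whose require path is not shown in this issue.

```lua
-- Hypothetical runner: call fn on each case's input and compare the result
-- with the expected output. Returns the number of failing cases.
local function run_cases(fn, cases)
    local failures = 0
    for i, case in ipairs(cases) do
        local input, expected = case[1], case[2]
        local actual = fn(input)
        if actual ~= expected then
            failures = failures + 1
            print(string.format("case %d FAILED: expected %q, got %q",
                                i, expected, tostring(actual)))
        end
    end
    return failures
end

-- Stand-in function for the demo only: collapses whitespace runs.
local function collapse(s)
    return (s:gsub("%s+", " "))
end

local demo_cases = {
    {"a\n\nb", "a b"},
    {"one   two", "one two"},
}
print(run_cases(collapse, demo_cases) .. " failure(s)")
```

Returning a failure count rather than raising on the first mismatch lets a single run report every failing case.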
Implementation Progress
2026-01-21: Implemented
Changes made:
src/poem-extractor.lua (lines 642-682):
- Added M.extract_pure_poem_content_for_embedding() function
- Converts dashes to spaces for better tokenization
- Strips file path metadata (-> file:, file: /path)
- Strips separator lines (----)
- Isolates first poem if multiple concatenated
src/similarity-engine.lua (lines 484-486):
- Updated to use extract_pure_poem_content_for_embedding() instead of extract_pure_poem_content()
- Comment added explaining the change
Pending:
- [ ] Regenerate all embeddings to apply the enhanced preprocessing
- [ ] Verify improved similarity quality after regeneration
Metadata
- Status: ✅ Implementation Complete - Pending Embedding Regeneration
- Created: 2026-01-21
- Last Updated: 2026-01-21
- Phase: 6 (Embedding Quality)
- Blocked By: None
- Blocks: Embedding regeneration for improved similarity
- Related: 6-029 (reply syntax removal)
- Estimated Effort: Low (straightforward string processing)