issues/completed/2-008-implement-per-model-embedding-storage.md
Issue 008: Implement Per-Model Embedding Storage
Current Behavior
- All embeddings stored in a single assets/embeddings.json file regardless of model
- No isolation between different embedding models (EmbeddingGemma, text-embedding-ada-002, etc.)
- Model changes require manual cache management or complete regeneration
- Risk of mixing embeddings from different models in similarity calculations
Intended Behavior
- Separate storage directories/files for each embedding model
- Automatic model detection and appropriate cache selection
- Seamless switching between different embedding models
- Model-specific similarity matrices and results isolation
Suggested Implementation Steps
- Directory Structure: Create model-specific storage hierarchy
- Model Detection: Automatic model identification and cache routing
- File Path Generation: Dynamic path creation based on model name
- Backward Compatibility: Handle existing cache migration
- Configuration: Model-specific settings and parameters
Technical Requirements
Storage Directory Structure
assets/
├── embeddings/
│   ├── EmbeddingGemma_latest/
│   │   ├── embeddings.json
│   │   ├── similarity_matrix.json
│   │   └── metadata.json
│   ├── text-embedding-ada-002/
│   │   ├── embeddings.json
│   │   ├── similarity_matrix.json
│   │   └── metadata.json
│   └── all-MiniLM-L6-v2/
│       ├── embeddings.json
│       ├── similarity_matrix.json
│       └── metadata.json
└── poems.json
Model Path Generation
-- {{{ local function get_model_storage_path
local function get_model_storage_path(base_dir, model_name)
    -- Sanitize model name for filesystem use (e.g. ":" becomes "_")
    local safe_model_name = model_name:gsub("[^%w%-_.]", "_")
    local model_dir = base_dir .. "/embeddings/" .. safe_model_name
    -- Create directory if it doesn't exist; quote the path so spaces in base_dir are safe
    os.execute('mkdir -p "' .. model_dir .. '"')
    return {
        embeddings = model_dir .. "/embeddings.json",
        similarity_matrix = model_dir .. "/similarity_matrix.json",
        metadata = model_dir .. "/metadata.json"
    }
end
-- }}}
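The sanitization rule is what turns a model name such as EmbeddingGemma:latest into a directory name. As a minimal sketch, the rule can be pulled into a helper and tried on a few names. `sanitize_model_name` is a hypothetical name, not part of the ticket, and the rejection of dot-only names is an added assumption, since the pattern keeps "." and a name like ".." would otherwise resolve to the parent directory.

```lua
-- Hypothetical helper: the same sanitization rule as get_model_storage_path,
-- split out so it can be exercised without creating directories.
local function sanitize_model_name(model_name)
    -- Parentheses drop gsub's second return value (the substitution count)
    local safe = (model_name:gsub("[^%w%-_.]", "_"))
    -- Assumption: names made only of dots ("." or "..") are rejected so the
    -- result can never point at the current or parent directory
    if safe:match("^%.+$") then
        return nil, "invalid model name: " .. model_name
    end
    return safe
end

print(sanitize_model_name("EmbeddingGemma:latest"))   --> EmbeddingGemma_latest
print(sanitize_model_name("text-embedding-ada-002"))  --> text-embedding-ada-002
print(sanitize_model_name("org/model name"))          --> org_model_name
```

Note that two distinct names can sanitize to the same directory (for example "a:b" and "a_b"), so recording the original model name in metadata.json may be worth considering.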
Enhanced Configuration
local embedding_models = {
    ["EmbeddingGemma:latest"] = {
        dimensions = 768,
        endpoint_path = "/api/embed",
        timeout = 30
    },
    ["text-embedding-ada-002"] = {
        dimensions = 1536,
        endpoint_path = "/v1/embeddings",
        timeout = 60
    },
    ["all-MiniLM-L6-v2"] = {
        dimensions = 384,
        endpoint_path = "/api/embed",
        timeout = 20
    }
}
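The dimensions field in the configuration above lends itself to a validation check before an embedding is written to a model's cache. The following is a minimal sketch under stated assumptions: `validate_embedding` is a hypothetical helper, embeddings are assumed to be plain Lua arrays of numbers, and the table is a trimmed copy of the configuration so the example runs on its own.

```lua
-- Trimmed copy of the configuration table, for a self-contained example
local embedding_models = {
    ["EmbeddingGemma:latest"] = { dimensions = 768 },
    ["all-MiniLM-L6-v2"] = { dimensions = 384 },
}

-- Hypothetical helper: returns true, or false plus a reason
local function validate_embedding(model_name, vector)
    local config = embedding_models[model_name]
    if not config then
        return false, "unknown embedding model: " .. tostring(model_name)
    end
    if type(vector) ~= "table" then
        return false, "embedding is not a table"
    end
    if #vector ~= config.dimensions then
        return false, string.format("expected %d dimensions for %s, got %d",
            config.dimensions, model_name, #vector)
    end
    return true
end

-- Usage: a 384-element vector is valid for all-MiniLM-L6-v2 only
local vector = {}
for i = 1, 384 do vector[i] = 0.0 end
print(validate_embedding("all-MiniLM-L6-v2", vector))       -- passes
print(validate_embedding("EmbeddingGemma:latest", vector))  -- fails with a reason
```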
Automatic Model Detection
function M.generate_all_embeddings(poems_file, base_output_dir, endpoint, incremental, model_name)
    model_name = model_name or "EmbeddingGemma:latest"
    -- Get model-specific configuration
    local model_config = embedding_models[model_name]
    if not model_config then
        utils.log_error("Unknown embedding model: " .. model_name)
        return false
    end
    -- Generate model-specific file paths
    local storage_paths = get_model_storage_path(base_output_dir, model_name)
    local embeddings_file = storage_paths.embeddings
    utils.log_info("Using embedding model: " .. model_name)
    utils.log_info("Storage location: " .. embeddings_file)
    utils.log_info("Expected dimensions: " .. model_config.dimensions)
    -- ... (embedding generation continues; omitted in this excerpt)
end
User Experience Improvements
Enhanced Command-Line Interface
# Bash script options
--model MODEL_NAME # Specify embedding model (default: EmbeddingGemma:latest)
--list-models # Show available models and their configurations
--model-status # Show cache status for all models
# Usage examples
./generate-embeddings.sh --model EmbeddingGemma:latest
./generate-embeddings.sh --model text-embedding-ada-002
./generate-embeddings.sh --list-models
Model Status Reporting
Available Embedding Models:
EmbeddingGemma:latest (768 dims) - 1,274 cached embeddings (18.6%)
text-embedding-ada-002 (1536 dims) - No cache found
all-MiniLM-L6-v2 (384 dims) - 6,860 cached embeddings (100%)
Currently using: EmbeddingGemma:latest
Cache location: /assets/embeddings/EmbeddingGemma_latest/embeddings.json
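The per-model lines of this report could be produced along these lines. This is a sketch only: `format_count` and `status_line` are hypothetical helpers, and the total of 6,860 is inferred from the sample report, where all-MiniLM-L6-v2 shows 6,860 cached embeddings at 100%.

```lua
-- Hypothetical helper: insert thousands separators, e.g. 6860 -> "6,860"
local function format_count(n)
    local s = tostring(math.floor(n))
    local replaced
    repeat
        -- Each pass splits off the last ungrouped three digits
        s, replaced = s:gsub("^(%d+)(%d%d%d)", "%1,%2")
    until replaced == 0
    return s
end

-- Hypothetical helper: one status line per model
local function status_line(model_name, dimensions, cached, total)
    if cached == 0 or total == 0 then
        return string.format("%s (%d dims) - No cache found", model_name, dimensions)
    end
    -- One decimal place, with a trailing ".0" dropped so 100.0 prints as 100
    local percent = string.format("%.1f", cached / total * 100):gsub("%.0$", "")
    return string.format("%s (%d dims) - %s cached embeddings (%s%%)",
        model_name, dimensions, format_count(cached), percent)
end

print(status_line("EmbeddingGemma:latest", 768, 1274, 6860))
print(status_line("text-embedding-ada-002", 1536, 0, 6860))
print(status_line("all-MiniLM-L6-v2", 384, 6860, 6860))
```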
Backward Compatibility Migration
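-- {{{ function migrate_legacy_cache
function migrate_legacy_cache(legacy_file, target_model_dir)
    if not utils.file_exists(legacy_file) then
        return true -- nothing to migrate
    end
    utils.log_info("Migrating legacy cache to model-specific storage...")
    -- Read before renaming so a failed read leaves the legacy file untouched
    local legacy_data = utils.read_json_file(legacy_file)
    if not legacy_data then
        utils.log_error("Could not read legacy cache: " .. legacy_file)
        return false
    end
    utils.write_json_file(target_model_dir .. "/embeddings.json", legacy_data)
    -- Keep the original as a backup rather than deleting it
    local ok, err = os.rename(legacy_file, legacy_file .. ".legacy_backup")
    if not ok then
        utils.log_error("Could not back up legacy cache: " .. tostring(err))
        return false
    end
    utils.log_info("Legacy cache migrated successfully")
    return true
end
-- }}}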
Quality Assurance Criteria
- Different models store embeddings in separate, isolated locations
- Model switching doesn't corrupt or mix embedding data
- Similarity calculations use only embeddings from the same model
- Legacy cache migration preserves existing work
- Clear model identification in all operations and logs
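One way to enforce the same-model criterion is to compare a cache's metadata against the active model before any similarity calculation. The ticket does not specify the contents of metadata.json, so the `model` and `dimensions` fields below are assumptions, and `cache_matches_model` is a hypothetical helper.

```lua
-- Assumption: metadata.json decodes to a table with `model` and `dimensions`
-- fields; the ticket does not define this layout.
local function cache_matches_model(metadata, model_name, dimensions)
    if type(metadata) ~= "table" then
        return false, "missing or unreadable metadata"
    end
    if metadata.model ~= model_name then
        return false, "cache was built with " .. tostring(metadata.model)
    end
    if metadata.dimensions ~= dimensions then
        return false, "cache dimensions differ: " .. tostring(metadata.dimensions)
    end
    return true
end

-- Usage: refuse to compute similarities on a cache from another model
local metadata = { model = "all-MiniLM-L6-v2", dimensions = 384 }
local ok, reason = cache_matches_model(metadata, "EmbeddingGemma:latest", 768)
if not ok then
    print("Skipping similarity calculation: " .. reason)
end
```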
Success Metrics
- Isolation: Complete separation of model-specific embeddings
- Flexibility: Easy switching between different embedding models
- Safety: No risk of mixing incompatible embeddings
- Compatibility: Seamless migration from existing single-file cache
- Transparency: Clear indication of active model and cache locations
Edge Cases Handled
- Model Name Sanitization: Special characters in model names handled safely
- Directory Creation: Automatic creation of model-specific directories
- Dimension Validation: Embedding length checked against the selected model's configured dimensions
- Legacy Migration: One-time migration of existing cache
- Model Configuration: Extensible configuration for new models
USER REQUEST FULFILLMENT:
This ticket addresses the user's requirement for:
- ✅ Per-embedding-model storage directories
- ✅ Recognition of different embedding models
- ✅ Separate storage for different models' embeddings
- ✅ Isolation of model-specific results
ISSUE STATUS: COMPLETED ✅
IMPLEMENTATION COMPLETED
Date: November 3, 2025
Status: Per-model storage implemented - embeddings stored in /assets/embeddings/EmbeddingGemma_latest/ directory structure