docs/data-flow-architecture.md

Data Flow Architecture

Neocities Poetry Modernization Project

This document describes the complete data flow architecture of the poetry recommendation system, which transforms ~6,860 poems into an interconnected, explorable static website using semantic embeddings.

Overview

The system follows a seven-stage pipeline that cleanly separates data generation (embeddings, similarity matrices) from data viewing (HTML generation). This separation of concerns isolates errors to smaller areas of interest and enables incremental processing.

Design Philosophy

Flat HTML: No JavaScript/CSS — pure semantic HTML that works anywhere, forever
Incremental Processing: Cache intermediate results, only recompute what's changed
Dual Discovery: Similarity (focused exploration) and diversity (expansive/"schizophrenic" exploration)
Local LLM: Embedding generation via Ollama service, no external API dependencies

Pipeline Stages

Stage 1: Extraction

Purpose: Extract raw content from ZIP archives and legacy files into structured JSON.

How Data Transforms: Imagine unpacking suitcases after returning from three different trips. Each suitcase was packed by a different person using their own system — one folded everything neatly, another just stuffed things in, a third used vacuum bags. The extraction stage opens each suitcase and sorts the contents into labeled boxes based on where they came from: "things from the beach trip," "things from the mountain trip," "things from the city trip." The items themselves aren't changed — they're just organized so that the next person can find them without needing to know the original packing method.

┌──────────────────┐    ┌──────────────┐    ┌────────────────────────┐
│ ZIP Archives     │    │              │    │ input/fediverse/       │
│ - most-recent-29 │ →  │scripts/update│ →  │   files/poems.json     │
│ - similar-diff   │    │              │    │ input/messages/        │
│                  │    │              │    │   files/poems.json     │
│ compiled.txt     │    │              │    │ input/notes/           │
│ (legacy 4MB)     │    │              │    │   files/poems.json     │
└──────────────────┘    └──────────────┘    └────────────────────────┘

Scripts involved:

/scripts/update - Orchestrates extraction
/scripts/zip-extractor.lua - Handles ZIP archive extraction
/scripts/extract-fediverse.lua - Processes Mastodon/fediverse JSON exports
/scripts/extract-messages.lua - Processes message archives
/scripts/extract-notes.lua - Processes note files

Output: Three category-specific JSON files in input/*/files/poems.json

Stage 2: Parsing

Purpose: Unify all input sources into a single normalized dataset.

How Data Transforms: Think of a translator receiving letters written in three different languages. Each letter uses different conventions — some have dates at the top, some at the bottom, some use formal greetings, others are casual. The translator rewrites every letter into a single common language using a consistent format: sender's name here, date there, body text in this section. Each letter also receives a sequential filing number so it can be referenced later. The meaning of each letter is preserved exactly, but now anyone can read them all without needing to know the original languages or conventions.

┌────────────────────┐         ┌─────────────────────────────────┐
│ poem-extractor.lua │ ──────→ │ assets/poems.json               │
│                    │         │ - 6,860 poems                   │
│ Auto-detects:      │         │ - Unified structure             │
│ - JSON extracts    │         │ - Category metadata             │
│ - Legacy compiled  │         │ - Creation timestamps           │
└────────────────────┘         └─────────────────────────────────┘

Key file: /src/poem-extractor.lua

Output structure:

{
  "poems": [
    {
      "id": "0001",
      "category": "fediverse|messages|notes",
      "content": "poem text...",
      "metadata": {
        "created_at": "2024-01-15T10:30:00Z",
        "source_id": "original_id_from_source"
      }
    }
  ],
  "metadata": {
    "total_poems": 6860,
    "extraction_date": "2024-12-23"
  }
}

Output: /assets/poems.json (10.9 MB)

Stage 3: Validation

Purpose: Ensure data quality and detect anomalies.

How Data Transforms: A quality inspector walks through the warehouse with a clipboard. They don't move or change anything — they simply observe and take notes. "Item 247 has no label." "Item 1,892 appears to be an empty box." "Items 3,400 through 3,450 all arrived on the same day, which seems unusual." The inspector's report doesn't fix problems; it documents them so that someone can decide what to do. The original items remain untouched. This stage produces a companion document — a health report that travels alongside the main dataset.

┌────────────────────┐         ┌─────────────────────────────────┐
│ poem-validator.lua │ ──────→ │ assets/validation-report.json   │
│                    │         │ - Empty content detection       │
│ Checks:            │         │ - Missing field reports         │
│ - Structure        │         │ - Data inconsistencies          │
│ - Content quality  │         │ - Quality metrics               │
│ - Field presence   │         │                                 │
└────────────────────┘         └─────────────────────────────────┘

Key file: /src/poem-validator.lua

Output: /assets/validation-report.json (6.2 MB)

Stage 4: Embedding Generation

Purpose: Transform poem text into 768-dimensional semantic vectors.

How Data Transforms: Imagine a master perfumer who can smell any flower and describe its essence using a standardized palette of 768 distinct scent notes — "three parts citrus, half part musk, two parts rain-on-pavement, zero parts vanilla," and so on. The perfumer reads each poem and distills it into this scent profile. Two poems about loneliness might have very similar profiles even if they use completely different words, while two poems that happen to share the word "blue" might smell nothing alike if one is about sadness and the other about the ocean. The original poems are not changed; each one simply receives a companion "scent card" that captures what it's about rather than what words it uses.

┌────────────────────┐         ┌─────────────────────────────────┐
│ similarity-engine  │         │ assets/embeddings/              │
│       +            │ ──────→ │   EmbeddingGemma_latest/        │
│ Ollama @ :10265    │         │     embeddings.json (64 MB)     │
│                    │         │                                 │
│ Model:             │         │ Structure:                      │
│ embeddinggemma     │         │ {metadata: {...},               │
│ (768 dimensions)   │         │  embeddings: [{poem_index, id,  │
│                    │         │   embedding: [768 floats]}, ]}  │
└────────────────────┘         └─────────────────────────────────┘

Key file: /src/similarity-engine.lua

Supported models:

Model	Dimensions	Notes
`embeddinggemma:latest`	768	Default, recommended
`text-embedding-ada-002`	1536	OpenAI-compatible
`all-MiniLM-L6-v2`	384	Lightweight

Processing features:

Incremental: Only processes new/changed poems
Per-model storage: Each model gets its own subdirectory
Fail-fast: Stops immediately on any error with detailed diagnostics
Rationale: Silent failures lead to corrupt data. Better to fail hard with

actionable error messages so issues can be identified and fixed at the source.

Error output includes: poem ID, context, and specific remediation steps

Output: /assets/embeddings/{model_name}/embeddings.json

Stage 5: Similarity Calculation

Purpose: Calculate pairwise similarity between all poems.

How Data Transforms: Now that every poem has a scent card, we can hold any two cards up and ask "how similar do these smell?" A sommelier comparing wines doesn't need to see the grapes — they compare the tasting notes. This stage compares every poem's scent card against every other poem's scent card, producing a massive web of relationships: "Poem 42 and Poem 1,337 smell almost identical; Poem 42 and Poem 500 share nothing in common." Additionally, poems are sorted into broad "neighborhoods" based on their overall character — all the citrus-heavy poems might be tagged yellow, all the earthy ones brown. The scent cards themselves don't change; this stage produces a relationship map and a set of neighborhood assignments that sit alongside everything else.

┌────────────────────┐         ┌─────────────────────────────────┐
│ similarity-engine  │         │ similarity_matrix.json (263 KB) │
│                    │ ──────→ │ - Sparse matrix format          │
│ Algorithms:        │         │ - Only meaningful similarities  │
│ - Cosine (default) │         │                                 │
│ - Euclidean        │         │ poem_colors.json (639 KB)       │
│ - Manhattan        │         │ - Semantic color assignments    │
│ - 5+ others        │         │ - Clustering visualization      │
└────────────────────┘         └─────────────────────────────────┘

Key files:

/src/similarity-engine.lua - Matrix calculation
/src/similarity-calculator.lua - Modular algorithm implementations
/src/semantic-color-calculator.lua - Color assignment

Semantic colors: red, blue, green, purple, orange, yellow, gray

Output:

/assets/embeddings/{model}/similarity_matrix.json
/assets/embeddings/{model}/poem_colors.json

Stage 6: Diversity Chaining

Purpose: Pre-compute "maximally different" poem sequences for diversity exploration.

How Data Transforms: A travel agent is asked to plan a road trip that visits the most varied landscapes possible. If you start at a beach, the next stop should be mountains; after mountains, perhaps desert; after desert, a dense forest. The agent consults the relationship map and, for each possible starting point, charts a journey that maximizes contrast at every step. "If you begin at Poem 42 and want to experience maximum variety, visit Poem 3,201 next, then Poem 789, then Poem 4,455..." These itineraries are written down and filed away so that travelers don't have to wait for route planning — the journeys are pre-charted for every possible starting point.

┌────────────────────────────┐    ┌─────────────────────────────────┐
│ precompute-diversity-      │    │ diversity_cache.json            │
│ sequences-gpu              │    │ - One file with all sequences   │
│ (Vulkan compute shaders)   │──→ │ - For each starting poem        │
│                            │    │ - Least-similar selection       │
│ Algorithm:                 │    │ - Produced in ~58s (CPU path    │
│ Greedy selection of        │    │   took ~42 hours; 2,600× win)   │
│ least-similar via GPU      │    │                                 │
└────────────────────────────┘    └─────────────────────────────────┘

Key files:

/scripts/precompute-diversity-sequences-gpu - GPU pipeline wrapper
/libs/vulkan-compute/ - Vulkan compute infrastructure (Phase 9)
/src/diversity-chaining.lua - CPU-side algorithm (legacy, retained

as the reference and fallback when GPU is unavailable)

/src/mass-diversity-generator.lua - Batch coordinator

Output: /assets/embeddings/{model}/diversity_cache.json

Stage 7: HTML Generation

Purpose: Transform all computed data into static HTML pages.

How Data Transforms: A printing press takes everything assembled so far — the unified collection, the relationship map, the neighborhood colors, the pre-charted journeys — and stamps out thousands of interconnected pages. Each page is a doorway: step through one door and you see everything arranged by similarity; step through another and you're on the pre-charted diversity journey. The press works in parallel, running multiple print heads simultaneously to produce pages faster. Nothing new is computed here; the press simply renders all the previously-gathered knowledge into a format that humans can navigate with nothing more than a web browser and the ability to click links.

┌────────────────────────┐     ┌─────────────────────────────────┐
│ flat-html-generator    │     │ output/                         │
│          +             │     │   index.html (→ chronological)  │
│ generate-html-parallel │ ──→ │   chronological.html (12 MB)    │
│   (8 threads via effil)│     │   explore.html (1 KB)           │
│                        │     │   wordcloud.html (menu + index) │
│ Template engine:       │     │   similar/0001..6860.html       │
│ /src/html-generator/   │     │   different/*.html              │
│   template-engine.lua  │     │                                 │
└────────────────────────┘     └─────────────────────────────────┘

Key files:

/src/flat-html-generator.lua - Main generation logic
/scripts/generate-html-parallel - Multi-threaded wrapper
/src/html-generator/template-engine.lua - HTML templates
/src/html-generator/url-manager.lua - Navigation URLs
/src/html-generator/golden-poem-bonus.lua - Golden poem styling

Generated pages:

Page Type	Count	Size	Description
Chronological	1	12 MB	All poems in order
Word cloud (menu)	1	varies	Site entry page; embeds the live poem index
Explore	1	1 KB	Discovery instructions
Similarity	~6,400	8.5 MB each	Per-poem similarity rankings
Diversity	~6,400	varies	Per-poem diversity chains

Complete Data Flow Diagram

[ZIP Archives] → [scripts/update] → [Temp Extraction]
                                          │
[compiled.txt] ─────────────────→ [poem-extractor.lua]
[input/*/poems.json] ────────────→        │
                                    [assets/poems.json]
                                          │
                            ┌─────────────┴─────────────┐
                            ▼                           ▼
                    [poem-validator.lua]        [image-manager.lua]
                            │                           │
              [validation-report.json]      [image-catalog.json]

[Ollama Service] → [similarity-engine.lua] ← [poems.json]
                            │
              [embeddings/{model}/embeddings.json]
                            │
              [similarity_matrix.json]
                            │
              [poem_colors.json]
                            │
              [diversity sequences]
                            │
          [flat-html-generator.lua] ← [all data sources]
                            │
              ┌─────────────┴─────────────┐
              ▼                           ▼
    [chronological.html]          [similar/*.html]
    [wordcloud.html (menu)]       [different/*.html]
              │                           │
              └─────────────┬─────────────┘
                            ▼
              [Final Website in /output/]

Entry Points

Command	Purpose
`./run.sh`	Full pipeline: update → extract → process → generate
`lua src/main.lua`	Core processing (non-interactive)
`lua src/main.lua -I`	Interactive mode for selective operations
`./generate-embeddings.sh`	Standalone embedding generation
`./phase-demo.sh`	Phase demonstration selector

Key Asset Files Summary

File	Location	Size	Purpose
poems.json	/assets/	10.9 MB	Unified poem dataset
embeddings.json	/assets/embeddings/{model}/	64 MB	Semantic vectors
similarity_matrix.json	/assets/embeddings/{model}/	263 KB	Pairwise similarities
poem_colors.json	/assets/embeddings/{model}/	639 KB	Semantic color assignments
validation-report.json	/assets/	6.2 MB	Data quality report
image-catalog.json	/assets/	326 KB	Image metadata

Configuration Files

File	Purpose
`/config/asset-paths.lua`	Configurable storage locations
`/config/golden-poem-settings.json`	Golden poem identification criteria
`/config/input-sources.json`	Input source configuration
`/config/semantic-colors.json`	Color mapping for categorization
`/config/similarity-calculator-settings.json`	Algorithm settings

External Dependencies

Dependency	Purpose	Location
Ollama	Embedding generation (CUDA-accelerated)	`http://192.168.0.115:10265`
effil	Multi-threading (HTML generation orchestrator)	`/home/ritz/programming/ai-stuff/libs/lua/effil-jit/build/`
Vulkan compute	GPU acceleration (diversity sequences, similarity rankings)	`/libs/vulkan-compute/` (project-local)
LuaJIT	Runtime	System
curl	HTTP requests	System

Parallelization Strategy

The pipeline is parallel along two distinct axes, used for different
kinds of work:

effil (CPU threads, shared via lazy-loading orchestrator) — used in

src/flat-html-generator.lua:3277+ to dispatch HTML page generation
across 8 worker threads. Workers receive small ranking slices over
effil channels, format HTML, and write files. This is the right tool
for embarrassingly-parallel work whose unit is "render one page".

Vulkan compute shaders (GPU) — used in

scripts/precompute-diversity-sequences-gpu and
scripts/generate-similarity-rankings-cache, both of which dispatch
cosine-distance and greedy-selection work to the GPU via the
libs/vulkan-compute/ FFI wrapper. This is the right tool for
matrix-heavy numeric work where the per-operation memory footprint
beats CPU cache and the GPU's parallelism dwarfs the available CPU
threads.

CUDA via Ollama — embedding generation itself runs on the GPU

through Ollama's CUDA build (see scripts/start-ollama-cuda.sh).

One step of the pipeline that is still single-threaded is word page
generation in src/generate-word-pages.lua. Issue 10-035 captures the
design (effil-orchestrator pattern, modelled on the completed Issue
10-034 HTML orchestrator) but the change has not landed yet.

User Experience Flow

Reader arrives at chronological.html
Clicks any poem → lands on /similar/XXXX.html
Sees the selected poem at top, all others ranked by similarity
Can navigate deeper into similarity (focused exploration)
Can jump to /different/XXXX.html for diversity (expansive exploration)

The dual discovery modes serve different reading strategies:

Similarity: Drill into a vein of thought
Diversity: Deliberately break out of patterns

Document History

Created: December 23, 2025
Updated: December 23, 2025 — Added "How Data Transforms" sections with analogies to each pipeline stage
Purpose: Document the complete data flow architecture for project understanding and onboarding