AI Pipelines for Bible Study

ai-engineering

AI Pipelines for Bible Study

A single LLM call is not an AI system. It is a prompt and a response — useful for one-shot questions but inadequate for the kind of structured reasoning that serious Bible study requires. What makes the difference between a capable theological AI and an expensive autocomplete is pipeline design: the sequencing, coordination, and composition of multiple steps that each contribute to a richer, more grounded final response.

This article describes how Bible study AI pipelines are structured on this platform — what operations run in which order, why the sequence matters, and how the pipeline design responds to the specific demands of theological content.

What a Pipeline Buys You

The case for multi-step pipelines over single-shot generation comes down to one observation: the quality of an LLM response is bounded by the quality of its inputs. A model given a bare question about the meaning of a word in an ancient text will produce a response based on what it learned in training. The same model given the actual lexical entry, four parallel translations of the verse, a commentary from a relevant scholar, the cross-references that link the passage to its Old Testament background, and the user's prior study notes will produce a fundamentally better response — not because the model got smarter but because the inputs improved.

Pipeline design is mostly the engineering work of making sure the model has the right inputs at the right time, structured in a way that surfaces what matters and suppresses what does not.

The Core Bible Study Pipeline

The platform's primary Bible study flow runs through four sequential stages with two parallel fan-out operations embedded in the first stage.

User question
     │
     ▼
┌─────────────────────────────────┐
│  Stage 1: Context Assembly      │
│  ┌──────────────┬─────────────┐ │
│  │ Retrieval    │ Memory      │ │
│  │ (parallel)   │ Lookup      │ │
│  └──────────────┴─────────────┘ │
└─────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────┐
│  Stage 2: Analysis              │
│  Query classification +         │
│  context enrichment             │
└─────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────┐
│  Stage 3: Generation            │
│  Model call with full context   │
└─────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────┐
│  Stage 4: Post-Processing       │
│  Citation verification +        │
│  memory extraction              │
└─────────────────────────────────┘
     │
     ▼
Response to user

Each stage has a distinct purpose and a distinct failure mode, which is why they are kept separate rather than collapsed.

Stage 1: Context Assembly

The first stage runs before any generation happens. Its job is to gather every piece of evidence that could improve the response. Two sub-operations run in parallel:

Retrieval

Retrieval runs five concurrent searches and merges the results:

async function assembleRetrievalContext(
  query: string,
  analysis: QueryAnalysis,
  userId: string
): Promise<RetrievalContext> {
  const [
    scriptureResults,
    commentaryResults,
    lexicalResults,
    crossRefResults,
    userNoteResults,
  ] = await Promise.all([
    retrieveScripture(query, analysis),
    retrieveCommentary(query, analysis),
    retrieveLexicalEntries(analysis.keyTerms),
    traverseCrossReferences(analysis.explicitRefs),
    retrieveUserNotes(query, userId),
  ]);

  return mergeWithSourceWeighting({
    scripture: scriptureResults,       // weight: 1.0
    commentary: commentaryResults,     // weight: 0.8
    lexical: lexicalResults,           // weight: 0.9
    crossRef: crossRefResults,         // weight: 0.85
    userNotes: userNoteResults,        // weight: 0.7
  });
}

Source weighting determines how results are ordered in the context block when token limits require pruning. Scripture always comes first — it is the primary text. Lexical entries rank high because they provide direct word-level grounding. Cross-references rank above commentary because they are structural relationships in the text, not interpretive opinions. User notes rank last — they are personal and context-specific, valuable but not authoritative.

The parallel execution is important. These five searches are independent. Running them sequentially would add 200–400ms of unnecessary latency per search. Running in parallel, the stage completes in roughly the time of the slowest individual search — typically cross-reference traversal against the graph database at 150–200ms.

Memory Lookup

In parallel with retrieval, the user's memory is queried:

async function assembleMemoryContext(
  query: string,
  userId: string,
  sessionId: string
): Promise<MemoryContext> {
  const [semantic, episodic, preferences, workingState] = await Promise.all([
    semanticMemory.retrieve(query, userId),     // what they've studied before
    episodicMemory.getRecentSession(sessionId),  // this session's prior turns
    userPreferences.get(userId),                // translation, study depth, etc.
    workingMemory.getState(sessionId),          // active task context
  ]);

  return { semantic, episodic, preferences, workingState };
}

The memory context shapes generation in ways retrieval does not. If the user has studied Romans 8 extensively in prior sessions, their semantic memory carries summaries of past discussions — the model can reference "in our previous study of predestination" rather than treating every question as if it is the first. Preferences determine which Bible translation to use for quotations, how technically to pitch the response, and whether to include original language commentary.

Context Window Construction

After both parallel operations complete, the retrieved evidence and memory context are assembled into the context block:

function buildContextWindow(
  retrieval: RetrievalContext,
  memory: MemoryContext,
  conversationHistory: Turn[],
  tokenBudget: number
): string {
  // Allocate the budget: 40% retrieval, 30% history, 20% memory, 10% overhead
  const sections = [
    { key: "memory", content: formatMemory(memory), budget: tokenBudget * 0.20 },
    { key: "history", content: formatHistory(conversationHistory), budget: tokenBudget * 0.30 },
    { key: "retrieval", content: formatRetrieval(retrieval), budget: tokenBudget * 0.40 },
  ];

  return sections
    .map(s => truncateToTokenBudget(s.content, s.budget))
    .join("\n\n---\n\n");
}

Memory goes first — it establishes the user's context and preferences before anything else. Conversation history follows — it carries the session's established thread. Retrieval evidence goes last, immediately before the user's question — recency bias in LLM attention means the closest content to the query gets the most weight, and the retrieved evidence should be that content.

Stage 2: Analysis

Before generation, a lightweight analysis step enriches the assembled context with two additions:

Query Classification

The query is classified using the assembled context, not just the raw question. A question that looks simple ("What does 'justification' mean?") gets classified differently depending on whether the user has been studying justification for weeks (semantic memory) vs asking for the first time (no prior context). The classification informs how the generation system prompt is populated:

type StudyIntent =
  | "word_study"      // user wants lexical depth
  | "passage_study"   // verse or pericope interpretation
  | "topical_survey"  // theme across scripture
  | "comparative"     // tradition or interpretation comparison
  | "personal"        // applying scripture to personal context
  | "historical"      // historical or background context;

function classifyStudyIntent(
  query: string,
  memory: MemoryContext,
  conversationHistory: Turn[]
): StudyIntent {
  // Use fast model to classify with full context
  // ...
}

Each StudyIntent maps to a different system prompt template. A word_study prompt emphasizes lexical analysis and original language content. A topical_survey prompt emphasizes breadth across the canon. A personal prompt foregrounds the user's notes and prior reflections. The same model, different prompts, substantially different responses.

Context Enrichment

When the retrieval context is sparse (fewer than 3 high-confidence results), the analysis stage attempts to enrich it with targeted follow-up searches:

async function enrichSparseContext(
  retrieval: RetrievalContext,
  query: string,
  analysis: QueryAnalysis
): Promise<RetrievalContext> {
  if (retrieval.highConfidenceCount >= 3) return retrieval;

  // Reformulate and retry with broader semantic scope
  const expandedQuery = await expandTheologicalQuery(query);
  const fallback = await retrieveScripture(expandedQuery, { ...analysis, minScore: 0.45 });

  return mergeRetrieval(retrieval, fallback);
}

This is a lightweight self-correction: if the initial retrieval was inadequate, try again with a broader query before the model is given insufficient context to work with. The alternative — generating from sparse context — risks hallucination and low-confidence responses.

Stage 3: Generation

The generation stage is a single model call with the full assembled context. The prompt structure:

[System prompt — role, constraints, citation requirements, study intent]
[Memory context — user profile, preferences, prior study]
[Conversation history — current session turns]
[Retrieved evidence — scripture, lexical, cross-refs, commentary, user notes]
[Retrieval quality signal — how much evidence was found]
[User's question]

The key constraint in the system prompt is the citation instruction from article 13: every theological claim must be supported by a [CITE: ref] marker pointing to retrieved content. This constraint is enforced structurally — the generation is only trusted if claims are citable.

Token budget for the generation itself is computed from what is left after the context window is assembled. If the context window consumed 60,000 tokens of a 100,000 token context limit, generation gets approximately 4,000 tokens (leaving a buffer). In practice, most responses stay within 1,000–2,000 tokens; the budget is rarely binding.

Streaming

Generation streams directly to the client. The user sees tokens arriving in real time rather than waiting for the full response to complete. This is implemented with SSE and a ReadableStream on the client:

async function* streamStudyResponse(
  context: AssembledContext
): AsyncGenerator<string> {
  const stream = await anthropic.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: context.generationBudget,
    system: context.systemPrompt,
    messages: context.messages,
  });

  for await (const chunk of stream) {
    if (chunk.type === "content_block_delta") {
      yield chunk.delta.text;
    }
  }
}

Post-processing runs asynchronously after streaming completes — the user does not wait for citation verification to start reading.

Stage 4: Post-Processing

After generation completes, two operations run concurrently:

Citation Verification

Every [CITE: ref] marker in the generated response is verified against the retrieved context (article 13 covers this in detail). The verification result annotates each citation:

  • Verified and in context → displayed with full verse text
  • In context but support level weak → displayed with qualifier
  • Not in context → flagged; user is notified

Verification runs asynchronously. The client receives the raw generated text first via streaming, then receives a citation annotation event over the same SSE channel:

// Server sends annotations after streaming completes
res.write(`event: citations\ndata: ${JSON.stringify(citationAnnotations)}\n\n`);

The client updates the displayed response in-place: citation markers transform from plain text into linked, expandable footnotes.

Memory Extraction

After generation, a fast model call extracts memory-worthy information from the exchange:

async function extractMemory(
  userQuery: string,
  modelResponse: string,
  citedPassages: string[],
  userId: string
): Promise<void> {
  const extraction = await fastModel.complete({
    prompt: `From this Bible study exchange, identify:
1. Passages the user asked about specifically (for episodic memory)
2. Theological topics explored in depth (for semantic memory)
3. Any explicitly stated study preferences or goals (for user profile)

User question: ${userQuery}
Response covered: ${summarize(modelResponse)}
Cited passages: ${citedPassages.join(", ")}

Return a JSON object with: { passages: [], topics: [], preferences: [] }`,
    maxTokens: 150,
  });

  await memoryService.ingest(extraction, userId);
}

Memory extraction runs entirely in the background — the user's response is not delayed. But the extracted information is available immediately for subsequent questions in the same session and persists to future sessions.

Pipeline Variants

The core four-stage pipeline handles most queries, but specific study tasks use pipeline variants:

Study Guide Generation Pipeline

Generating a multi-section study guide is a multi-step task that requires more structure than the conversational pipeline:

1. Outline generation — model produces a structured plan: key questions, 
   target passages, discussion angles
2. Per-section retrieval — each planned section triggers its own retrieval
3. Parallel section drafting — sections are drafted concurrently
4. Assembly + coherence check — sections are merged and a fast model 
   checks that transitions between sections are logical
5. Citation verification — full citation check across all sections

The parallel section drafting step reduces latency significantly. A five-section study guide where each section takes 4 seconds to generate runs in 4 seconds (parallel) rather than 20 seconds (sequential). The per-section retrieval ensures each section has evidence tailored to its specific questions rather than sharing one context block that may not be deep enough on any individual topic.

Word Study Pipeline

Word studies have a distinct flow because the primary sources are lexical rather than propositional:

1. Term normalization — transliterated terms converted to Unicode, 
   Strong's numbers resolved
2. Lexical retrieval — BDAG, Thayer, BDB, Strong's entries retrieved
3. Occurrence retrieval — all occurrences of the term in the canon
4. Semantic field mapping — related terms retrieved from the graph
5. Usage pattern analysis — a model call that reads all occurrences 
   and synthesizes the term's range of meaning
6. Commentary retrieval — commentary specifically on lexical/exegetical 
   questions for key occurrences
7. Synthesis generation — final response drawing on all of the above

The usage pattern analysis step (step 5) is a second model call — it exists because analyzing 50+ occurrences of a term across the canon and synthesizing the semantic range requires more token budget and focused attention than the final synthesis call should spend on it. Breaking it into two model calls (analysis then synthesis) produces better results than asking one model call to do both.

Cross-Reference Exploration Pipeline

When a user wants to follow a theme through the Bible, the pipeline expands outward from a seed passage:

1. Seed passage retrieval — the starting verse and its pericope
2. Graph traversal — cross-reference graph expansion (1-2 hops)
3. Thematic clustering — retrieved passages clustered by theme
4. Cluster selection — fast model selects the 3 most relevant clusters
5. Per-cluster evidence assembly — retrieval deepens within each cluster
6. Synthesis — model traces the theme across clusters with narrative arc

The graph traversal in step 2 can return dozens of cross-references. The thematic clustering in step 3 prevents the synthesis step from receiving an undifferentiated flood of loosely related passages — it groups them so the model can see structure rather than a list.

Observability in the Pipeline

Each pipeline stage emits structured events that are captured in distributed tracing. A complete pipeline trace shows:

Total: 1,847ms

  Context assembly:        412ms
    ├─ Scripture retrieval:   187ms
    ├─ Commentary retrieval:  203ms (longest — cross-encoder reranking)
    ├─ Lexical retrieval:     121ms
    ├─ Cross-ref traversal:   156ms
    ├─ User note retrieval:    89ms
    └─ Memory lookup:         134ms (parallel with retrieval)

  Analysis:                 89ms
    └─ Query classification:  89ms

  Generation (TTFT):       723ms
  Generation (full):     1,241ms

  Post-processing:         105ms (async, not in user-facing critical path)
    ├─ Citation verification: 87ms
    └─ Memory extraction:     105ms

The trace surfaces where time is going. Commentary retrieval is consistently the slowest retrieval step because cross-encoder reranking is CPU-bound. This is the obvious first target if generation latency becomes a user experience issue — either cache aggressively or run the reranker on dedicated hardware.

Post-processing latency does not appear in the user-facing critical path because it runs after streaming begins. The user starts reading at TTFT (723ms); citation annotations arrive at 1,952ms — well within the time the user spends reading the response.

Pipeline Design Principles

Seven patterns that emerged from building and operating this pipeline:

Retrieve before you generate. Every piece of information the model needs should be in the context window, not in training memory. Retrieval is cheap relative to the quality improvement it provides.

Parallelize what is independent. Retrieval sub-operations do not depend on each other. Running them sequentially wastes 3–4× the necessary latency.

Break on complexity. When a task requires different reasoning at different stages (word study analysis then synthesis), break it into separate model calls. One call doing too many things produces worse results than two focused calls.

Classify before generating. The intent of the query determines the system prompt. A one-size-fits-all prompt produces mediocre responses across the board.

Post-process asynchronously. Citation verification and memory extraction do not block the user's experience. Run them after streaming begins and update the client when they complete.

Make failures graceful. When retrieval is sparse, enrich the context or tell the model explicitly that evidence is limited. Do not let the model generate confidently from inadequate inputs.

Log everything. Each stage should emit structured events. The trace is the only reliable way to understand pipeline behavior under production load conditions — you cannot diagnose a 2-second response without knowing which of the four stages consumed that 2 seconds.

The pipeline structure on this platform grew from a single-call prototype that worked well for simple questions and poorly for everything else. Each stage was added when the previous version hit a quality ceiling: retrieval when generation from memory was too unreliable, memory when retrieval alone missed user context, classification when one prompt served all intents poorly, post-processing when citation failures were invisible. The final pipeline is not a designed-upfront architecture — it is the accumulated response to observed failures, one stage at a time.

Case Study

In Progress

Bible Verse — Case Study

Production SaaS Platform · Full-Stack · Founder & Sole Engineer

A domain-driven SaaS platform with five independently scalable system boundaries: scripture content delivery, RAG-backed AI study, real-time community interaction, async media processing, and infrastructure services — built and operated end-to-end.

Our Results

37K+
Verses Indexed
5
AI Models
5
Bounded Domains
3
Job Queues

How We Built It

  • RAG pipeline grounding AI responses in actual scripture rather than model memory
  • Hybrid Llama / OpenAI routing — local inference for cost, API fallback for quality at the edge
  • Non-blocking media processing — FFmpeg jobs enqueued via BullMQ, API never waits on transcoding
  • Cross-instance real-time consistency via Redis pub/sub behind WebSocket and WebRTC layers

Lessons Learned

  • Domain boundaries enforced at the service layer prevent coupling long before scale demands microservices.
  • RAG retrieval quality matters more than model size — better embeddings outperform a larger model on poor context.
  • Async queue design should be first-class, not bolted on; BullMQ worker isolation saved the request path repeatedly.

Stack

Nuxt 3TypeScriptNitroPostgreSQLPrismaRedisBullMQWeaviateMinIOFFmpegWebRTCWebSocketsLlama 3.2OpenAI APIKubernetes
View Full Case Study

Written by

Full-Stack Engineer & Systems Architect

5+ years building production systems · AI, Backend & Infrastructure · Founder of Bible Logic

Full-stack engineer with 5+ years of hands-on experience designing and shipping production systems — from Nuxt 3 frontends and Nitro APIs to self-hosted Kubernetes clusters, RAG pipelines, and real-time AI applications. Everything I write comes from systems I've designed, deployed, and operated in production.

5+ Years Experience AI Systems Specialist Kubernetes & Infrastructure
Nuxt 3TypeScriptPostgreSQLKubernetesRAG / LLMWebRTCAWS IVSRedis