Question 1

When should I use microservices instead of a monolith for an AI system?

Accepted Answer

Start with a monolith unless you have a specific, proven reason to split. For AI systems in particular, premature service decomposition adds coordination overhead around shared models, embedding caches, and vector stores, and that overhead is hard to manage. Split when a component, such as an embedding service or inference worker, has clearly different scaling requirements or failure characteristics from the rest of the system. Separate what needs to be separate; keep together what changes together.

Question 2

What is an embedding service and why does it need to be a backend service?

Accepted Answer

An embedding service converts text into high-dimensional vectors using a language model. It is worth running as a dedicated backend service for three reasons. First, embedding computation is CPU- or GPU-intensive and should be isolated from request-handling processes. Second, embeddings are often reused, so they benefit from caching. Third, several parts of your system (ingestion pipelines, search endpoints, recommendation logic) need embeddings, and none of them should load its own copy of the model. Treating it as a first-class service makes it independently scalable and cacheable.
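The caching point can be sketched as a thin wrapper around whatever function actually calls the model. This is a minimal sketch, not a production cache: `EmbedFn` and `CachedEmbedder` are hypothetical names introduced here, the cache lives in process memory, and eviction is a simple least-recently-used policy built on `Map` insertion order.

```typescript
// Hypothetical signature for the function that actually runs the model.
type EmbedFn = (text: string) => Promise<number[]>;

class CachedEmbedder {
  // Promises are cached, so concurrent requests for the same text share one model call.
  private cache = new Map<string, Promise<number[]>>();
  private embedFn: EmbedFn;
  private maxEntries: number;

  constructor(embedFn: EmbedFn, maxEntries = 10_000) {
    this.embedFn = embedFn;
    this.maxEntries = maxEntries;
  }

  embed(text: string): Promise<number[]> {
    const key = text.trim();
    const hit = this.cache.get(key);
    if (hit !== undefined) {
      // Re-insert to mark the entry as most recently used.
      this.cache.delete(key);
      this.cache.set(key, hit);
      return hit;
    }

    const pending: Promise<number[]> = this.embedFn(key).catch((err) => {
      // Do not keep failed computations in the cache.
      if (this.cache.get(key) === pending) this.cache.delete(key);
      throw err;
    });
    this.cache.set(key, pending);

    // Evict the least recently used entry once the cache is over capacity.
    if (this.cache.size > this.maxEntries) {
      const oldest = this.cache.keys().next().value;
      if (oldest !== undefined) this.cache.delete(oldest);
    }
    return pending;
  }
}
```

In a multi-process deployment the same idea would usually sit behind a shared store rather than a per-process `Map`, so that ingestion and search workers hit the same cache.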

Question 3

How do I handle streaming responses from AI models in a backend API?

Accepted Answer

Use server-sent events (SSE) or WebSockets to forward the model's token-by-token output to the client. In Node.js/Nuxt backends, you pipe the stream from the model API through your server to the client response. Key implementation concerns: set appropriate timeouts (AI inference can take 10-30+ seconds), handle stream interruption gracefully, decide whether to persist the full response only after the stream completes, and consider whether the client needs a message ID to reconnect if the stream drops.
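A minimal sketch of the SSE side, assuming the model client exposes its output as an `AsyncIterable<string>` of tokens. The names `SseSink`, `formatSseEvent`, `streamTokens`, and the `persist` callback are hypothetical and introduced only for illustration; the event names `done` and `stream-error` are likewise arbitrary choices. The sketch persists the full text only after the stream completes and tags each frame with an `id` built from the message ID, which a reconnecting client could send back.

```typescript
// Minimal shape of the response object. Node's http.ServerResponse satisfies it.
interface SseSink {
  write(chunk: string): unknown;
  end(): unknown;
}

// Build one SSE frame: optional "id:" and "event:" fields, then "data:" lines,
// terminated by a blank line.
function formatSseEvent(data: string, id?: string, event?: string): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  // A multi-line payload needs one "data:" field per line.
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  return lines.join("\n") + "\n\n";
}

async function streamTokens(
  tokens: AsyncIterable<string>,
  sink: SseSink,
  messageId: string,
  persist: (fullText: string) => Promise<void>, // hypothetical storage hook
): Promise<string> {
  let fullText = "";
  let seq = 0;
  try {
    for await (const token of tokens) {
      fullText += token;
      sink.write(formatSseEvent(JSON.stringify({ token }), `${messageId}:${seq}`));
      seq += 1;
    }
    // Persist only once the model has finished, then tell the client.
    await persist(fullText);
    sink.write(formatSseEvent(JSON.stringify({ messageId }), undefined, "done"));
  } catch {
    // Upstream failure or storage failure: report it instead of hanging the client.
    sink.write(formatSseEvent(JSON.stringify({ messageId }), undefined, "stream-error"));
  } finally {
    sink.end();
  }
  return fullText;
}
```

Before calling `streamTokens` on a real HTTP response, the handler would set `Content-Type: text/event-stream` and `Cache-Control: no-cache`, and raise any proxy or server timeouts that are shorter than the expected inference time.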

Question 4

What is vector search and how does it differ from full-text search?

Accepted Answer

Full-text search matches keywords: it finds documents containing the words you typed. Vector search matches meaning: it finds documents that are semantically close to your query, even if they use completely different words. Vector search requires pre-computed embeddings stored in a vector database or pgvector-enabled PostgreSQL. For AI applications like RAG, semantic search, and recommendation systems, vector search is often more useful than full-text search. For filtering and exact matching, full-text search is still the right tool.
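The difference can be shown with a toy example. The three-dimensional vectors below are made up by hand for illustration and do not come from any real embedding model; cosine similarity is used as the distance measure because it is a common choice, not because the source prescribes it. The function and variable names are hypothetical.

```typescript
interface Doc {
  id: string;
  text: string;
  vector: number[]; // hand-written toy embedding, not model output
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
}

// Keyword matching: true only if the document shares at least one term with the query.
function keywordMatch(query: string, docText: string): boolean {
  const docTerms = new Set(tokenize(docText));
  return tokenize(query).some((term) => docTerms.has(term));
}

// Vector search: order documents by similarity to the query vector, best first.
function rankBySimilarity(queryVector: number[], docs: Doc[]): Doc[] {
  return [...docs].sort(
    (x, y) => cosineSimilarity(queryVector, y.vector) - cosineSimilarity(queryVector, x.vector),
  );
}

const docs: Doc[] = [
  { id: "a", text: "Steps to recover account credentials", vector: [0.8, 0.2, 0.1] },
  { id: "b", text: "Quarterly sales report", vector: [0.0, 0.1, 0.9] },
];
const query = "reset password";
const queryVector = [0.9, 0.1, 0.0]; // toy embedding of the query

const keywordHits = docs.filter((d) => keywordMatch(query, d.text));
const ranked = rankBySimilarity(queryVector, docs);
```

Here `keywordHits` is empty because neither document contains "reset" or "password", while `ranked` puts document `a` first because its toy vector points in nearly the same direction as the query's.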

Question 5

How should I approach GPU allocation in a shared backend environment?

Accepted Answer

Treat the GPU as a scarce, dedicated resource rather than a general compute pool. Use a queue (for example BullMQ, which is Redis-backed) to serialize inference requests rather than allowing concurrent GPU access that causes memory contention. Profile your models' VRAM usage and set hard limits per worker process. For multi-model environments, consider model loading strategies: lazy loading saves memory but adds latency to the first request for each model, while eager loading avoids that latency at the cost of holding every model in memory. In Kubernetes/K3s environments, use resource limits and node taints to ensure GPU nodes only run GPU workloads.
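The serialization idea can be sketched without any queue library. The class below is an in-process stand-in for illustration only; it is not BullMQ and does not use Redis, and the names `SerialQueue`, `enqueue`, `depth`, and `maxDepth` are hypothetical. It runs one job at a time and rejects new work once a fixed backlog is reached, so callers fail fast instead of piling up behind the GPU.

```typescript
type Job<T> = () => Promise<T>;

class SerialQueue {
  // Each new job is chained onto the tail, so at most one job runs at a time.
  private tail: Promise<unknown> = Promise.resolve();
  private inFlight = 0;
  private readonly maxDepth: number;

  constructor(maxDepth = 100) {
    this.maxDepth = maxDepth;
  }

  // Number of jobs that are queued or currently running.
  get depth(): number {
    return this.inFlight;
  }

  enqueue<T>(job: Job<T>): Promise<T> {
    if (this.inFlight >= this.maxDepth) {
      return Promise.reject(new Error("inference queue is full"));
    }
    this.inFlight += 1;

    // Start the job only after everything ahead of it has settled.
    const run = this.tail.then(() => job());

    // Keep the chain alive whether the job succeeds or fails.
    const settle = () => {
      this.inFlight -= 1;
    };
    this.tail = run.then(settle, settle);
    return run;
  }
}
```

A request handler would wrap its model call, for example `queue.enqueue(() => runInference(prompt))` where `runInference` is whatever function talks to the GPU. With a Redis-backed queue the same effect would typically come from running a single worker with a concurrency of one per GPU.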
