Each test is 5 questions with varying difficulty.
AI Prep covers AI Agents, Generative AI, ML Fundamentals, NLP & LLMs and a lot more, with adaptive tests and daily challenges. Fully offline on Android. Free to try, one-time unlock for lifetime access.
Model serving is the critical engineering discipline of deploying trained AI models as production-grade, low-latency, high-throughput inference services. While model training is a batch offline process, model serving must handle concurrent real-time requests under strict latency SLAs, often P99 < 200ms for LLM first-token latency, while maximising GPU utilisation to control infrastructure costs.
In 2026, model serving has become significantly more complex as organisations deploy large language models with 7B to 70B+ parameters. The serving stack now spans framework choices (vLLM, Triton, TGI), hardware configuration (tensor parallelism, pipeline parallelism), memory management (KV cache, PagedAttention), batching strategies (continuous batching, dynamic batching), and deployment topology (single-node vs multi-node, Ray clusters).
This topic is essential for MLOps Engineers, AI Platform Engineers, and any engineer responsible for putting models into production at scale.
A model that achieves 95% accuracy in evaluation but takes 30 seconds per response is not useful in production. Model serving engineering bridges the gap between offline evaluation quality and online production performance. The choice of serving configuration can mean the difference between a 5x GPU cost overrun and a profitable inference service.
The shift to LLMs has made serving dramatically harder. LLM inference, particularly the token-by-token decode phase, is typically memory-bound rather than compute-bound: the bottleneck is GPU memory bandwidth, not FLOPS. The KV cache for long-context requests can exceed 10GB per request on a 70B model. PagedAttention (vLLM) and continuous batching emerged specifically to address these constraints, enabling up to 10-20x higher throughput compared to naive static batching.
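The memory-bound claim can be made concrete with a back-of-the-envelope bound. The sketch below assumes that each generated token requires reading every weight from GPU memory once, and it ignores KV cache reads, batching, and multi-GPU effects. The 2000 GB/s bandwidth figure and the function name are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-sequence decode speed.

    Assumes every generated token streams all weights from memory once,
    so speed is limited by bandwidth divided by model size.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / weight_gb


# Hypothetical 70B model on hardware with an assumed 2000 GB/s of bandwidth
fp16_speed = decode_tokens_per_second(70, 2.0, 2000.0)  # 16-bit weights
int4_speed = decode_tokens_per_second(70, 0.5, 2000.0)  # 4-bit weights

print(f"FP16 bound: {fp16_speed:.1f} tokens/s")
print(f"INT4 bound: {int4_speed:.1f} tokens/s")
```

Under these assumptions the bound depends only on bytes moved, not on FLOPS, which is why quantisation and batching (which amortises each weight read across many sequences) are the usual levers.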
As an interview topic, model serving reveals operational depth. Understanding why dynamic batching degrades tail latency, how tensor parallelism splits model weights across GPUs, and when to use quantisation to fit a model on fewer GPUs demonstrates the infrastructure sophistication that AI platform roles require.
The model serving execution model involves a request arriving via an API gateway, being queued, and then dispatched to a model runner that manages hardware resources.
Incoming requests are serialized, queued, batch-scheduled by the engine, processed on the GPU, and streamed back to the client.
Client Request
↓
[API Gateway]
↓
[Request Queue]
↓
[Batch Scheduler]
↓
[Model Runner]
↓
[GPU/TPU Memory]
↓
[KV Cache Manager]
↓
Response Stream
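The flow above can be approximated in a few lines of `asyncio`. This is a minimal sketch, not how any particular engine is implemented: `model_runner` is a stand-in that "generates" one token per input word, there is no batch scheduler or KV cache, and all function names are hypothetical.

```python
import asyncio


async def model_runner(prompt: str):
    """Stand-in for a real engine: yields one 'token' per word of the prompt."""
    for word in prompt.split():
        await asyncio.sleep(0)  # yield control, as a real decode step would
        yield word.upper()


async def worker(queue: asyncio.Queue) -> None:
    """Pulls queued requests and streams tokens into each request's own channel."""
    while True:
        prompt, out = await queue.get()
        async for token in model_runner(prompt):
            await out.put(token)
        await out.put(None)  # sentinel marking the end of the stream
        queue.task_done()


async def handle_request(queue: asyncio.Queue, prompt: str) -> list[str]:
    """Gateway side: enqueue the request, then collect the streamed response."""
    out: asyncio.Queue = asyncio.Queue()
    await queue.put((prompt, out))
    tokens = []
    while (token := await out.get()) is not None:
        tokens.append(token)
    return tokens


async def main() -> list[list[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue))
    results = await asyncio.gather(
        handle_request(queue, "hello model server"),
        handle_request(queue, "second request"),
    )
    task.cancel()  # stop the worker once both responses are complete
    return results
```

Calling `asyncio.run(main())` returns one token list per request, in the order the requests were submitted.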
Sidecar proxy: deploy the inference server alongside a proxy that handles logging and auth.
Trade-offs: Adds a network hop and some latency, but simplifies security management.
Model warm-up: run dummy inputs through the model before accepting live traffic.
Trade-offs: Increases startup time but prevents latency spikes on the first real requests.
Request batching: group multiple requests into a single tensor operation.
Trade-offs: Improves throughput but increases per-request latency.
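Grouping requests into one tensor usually means padding them to a common length. The sketch below shows that step using plain Python lists in place of tensors. The function name and `pad_id` value are illustrative assumptions.

```python
def pad_batch(requests: list[list[int]], pad_id: int = 0):
    """Pad variable-length token lists to one rectangular batch.

    Returns the padded batch plus a mask (1 = real token, 0 = padding)
    so downstream code can ignore the padded positions.
    """
    max_len = max(len(r) for r in requests)
    batch = [r + [pad_id] * (max_len - len(r)) for r in requests]
    mask = [[1] * len(r) + [0] * (max_len - len(r)) for r in requests]
    return batch, mask


batch, mask = pad_batch([[5, 6, 7], [8], [9, 10]])
padded_slots = sum(row.count(0) for row in mask)
total_slots = len(mask) * len(mask[0])
print(f"{padded_slots} of {total_slots} positions are padding")
```

The padded positions are wasted compute, which is one reason batches of very uneven lengths raise per-request latency without a matching throughput gain.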
| Aspect | Recommendation |
| --- | --- |
| Reliability | Implement readiness and liveness probes for model servers (vLLM, Triton) that check both HTTP endpoint availability and GPU memory status. Use circuit breakers in the API gateway layer to prevent cascading failures when a model server is overwhelmed. Deploy at least 2 replicas per model to eliminate single points of failure. |
| Scalability | Horizontal pod autoscaling based on GPU duty cycle or request queue depth. |
| Performance | Target first-token latency P99 < 200ms for chat applications and throughput > 500 tokens/second/GPU for batch inference. Enable FlashAttention-2 for 2-3x attention speedup. Use continuous batching (vLLM) to maximise GPU utilisation. Quantise to FP8 or INT4 to reduce memory bandwidth bottleneck. |
| Cost | Use spot instances, model quantization, and multi-model sharing on single GPUs. |
| Security | Implement API gateways with JWT auth and rate limiting to prevent model abuse. |
| Monitoring | Track latency (P99), throughput (tokens/sec), and GPU memory utilization. |
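Since the table leans heavily on P99 targets, here is a small sketch of how a tail percentile can be computed from raw latency samples using the nearest-rank method. The sample values are made up for illustration. Production systems typically use histogram-based estimates from their metrics stack instead.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up latencies: mostly fast, with two slow outliers
latencies_ms = [50.0] * 98 + [180.0, 950.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.1f}ms p50={p50}ms p99={p99}ms")
```

The mean and median barely move when a few requests are slow, which is why SLAs are usually stated on tail percentiles.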
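Static batching waits for a fixed number of requests before processing, so latency suffers when the batch fills slowly, and the whole batch runs until its longest sequence finishes. Dynamic batching groups whatever requests arrive within a short time window (or until a maximum batch size is reached), trading a small queueing delay for higher throughput. Continuous batching, used by vLLM, goes further by scheduling at the token-iteration level: as soon as a sequence finishes, its slot is filled with a waiting request, which improves throughput while keeping latency low.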
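To see why refilling GPU slots as soon as they free up helps, the sketch below counts decode steps for the same workload under two policies. It is a simplified simulation under stated assumptions: all requests are already queued, each request occupies one slot for as many steps as it has output tokens, and the function names are hypothetical.

```python
import heapq


def static_batching_steps(output_lens: list[int], slots: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(output_lens), slots):
        total += max(output_lens[i:i + slots])
    return total


def slot_refill_steps(output_lens: list[int], slots: int) -> int:
    """A freed slot is handed to the next waiting request immediately."""
    finish_times: list[int] = []  # min-heap of when each busy slot frees up
    for length in output_lens:
        if len(finish_times) < slots:
            heapq.heappush(finish_times, length)
        else:
            freed_at = heapq.heappop(finish_times)
            heapq.heappush(finish_times, freed_at + length)
    return max(finish_times, default=0)


lens = [2, 8, 3, 7, 1, 5]  # output tokens per request (made-up workload)
static_steps = static_batching_steps(lens, slots=2)
refill_steps = slot_refill_steps(lens, slots=2)
print(static_steps, refill_steps)
```

For this workload the fixed batches take 20 steps while slot refilling takes 14, because short requests no longer wait for the longest request in their batch.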
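LLMs cache the key and value tensors of previous tokens so they are not recomputed at every decoding step. The KV cache grows linearly with context length and with the number of concurrent sequences, consuming significant VRAM (it is attention compute, not the cache, that scales quadratically with sequence length). Efficient management, such as PagedAttention, reduces fragmentation and helps prevent OOM errors.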
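The cache size can be estimated directly from the model shape. The sketch below assumes a shape typical of 70B-class models with grouped-query attention (80 layers, 8 KV heads, head dimension 128, 16-bit values). These numbers are assumptions for illustration, so substitute the values from the model's own configuration.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed 70B-class shape with grouped-query attention, FP16 cache
per_token = kv_cache_bytes(1, layers=80, kv_heads=8, head_dim=128)
long_context = kv_cache_bytes(32_768, layers=80, kv_heads=8, head_dim=128)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{long_context / 2**30:.0f} GiB at 32,768 tokens")
```

Every term multiplies the token count once, so doubling the context doubles the cache.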
gRPC is generally preferable for high-throughput model serving because of its binary serialization (Protobuf) and native support for streaming, which suits token-by-token generation in LLMs. REST is simpler for stateless, low-frequency requests, and it can still stream tokens using server-sent events.
Quantization reduces the precision of model weights from FP32/FP16 to INT8 or 4-bit. This reduces memory usage and speeds up inference by utilizing specialized hardware instructions, though it may result in a slight loss of model accuracy.
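The precision trade-off can be illustrated with a minimal symmetric per-tensor INT8 scheme. This is one simple variant chosen for clarity, not the method of any specific library. Real toolchains usually quantise per channel or per group and use calibration data.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantisation: map [-max_abs, max_abs] to [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floating-point weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.02, -0.5, 0.31, 1.27]  # made-up weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error {max_error:.6f}")
```

Each weight is stored in one byte instead of two or four, and the rounding error per weight is bounded by half the scale, which is the source of the small accuracy loss.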
Cold starts are mitigated by pre-loading models into GPU memory, using warm-up routines with dummy data to compile kernels, and using container image optimization to reduce pull times in cloud environments.
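A warm-up routine can be as simple as the sketch below. It assumes the model is exposed as a Python callable. `FakeModel` is a hypothetical stand-in so the example runs without a GPU, and the round count is arbitrary.

```python
import time
from typing import Callable, Sequence


def warm_up(infer: Callable[[Sequence[int]], object],
            dummy_input: Sequence[int],
            rounds: int = 3) -> list[float]:
    """Run dummy inferences before serving and return each call's duration.

    On a real server the first call is typically the slowest because
    kernels are compiled and memory pools are allocated lazily.
    """
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        infer(dummy_input)
        timings.append(time.perf_counter() - start)
    return timings


class FakeModel:
    """Hypothetical stand-in for a loaded model; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, tokens: Sequence[int]) -> int:
        self.calls += 1
        return len(tokens)


model = FakeModel()
timings = warm_up(model, dummy_input=[0] * 8, rounds=3)
```

Only after `warm_up` returns would the replica report itself as ready for live traffic.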
An inference engine, such as Triton or vLLM, manages the execution of the model graph on the hardware. It handles operator scheduling, memory allocation, and kernel optimization to ensure the model runs as efficiently as possible.
Model sharding splits a large model across multiple GPUs or nodes. Tensor parallelism splits the weight matrices inside each layer across devices, so every device computes part of every layer. Pipeline parallelism assigns contiguous groups of layers to different devices, so activations pass from one stage to the next. Both allow models that exceed single-GPU memory to run.
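The tensor-parallel idea can be checked on a toy matrix multiply. The sketch below uses plain Python lists in place of GPU tensors and splits a weight matrix by columns across two pretend devices. It assumes the column count divides evenly by the number of shards.

```python
def matmul(x: list[list[int]], w: list[list[int]]) -> list[list[int]]:
    """Plain matrix multiply: (rows x k) times (k x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: list[list[int]], parts: int) -> list[list[list[int]]]:
    """Split a weight matrix into equal column shards, one per device."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


x = [[1, 2], [3, 4]]              # activations: batch=2, hidden=2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # weights: hidden=2, out=4

shards = split_columns(w, parts=2)
partials = [matmul(x, shard) for shard in shards]  # each device computes its slice
# Concatenate the per-device outputs along the column axis (an all-gather)
gathered = [sum((p[r] for p in partials), []) for r in range(len(x))]
full = matmul(x, w)
```

Each device holds only half of the weights, yet the gathered result matches the unsharded product, at the cost of a communication step per layer.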
A sidecar proxy is a secondary container deployed alongside the model container. It handles non-inference tasks like authentication, logging, and metrics collection, allowing the model container to focus solely on high-performance inference.
GPU memory (VRAM) is a finite and expensive resource. If the model weights plus the KV cache exceed VRAM, allocations fail with an out-of-memory (OOM) error and the service typically crashes. Efficient management ensures maximum throughput without exceeding hardware limits.
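A simple capacity check makes the constraint explicit. The sketch below is a rough budget, assuming fixed per-sequence KV usage and a flat reserve for activations and runtime overhead. All of the figures in the example are hypothetical.

```python
def max_concurrent_sequences(vram_gb: float, weights_gb: float,
                             kv_gb_per_seq: float, reserve_gb: float = 0.0) -> int:
    """How many sequences fit once weights and a safety reserve are subtracted."""
    usable = vram_gb - reserve_gb - weights_gb
    if usable <= 0:
        return 0  # the weights alone do not fit on this device
    return int(usable // kv_gb_per_seq)


# Hypothetical 80 GB GPU, 16 GB of weights, 8 GB held back as a reserve
fits_long = max_concurrent_sequences(80, 16, kv_gb_per_seq=1.0, reserve_gb=8)
fits_short = max_concurrent_sequences(80, 16, kv_gb_per_seq=0.5, reserve_gb=8)
too_big = max_concurrent_sequences(80, 140, kv_gb_per_seq=1.0, reserve_gb=8)
print(fits_long, fits_short, too_big)
```

Halving the per-sequence cache (shorter contexts or a quantised cache) doubles the number of sequences that fit, while a model whose weights exceed the budget needs sharding or quantisation before it can serve anything.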
A readiness probe tells the load balancer when a model replica is fully loaded and ready to accept traffic. This prevents traffic from being routed to a pod that is still loading weights or compiling kernels.
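A readiness endpoint can be sketched with the standard library alone. In the example below, the `/ready` path and the `ModelState` class are hypothetical names. A real probe might also check GPU memory status before answering.

```python
import threading
from http.server import BaseHTTPRequestHandler


class ModelState:
    """Tracks whether weights are loaded and warm-up has finished."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        self._ready.set()

    def status_code(self) -> int:
        # 503 tells the load balancer to keep traffic away from this replica
        return 200 if self._ready.is_set() else 503


STATE = ModelState()


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers GET /ready with 200 once the model is ready, 503 before that."""

    def do_GET(self) -> None:
        code = STATE.status_code() if self.path == "/ready" else 404
        self.send_response(code)
        self.end_headers()


status_before = STATE.status_code()
STATE.mark_ready()  # would be called after weight loading and warm-up
status_after = STATE.status_code()
```

To serve the probe, the handler can be passed to `http.server.HTTPServer` and run in a background thread next to the inference process.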