Model Serving Interview Preparation Guide

Introduction

Model serving is the critical engineering discipline of deploying trained AI models as production-grade, low-latency, high-throughput inference services. While model training is a batch offline process, model serving must handle concurrent real-time requests under strict latency SLAs, often P99 < 200ms for LLM first-token latency, while maximising GPU utilisation to control infrastructure costs.

In 2026, model serving has become significantly more complex as organisations deploy large language models with 7B to 70B+ parameters. The serving stack now spans framework choices (vLLM, Triton, TGI), hardware configuration (tensor parallelism, pipeline parallelism), memory management (KV cache, PagedAttention), batching strategies (continuous batching, dynamic batching), and deployment topology (single-node vs multi-node, Ray clusters).

This topic is essential for MLOps Engineers, AI Platform Engineers, and any engineer responsible for putting models into production at scale.

Why It Matters

A model that achieves 95% accuracy in evaluation but takes 30 seconds per response is not useful in production. Model serving engineering bridges the gap between offline evaluation quality and online production performance. A poorly chosen serving configuration can turn a profitable inference service into one with a 5x GPU cost overrun.

The shift to LLMs has made serving dramatically harder. LLM token generation (the decode phase) is typically memory-bound rather than compute-bound: the bottleneck is GPU memory bandwidth, not FLOPS. The KV cache for a long-context request can exceed 10GB on a 70B model. PagedAttention (vLLM) and continuous batching emerged specifically to address these constraints, enabling reported throughput gains of up to 10-20x compared to naive static batching.
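The per-request KV cache figure can be checked with simple arithmetic. The sketch below assumes a 70B-class model shape (80 layers, 8 key/value heads with grouped-query attention, head dimension 128, FP16 cache) and a 32,768-token context; the function name and these numbers are illustrative assumptions, not values taken from this guide.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values
    for every layer and every token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed 70B-class shape: 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
per_request_32k = kv_cache_bytes(80, 8, 128, seq_len=32_768)

print(f"per token:   {per_token / 2**10:.0f} KiB")
print(f"32k context: {per_request_32k / 2**30:.1f} GiB")
```

Under these assumptions a single 32k-token request needs about 10 GiB of cache, which is why the number of concurrent long-context requests, rather than compute, often limits a replica.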

As an interview topic, model serving reveals operational depth. Understanding why dynamic batching degrades tail latency, how tensor parallelism splits model weights across GPUs, and when to use quantisation to fit a model on fewer GPUs demonstrates the infrastructure sophistication that AI platform roles require.

Core Concepts

Architecture Overview

The model serving execution model involves a request arriving via an API gateway, being queued, and then dispatched to a model runner that manages hardware resources.

Data Flow

Incoming requests are deserialized, queued, batch-scheduled by the engine, processed on the GPU, and streamed back to the client.

Client Request
      ↓
 [API Gateway]
      ↓
 [Request Queue]
      ↓
 [Batch Scheduler]
      ↓
 [Model Runner]
      ↓
 [GPU/TPU Memory]
      ↓
 [KV Cache Manager]
      ↓
Response Stream

Key Components

Tools & Frameworks

Design Patterns

Sidecar Proxy Pattern (Deployment Pattern)

Deploying an inference server alongside a proxy for logging and auth.

Trade-offs: Adds network latency but simplifies security management.

Model Warm-up Pattern (Performance Pattern)

Running dummy inputs through the model before accepting live traffic.

Trade-offs: Increases startup time but prevents initial latency spikes.
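A minimal sketch of the warm-up idea, assuming a hypothetical `run_inference` callable standing in for the real model and a hypothetical `Server` wrapper that only reports ready once warm-up has finished. A real deployment would call the serving framework's own inference entry point instead.

```python
import time


def warm_up(run_inference, dummy_inputs, rounds=3):
    """Run dummy inputs through the model and return seconds spent per round."""
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        for item in dummy_inputs:
            run_inference(item)
        timings.append(time.perf_counter() - start)
    return timings


class Server:
    """Hypothetical wrapper: stays not-ready until warm-up completes."""

    def __init__(self, run_inference):
        self.run_inference = run_inference
        self.ready = False

    def start(self, dummy_inputs):
        warm_up(self.run_inference, dummy_inputs)
        self.ready = True  # only now should live traffic be routed here


def fake_model(prompt):
    # Stand-in for a real model call (assumption for illustration only).
    return prompt.upper()


server = Server(fake_model)
server.start(["hello", "a longer warm-up prompt"])
print(server.ready)
```

Using dummy inputs of several lengths is a reasonable choice because kernels and memory pools are often specialised by input shape.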

Request Batching Pattern (Optimization Pattern)

Grouping multiple requests into a single tensor operation.

Trade-offs: Improves throughput but increases per-request latency.
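To make the grouping step concrete, here is a small sketch that right-pads variable-length token lists into one rectangular batch with an attention mask. The helper name `pad_batch` and the `pad_id` value are assumptions for illustration; real engines do this on GPU tensors.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Right-pad token lists to equal length and build a 1/0 attention mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    batch, mask = [], []
    for ids in token_id_lists:
        padding = max_len - len(ids)
        batch.append(list(ids) + [pad_id] * padding)
        mask.append([1] * len(ids) + [0] * padding)
    return batch, mask


def padding_fraction(mask):
    """Share of batch slots that hold padding rather than real tokens."""
    total = sum(len(row) for row in mask)
    real = sum(sum(row) for row in mask)
    return (total - real) / total


batch, mask = pad_batch([[5, 6, 7], [8], [9, 10]])
print(batch)
print(mask)
print(padding_fraction(mask))
```

The padding fraction illustrates part of the cost side of the trade-off: short requests batched with long ones spend compute on padding and wait for the longest sequence.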

Common Mistakes

Production Considerations

Reliability	Implement readiness and liveness probes for model servers (vLLM, Triton) that check both HTTP endpoint availability and GPU memory status. Use circuit breakers in the API gateway layer to prevent cascading failures when a model server is overwhelmed. Deploy at least 2 replicas per model to eliminate single points of failure.
Scalability	Horizontal pod autoscaling based on GPU duty cycle or request queue depth.
Performance	Target first-token latency P99 < 200ms for chat applications and throughput > 500 tokens/second/GPU for batch inference. Enable FlashAttention-2 for faster attention (speedups of roughly 2-3x are commonly cited, depending on the baseline). Use continuous batching (vLLM) to maximise GPU utilisation. Quantise to FP8 or INT4 to ease the memory-bandwidth bottleneck.
Cost	Use spot instances, model quantization, and multi-model sharing on single GPUs.
Security	Implement API gateways with JWT auth and rate limiting to prevent model abuse.
Monitoring	Track latency (P99), throughput (tokens/sec), and GPU memory utilization.
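The latency and throughput metrics in the table can be computed from raw samples as in the sketch below. It uses the nearest-rank percentile definition, which is one of several common conventions, and the sample values are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def tokens_per_second(total_tokens, window_seconds):
    """Throughput over a measurement window."""
    return total_tokens / window_seconds


# Illustrative first-token latencies in milliseconds (made-up data).
latencies_ms = [120, 95, 110, 480, 105, 130, 98, 101, 115, 99]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
throughput = tokens_per_second(total_tokens=12_000, window_seconds=20)
print(p50, p99, throughput)
```

The gap between the median and P99 in this sample shows why averages alone hide the slow requests that an SLA is meant to bound.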

Key Trade-offs

• Latency vs Throughput
• Memory vs Precision
• Cost vs Availability

Scaling Strategies

• Horizontal Pod Scaling
• Multi-GPU Model Sharding
• Request Queue Sharding

Optimisation Tips

• Enable PagedAttention
• Use TensorRT-LLM kernels
• Quantize to INT8/FP8

FAQ

What is the difference between static and dynamic batching?

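Static batching waits for a fixed number of requests before processing, which adds latency when the batch is slow to fill, and the whole batch runs until its longest request finishes. Dynamic batching groups whatever requests arrive within a short queueing window, up to a maximum batch size, so batches dispatch sooner; the window still adds some delay, which shows up in tail latency. Continuous batching, used by LLM engines such as vLLM, goes further by admitting new requests and retiring finished ones at each generation step, filling free GPU slots immediately.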
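One common way to implement dynamic batching is a queue with a maximum batch size and a maximum wait time. The sketch below simulates that policy over timestamped arrivals so it stays deterministic; the function name, parameters, and arrival data are assumptions for illustration rather than any framework's API.

```python
def dynamic_batches(arrivals, max_batch_size, max_wait):
    """Group (arrival_time, request_id) pairs into batches.

    A batch is dispatched when it reaches max_batch_size, or when max_wait
    seconds have passed since its first request arrived.
    Returns a list of (dispatch_time, [request_ids]).
    """
    batches = []
    current, deadline = [], None
    for t, req in sorted(arrivals):
        if current and t > deadline:
            # The wait window expired before this request arrived.
            batches.append((deadline, current))
            current = []
        if not current:
            deadline = t + max_wait  # window starts with the first request
        current.append(req)
        if len(current) == max_batch_size:
            batches.append((t, current))  # full batch dispatches immediately
            current = []
    if current:
        batches.append((deadline, current))
    return batches


arrivals = [(0.00, "a"), (0.01, "b"), (0.02, "c"), (0.30, "d")]
result = dynamic_batches(arrivals, max_batch_size=2, max_wait=0.05)
for dispatch_time, requests in result:
    print(f"{dispatch_time:.2f}s -> {requests}")
```

In this run request "c" waits the full window alone before dispatch, a small example of how the wait setting trades batch size against added latency.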

Why is KV cache management critical for LLMs?

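LLMs cache the key and value tensors of previous tokens so they are not recomputed at every generation step. The KV cache grows linearly with context length and with the number of concurrent sequences (it is the attention computation, not the cache, that scales quadratically), so it consumes significant VRAM. Efficient management, like PagedAttention, reduces fragmentation and the risk of OOM errors.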
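The core idea behind paged KV cache management can be sketched as a block allocator: memory is divided into fixed-size blocks and each sequence gets a new block only when its last one is full. This is a toy illustration of the concept, not vLLM's implementation; the class and method names are hypothetical.

```python
class PagedKVAllocator:
    """Toy allocator: each sequence owns a list of fixed-size cache blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only if needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller should queue or preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("req-1")  # 17 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("req-2")  # 5 tokens need 1 block
print(len(alloc.block_tables["req-1"]), len(alloc.free_blocks))
```

Because blocks need not be contiguous, memory wasted per sequence is bounded by one partially filled block instead of a whole pre-reserved maximum-length region.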

Is REST or gRPC better for model serving?

gRPC is often preferred for high-throughput model serving because of its binary serialization (Protobuf) and native support for streaming, which suits token-by-token generation in LLMs. REST is simpler and works well for stateless, low-frequency requests, and HTTP with server-sent events is also widely used to stream tokens to clients.

What is model quantization?

Quantization reduces the precision of model weights from FP32/FP16 to INT8 or 4-bit. This reduces memory usage and speeds up inference by utilizing specialized hardware instructions, though it may result in a slight loss of model accuracy.
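As a rough illustration, the sketch below applies symmetric per-tensor INT8 quantization to a short list of weights and measures the round-trip error. Production schemes are usually per-channel or per-group and run on tensors; the function names and sample weights here are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.50, -1.27, 0.03, 1.00]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of two or four, and the rounding error per weight is at most half the scale, which is the source of the small accuracy loss.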

How do you handle cold starts in model serving?

Cold starts are mitigated by pre-loading models into GPU memory, using warm-up routines with dummy data to compile kernels, and using container image optimization to reduce pull times in cloud environments.

What is the role of an inference engine?

An inference engine, such as Triton or vLLM, manages the execution of the model graph on the hardware. It handles operator scheduling, memory allocation, and kernel optimization to ensure the model runs as efficiently as possible.

How does model sharding work?

Model sharding splits a large model across multiple GPUs or nodes. Tensor parallelism splits the weight matrices inside each layer across devices, so every device computes part of every layer; pipeline parallelism assigns contiguous groups of layers to different devices, so activations pass from one stage to the next. Both allow models that exceed single-GPU memory to run.
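Both forms of sharding can be illustrated on plain Python lists. In this sketch, tensor parallelism is shown as a column-wise split of one weight matrix whose partial outputs are concatenated, and pipeline parallelism as an assignment of contiguous layer ranges to stages. Devices are simulated and all names are hypothetical.

```python
def matmul(x, w):
    """Multiply x [n][k] by w [k][m], returning [n][m]."""
    return [[sum(row[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
            for row in x]


def split_columns(w, parts):
    """Split a weight matrix into column shards, one per simulated device."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each shard computes a slice of the output; slices are then concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_stages(num_layers, num_stages):
    """Assign contiguous layer ranges to pipeline stages as evenly as possible."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, parts=2))
print(pipeline_stages(num_layers=80, num_stages=4))
```

The concatenation step stands in for the collective communication between devices that tensor parallelism requires on every layer, which is why it is usually kept within a node with fast interconnect.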

What is a sidecar proxy in model deployment?

A sidecar proxy is a secondary container deployed alongside the model container. It handles non-inference tasks like authentication, logging, and metrics collection, allowing the model container to focus solely on high-performance inference.

Why is GPU memory management a top concern?

GPU memory (VRAM) is a finite and expensive resource. If a model's weights and its KV cache exceed VRAM, allocations fail with out-of-memory errors and the service can crash. Efficient management ensures maximum throughput without exceeding hardware limits.

What is the purpose of a readiness probe?

A readiness probe tells the load balancer when a model replica is fully loaded and ready to accept traffic. This prevents traffic from being routed to a pod that is still loading weights or compiling kernels.
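The logic behind a readiness endpoint can be sketched as a function that returns an HTTP status based on replica state. The `ReplicaState` fields and the response format are assumptions for illustration; a real server would expose this through its own health route.

```python
class ReplicaState:
    """Hypothetical flags a model server would set during startup."""

    def __init__(self):
        self.weights_loaded = False
        self.warmed_up = False


def readiness_response(state):
    """Return (http_status, body) for a readiness check."""
    missing = []
    if not state.weights_loaded:
        missing.append("weights not loaded")
    if not state.warmed_up:
        missing.append("warm-up not finished")
    if missing:
        # 503 signals the orchestrator to keep this replica out of rotation.
        return 503, "not ready: " + ", ".join(missing)
    return 200, "ready"


state = ReplicaState()
print(readiness_response(state))
state.weights_loaded = True
state.warmed_up = True
print(readiness_response(state))
```

Keeping readiness separate from liveness matters here: a replica that is still loading weights is alive and should not be restarted, only kept out of the load balancer.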