Each test is 5 questions with varying difficulty.
AI Prep covers AI Agents, Generative AI, ML Fundamentals, NLP & LLMs and a lot more, with adaptive tests and daily challenges. Fully offline on Android. Free to try, one-time unlock for lifetime access.
Quantization is the process of reducing the numerical precision of model weights and/or activations, from FP32 or BF16 down to INT8, INT4, or even binary formats, to shrink the model's memory footprint and accelerate inference. In 2026, quantization is an essential skill for any engineer deploying large language models: a 70B parameter model in BF16 requires ~140GB of VRAM for the weights alone, making it undeployable on a single A100 80GB. INT4 quantization reduces this to ~35GB, which fits on a single GPU.
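The footprint figures above follow from simple arithmetic: parameter count times bits per weight. A minimal sketch, assuming weights only (KV cache, activations, and quantization metadata such as scales are ignored); the helper name `weight_memory_gb` is illustrative, not from any library:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Weights only: KV cache, activations, and per-group scales are ignored.
footprints = {
    name: weight_memory_gb(70e9, bits)
    for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]
}
for name, gb in footprints.items():
    print(f"{name}: {gb:.0f} GB")
# BF16: 140 GB
# INT8: 70 GB
# INT4: 35 GB
```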
Quantization interview questions assess whether a candidate understands the accuracy vs efficiency tradeoff at the weights and activations level. Junior engineers are expected to know the difference between INT8 and INT4, and when to apply bitsandbytes vs GPTQ. Mid-level engineers must reason about calibration dataset selection, per-channel vs per-tensor granularity, and activation outlier handling. Senior engineers are assessed on GPTQ's Hessian-based compensation, AWQ's activation-aware scaling, and FP8 training with hardware-native support on H100 and H200 GPUs.
The economics of LLM deployment are dominated by GPU cost, and GPU cost scales with the number of GPUs needed to fit the model. A 70B model in BF16 needs ~140GB for weights alone, so it requires at least 2× A100 80GB GPUs, and deployments often use 4× to leave room for the KV cache and activations. The same model quantized to INT4 (AWQ) fits on a single A100 with room for the KV cache. At scale, this represents a 2-4x infrastructure cost reduction for the same model capability.
Beyond memory, quantization improves throughput. LLM decoding is memory-bandwidth-bound: the bottleneck is moving weights from GPU HBM to the compute units. INT4 weights require 4x less bandwidth than BF16, enabling 2-4x higher tokens-per-second on the same GPU. This is why serving frameworks like vLLM integrate AWQ and GPTQ natively.
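A rough way to see the bandwidth argument is a back-of-the-envelope bound: if every weight must be read from HBM once per generated token, tokens per second cannot exceed bandwidth divided by weight bytes. The sketch below assumes batch-1 decoding and a round 2 TB/s bandwidth figure chosen for illustration, not a measured value. It compares bandwidth only and ignores whether the BF16 weights fit on one device. Measured speedups tend to land below the 4x ratio because of dequantization work and KV-cache reads.

```python
def max_decode_tokens_per_s(num_params: float, bits_per_weight: float,
                            hbm_bytes_per_s: float) -> float:
    """Upper bound for batch-1 decoding: every weight is read once per token."""
    weight_bytes = num_params * bits_per_weight / 8
    return hbm_bytes_per_s / weight_bytes

HBM_BYTES_PER_S = 2.0e12  # assumed round figure, not a measured value

bf16_rate = max_decode_tokens_per_s(70e9, 16, HBM_BYTES_PER_S)
int4_rate = max_decode_tokens_per_s(70e9, 4, HBM_BYTES_PER_S)
print(f"BF16 bound: {bf16_rate:.1f} tok/s, INT4 bound: {int4_rate:.1f} tok/s")
print(f"ratio: {int4_rate / bf16_rate:.1f}x")
```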
As an interview topic, quantization reveals hardware-aware engineering depth. Understanding why per-channel quantization preserves accuracy better than per-tensor, what causes activation outliers in LLMs and how SmoothQuant addresses them, and why calibration dataset quality matters for PTQ demonstrates the systems-level understanding that MLOps and AI infrastructure roles require.
Quantization transforms the model weights and activations from high-precision floating point to lower-precision integers, requiring specific kernels for dequantization during the forward pass.
Calibration data passes through the model to collect activation statistics, which are then used to compute optimal scales for weight quantization. During inference, quantized weights are loaded and dequantized on-the-fly by optimized kernels before matrix multiplication.
[Calibration Data]
↓
[Activation Statistics]
↓
[Weight Quantizer] ← [Scaling Factors]
↓
[Quantized Weights]
↓
[Inference Engine]
↓ ↓
[Dequant Kernel] [MatMul]
↓ ↓
[Output Activation]
Quantize weights to 4-bit while keeping activations in FP16 to maintain accuracy.
Trade-offs: Reduces memory footprint significantly but adds dequantization overhead at inference time.
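The weight-only pattern above can be sketched in NumPy as follows. This is a minimal illustration, assuming symmetric per-row scales and the signed 4-bit range [-8, 7]; the function names are hypothetical, and the 4-bit values are stored one per `int8` here rather than packed two per byte as a real kernel would.

```python
import numpy as np

def quantize_int4_symmetric(w: np.ndarray):
    """Per-output-row symmetric quantization to the signed 4-bit range [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def linear_w4a16(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """FP16 activations times 4-bit weights: dequantize first, then matmul."""
    w_hat = q.astype(np.float16) * scale              # on-the-fly dequantization
    return x @ w_hat.T

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)     # (out_features, in_features)
x = rng.normal(size=(8, 128)).astype(np.float16)      # activations stay in FP16

q, scale = quantize_int4_symmetric(w)
y_q = linear_w4a16(x, q, scale)
y_ref = x.astype(np.float32) @ w.T
rel_err = np.linalg.norm(y_q.astype(np.float32) - y_ref) / np.linalg.norm(y_ref)
print(f"relative output error: {rel_err:.3f}")
```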
Calculate separate scale factors for each output channel in a weight matrix.
Trade-offs: Higher accuracy than per-tensor but requires more complex kernel logic.
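To illustrate why granularity matters, the sketch below quantizes a toy weight matrix whose rows differ in magnitude by orders of magnitude, once with a single per-tensor scale and once with one scale per output channel. The row magnitudes are made up for the demonstration. With a shared scale, the small rows collapse toward zero, while per-channel scales keep their relative error low.

```python
import numpy as np

def quant_dequant(w: np.ndarray, scale, qmax: int = 127) -> np.ndarray:
    """Symmetric INT8 round trip: quantize, clip, and map back to floats."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def row_relative_error(w: np.ndarray, w_hat: np.ndarray) -> np.ndarray:
    return np.linalg.norm(w - w_hat, axis=1) / np.linalg.norm(w, axis=1)

rng = np.random.default_rng(0)
row_magnitudes = np.array([[0.01], [0.1], [1.0], [10.0]])  # illustrative only
w = rng.normal(size=(4, 256)) * row_magnitudes

scale_tensor = np.abs(w).max() / 127                        # one scale for everything
scale_channel = np.abs(w).max(axis=1, keepdims=True) / 127  # one scale per row

rel_tensor = row_relative_error(w, quant_dequant(w, scale_tensor))
rel_channel = row_relative_error(w, quant_dequant(w, scale_channel))
print("per-tensor  relative error by row:", np.round(rel_tensor, 4))
print("per-channel relative error by row:", np.round(rel_channel, 4))
```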
Use a small representative dataset to tune scale factors post-training.
Trade-offs: Faster than QAT but can be sensitive to the choice of calibration data.
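One simple form of post-training tuning is to search for a clipping ratio that shrinks each row's quantization range, then keep the ratio whose quantized layer output best matches the full-precision output on the calibration batch. The sketch below is one possible approach under that assumption, not the procedure of any specific library; `calibrate_clip_ratio` and the candidate grid are hypothetical.

```python
import numpy as np

def quant_dequant_int4(w: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return np.clip(np.round(w / scale), -8, 7) * scale

def calibrate_clip_ratio(w, calib_x, ratios=np.linspace(0.5, 1.0, 11)):
    """Return the clip ratio (and its error) that minimizes output MSE
    on the calibration batch."""
    ref = calib_x @ w.T
    absmax = np.abs(w).max(axis=1, keepdims=True)
    best_ratio, best_err = None, np.inf
    for r in ratios:
        scale = r * absmax / 7.0            # r < 1 clips the largest weights
        out = calib_x @ quant_dequant_int4(w, scale).T
        err = np.mean((out - ref) ** 2)
        if err < best_err:
            best_ratio, best_err = float(r), float(err)
    return best_ratio, best_err

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 128))
calib_x = rng.normal(size=(64, 128))        # stand-in for representative inputs

ratio, err = calibrate_clip_ratio(w, calib_x)
print(f"chosen clip ratio: {ratio:.2f}, calibration MSE: {err:.4f}")
```

Because the chosen ratio is fitted to `calib_x`, a calibration set that does not resemble production inputs can select a poor scale, which is the sensitivity noted in the trade-offs.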
| Aspect | Consideration |
| --- | --- |
| Reliability | Quantization can lead to silent failures where model outputs become nonsensical; implement rigorous evaluation on validation sets. |
| Scalability | Quantization enables horizontal scaling by fitting larger models on cheaper, smaller GPU instances. |
| Performance | Expect 2x-4x throughput improvements due to reduced memory bandwidth requirements. |
| Cost | Reduces cloud infrastructure costs by allowing smaller instance types (e.g., A10 vs A100). |
| Security | Quantized models may be more susceptible to adversarial attacks; ensure robust input sanitization. |
| Monitoring | Track 'quantization error' metrics and monitor latency distributions in production. |
PTQ (Post-Training Quantization) is applied to a pre-trained model without retraining, making it fast but potentially less accurate. QAT (Quantization-Aware Training) incorporates quantization effects during training, resulting in higher accuracy at the cost of significantly more compute and time.
4-bit quantization lets large LLMs fit into the limited VRAM of consumer-grade GPUs, enabling inference on hardware that would otherwise be unable to load the model. It cuts weight memory roughly 4x relative to FP16/BF16 with manageable accuracy degradation.
Calibration data is a small, representative subset of the training or validation data used to calculate optimal scale factors and zero-points. It ensures the quantized model maintains performance on real-world data distributions.
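GPTQ quantizes weights layer by layer and uses second-order information (an approximate Hessian of the layer's reconstruction error) to adjust the not-yet-quantized weights so that they compensate for the error already introduced. AWQ (Activation-Aware Weight Quantization) identifies salient weight channels from activation magnitudes and protects them by scaling them before quantization rather than keeping them in higher precision. It needs no backpropagation or weight reconstruction, which often makes it more efficient to apply to LLMs.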
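The activation-aware idea rests on a per-input-channel scaling identity: multiplying a weight column by `s` and dividing the matching activation channel by `s` leaves the layer output unchanged, but changes what the quantizer sees. The sketch below shows only that identity plus a toy error comparison. The scale vector is hand-picked for illustration, whereas AWQ searches for it using calibration data, and the toy numbers are not a benchmark.

```python
import numpy as np

def quant_dequant_int4_rows(w: np.ndarray) -> np.ndarray:
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))      # (out_features, in_features)
x = rng.normal(size=(32, 128))
x[:, :4] *= 50.0                    # a few outlier (salient) input channels

s = np.ones(128)
s[:4] = 2.0                         # hypothetical hand-picked scale, not searched

ref = x @ w.T
scaled = (x / s) @ (w * s).T        # same output before any quantization

err_plain = np.mean((x @ quant_dequant_int4_rows(w).T - ref) ** 2)
err_scaled = np.mean(((x / s) @ quant_dequant_int4_rows(w * s).T - ref) ** 2)
print(f"identity holds: {np.allclose(ref, scaled)}")
print(f"output MSE without scaling: {err_plain:.3f}, with scaling: {err_scaled:.3f}")
```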
Group-wise quantization divides a weight tensor into smaller blocks (groups) and applies independent scale factors to each. This increases granularity, allowing for more precise mapping and reducing the accuracy loss common in low-bit quantization.
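A minimal sketch of group-wise quantization, assuming symmetric 4-bit quantization and groups taken along the input dimension; the heavy-tailed toy weights are made up to mimic occasional outliers. Smaller groups isolate outliers so they inflate fewer scales, at the cost of storing more scales: with an FP16 scale per group, the overhead is 16 / group_size bits per weight (0.125 bits at group size 128).

```python
import numpy as np

def groupwise_quant_dequant(w: np.ndarray, group_size: int, qmax: int = 7) -> np.ndarray:
    """Symmetric 4-bit round trip with one scale per group of input weights."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero groups
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=(16, 512))              # heavy-tailed toy weights

mse_by_group = {}
for group_size in (512, 128, 32):
    w_hat = groupwise_quant_dequant(w, group_size)
    mse_by_group[group_size] = float(np.mean((w - w_hat) ** 2))
    print(f"group_size={group_size:4d}  mse={mse_by_group[group_size]:.4f}")
```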
Quantization almost always involves some loss of precision. However, techniques like QAT, AWQ, and smaller group sizes can minimize this loss to negligible levels for most applications, balancing accuracy against memory and speed gains.
Symmetric quantization maps the floating point range symmetrically around zero to an integer range, simplifying hardware implementation. Asymmetric quantization uses a zero-point offset, allowing it to handle skewed distributions more effectively.
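The difference shows up clearly on a skewed distribution. The sketch below, assuming 8-bit quantization and ReLU-like values that are all non-negative, compares the two schemes: the symmetric grid spends half its levels on negative values that never occur, while the zero-point lets the asymmetric grid cover only the observed range.

```python
import numpy as np

def symmetric_qdq(x: np.ndarray, bits: int = 8) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def asymmetric_qdq(x: np.ndarray, bits: int = 8) -> np.ndarray:
    qmax = 2 ** bits - 1                       # 255 for 8 bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax
    zero_point = np.round(-lo / scale)         # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
x = np.maximum(rng.normal(size=10_000), 0.0)   # ReLU-like: all values >= 0

mse_sym = float(np.mean((x - symmetric_qdq(x)) ** 2))
mse_asym = float(np.mean((x - asymmetric_qdq(x)) ** 2))
print(f"symmetric MSE:  {mse_sym:.2e}")
print(f"asymmetric MSE: {mse_asym:.2e}")
```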
Standard matrix multiplication kernels expect FP16/BF16 inputs. Quantized kernels must perform on-the-fly dequantization or use specialized hardware instructions (like INT4/INT8 tensor core ops) to compute results efficiently without full dequantization.
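What a quantized kernel has to do can be mimicked in NumPy: store two 4-bit values per byte, then unpack and rescale them just before the matrix multiplication. This is a sketch of the data layout only, with a low-nibble-first packing order chosen for illustration; real GPU kernels unpack tile by tile in registers instead of materializing the full dequantized matrix.

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit values in [-8, 7] two per byte, low nibble first."""
    assert q.shape[-1] % 2 == 0
    u = (q.astype(np.int16) & 0xF).astype(np.uint8)   # two's-complement nibble
    return u[..., 0::2] | (u[..., 1::2] << 4)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    lo = (packed & 0xF).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo >= 8, lo - 16, lo)               # restore the sign
    hi = np.where(hi >= 8, hi - 16, hi)
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out

def matmul_packed_int4(x: np.ndarray, packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize on the fly, then run an ordinary floating-point matmul."""
    w_hat = unpack_int4(packed).astype(np.float32) * scale
    return x @ w_hat.T

rng = np.random.default_rng(0)
q = rng.integers(-8, 8, size=(16, 64), dtype=np.int8)      # toy 4-bit weights
scale = rng.uniform(0.01, 0.1, size=(16, 1)).astype(np.float32)
x = rng.normal(size=(4, 64)).astype(np.float32)

packed = pack_int4(q)
y = matmul_packed_int4(x, packed, scale)
print(f"unpacked bytes: {q.nbytes}, packed bytes: {packed.nbytes}")
```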
Quantization reduces memory bandwidth pressure, which is often the primary bottleneck for LLM decoding. This leads to higher throughput and lower latency, provided the hardware and kernels are optimized for the specific quantization format.
The main risk is accuracy degradation, which can manifest as silent failures or nonsensical outputs. It is critical to perform rigorous validation and testing on domain-specific datasets before deploying quantized models.