What Is Inference?
Inference is the process of running a trained AI model to make predictions or generate outputs on new data. Training teaches the model; inference uses the model. When you send a prompt to ChatGPT and get a response, that's inference.
How Inference Works
Training and inference have very different requirements. Training needs massive compute (GPU clusters running for days to weeks). Inference is far cheaper per request: a single GPU, or even a CPU, can run many models. But inference speed and cost dominate at scale, when you're serving millions of requests.
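At serving scale, per-request latency is the number you watch. A minimal sketch of measuring it, using a stand-in function in place of a real model (the model and timing loop here are illustrative, not from any particular serving stack):

```python
import time

def model(prompt):
    # Stand-in for real inference; a deployed system would call
    # an actual trained model here.
    return prompt.upper()

n_requests = 1000
start = time.perf_counter()
for _ in range(n_requests):
    model("hello")
elapsed = time.perf_counter() - start

print(f"avg latency: {elapsed / n_requests * 1000:.4f} ms")
print(f"throughput:  {n_requests / elapsed:.0f} requests/sec")
```

The same pattern works for LLMs, except latency is usually reported per token generated rather than per request.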
Key Concepts
- Latency — How long inference takes — measured in milliseconds or tokens per second for LLMs
- Batch Inference — Processing multiple inputs at once for higher throughput — common for offline data processing
- Edge Inference — Running models on devices (phones, IoT) rather than cloud servers — requires smaller, optimized models
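Batch inference from the list above can be sketched in a few lines. The toy model below just doubles its inputs (an illustrative stand-in); the point is the chunking loop, which is how real serving stacks amortize per-call overhead:

```python
def model(batch):
    # Stand-in for a trained model: processes a whole batch at once.
    return [2 * x for x in batch]

def batch_inference(inputs, batch_size=4):
    """Run the model over inputs in fixed-size batches."""
    outputs = []
    for i in range(0, len(inputs), batch_size):
        outputs.extend(model(inputs[i:i + batch_size]))
    return outputs

print(batch_inference(list(range(10))))  # [0, 2, 4, ..., 18]
```

On a GPU, each `model(batch)` call processes the whole batch in parallel, so larger batches raise throughput at the cost of per-request latency.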
Frequently Asked Questions
What is the difference between training and inference?
Training teaches the model by adjusting its weights on training data (expensive, done once up front). Inference uses the trained model to make predictions on new data (cheap per request, done millions of times).
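The split can be shown with a one-weight linear model, y = w * x. The data and learning rate below are made up for illustration; training adjusts the weight, inference just applies it:

```python
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs; true w = 2

# Training: repeatedly adjust the weight to reduce error (expensive).
w = 0.0
lr = 0.05
for _ in range(200):
    for x, y in data:
        grad = 2 * (w * x - y) * x  # d/dw of squared error
        w -= lr * grad

# Inference: apply the frozen weight to new data (cheap per request).
def predict(x):
    return w * x

print(round(predict(5.0), 2))  # ~10.0
```

Note the asymmetry: the training loop runs hundreds of weight updates, while each inference call is a single multiply.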
How do I make inference faster?
Use quantization (reduce weight precision from 32-bit floats to 8-bit integers), model distillation (train a smaller model to mimic a larger one), batching, caching, and GPU acceleration. For LLMs, tools like vLLM and TensorRT optimize inference speed.
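Quantization is the simplest of these to sketch. The toy weights below are illustrative; the idea is symmetric int8 quantization, mapping floats to 8-bit integers via a scale factor, trading a little precision for 4x less memory and faster integer math:

```python
weights = [0.31, -1.27, 0.05, 0.88, -0.42]

def quantize(ws):
    """Symmetric int8 quantization: int = round(float / scale)."""
    scale = max(abs(w) for w in ws) / 127  # map the largest weight to +/-127
    return [round(w / scale) for w in ws], scale

def dequantize(qs, scale):
    """Recover approximate floats: float = int * scale."""
    return [q * scale for q in qs]

q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each restored weight is within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
```

Production tools (e.g. the quantized-model paths in vLLM or TensorRT) use the same principle per tensor or per channel, plus calibration to pick the scales.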