With the widespread application of LLMs across various industries, achieving efficient inference while maintaining model performance has become a key challenge.
Introduction
Mainstream LLMs are based on the decoder-only Transformer architecture, with next-token prediction as their core task. During inference, the model must generate each token sequentially, and this autoregressive generation increases latency. Models come in a range of parameter scales, and even comparatively small models place high demands on compute and memory.
To address these challenges, academia and industry have proposed various optimization solutions. This article introduces several key technologies related to LLM inference acceleration. Corrections and suggestions are welcome.
Main Content
Inference Phase Overview
The inference phase is the process where a model receives input and produces output. Unlike the training phase, inference doesn’t require gradient computation and parameter updates, only forward propagation calculations.
During inference, the autoregressive nature of generation means the model produces tokens one by one, so the sequence it processes keeps growing. For example (see the short sketch after this list):
- With an input of 1,000 tokens and 100 new tokens to generate, the model works over contexts of length 1000, 1001, …, 1100; absent KV caching, that amounts to 1000 + 1001 + … + 1100 = 106,050 token positions in total
- Each new token must compute attention against all previous tokens
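The accounting above can be made concrete with a few lines of arithmetic. This is a minimal sketch assuming a 1,000-token prompt and 100 generated tokens; the numbers are purely illustrative and not tied to any framework.

```python
# Rough accounting of token positions processed with and without KV caching.
# Assumes a 1,000-token prompt and 100 generated tokens (illustrative only).

prompt_len, new_tokens = 1000, 100

# Without a KV cache, every step re-processes the whole sequence seen so far:
# contexts of length 1000, 1001, ..., 1100.
no_cache_positions = sum(range(prompt_len, prompt_len + new_tokens + 1))

# With a KV cache, the prompt is processed once (prefill) and each decode step
# only computes Q/K/V for the single new token.
with_cache_positions = prompt_len + new_tokens

print(no_cache_positions)    # 106050
print(with_cache_positions)  # 1100
```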
In a typical decoder-only architecture, there are multiple decoder blocks, and each token must pass through these blocks during inference. Each block contains 3 main computation modules: Self-Attention, FFN, and Layer Normalization (a basic component for ensuring numerical stability, which we’ll ignore in the following discussion).

The execution phases can be divided into prefill and decode stages:
Stage | Characteristics | Modules |
---|---|---|
Prefill | • Compute-intensive phase with O(n²) complexity due to full attention computation across all input tokens • Memory usage scales quadratically with sequence length due to attention matrix computation • Can leverage hardware parallelism as all tokens are processed simultaneously | • Self-Attention: Computes complete attention patterns between all input tokens, generating and caching K/V representations • FFN: Processes all input tokens through feed-forward networks in parallel |
Decode | • Linear complexity O(n) per token, but inherently sequential due to autoregressive nature • Memory growth is linear and predictable • Limited parallelization potential due to token-by-token generation | • Self-Attention: Generates Q for new token only, leverages cached K/V, computes attention scores efficiently • FFN: Processes only the newly generated token through feed-forward networks |
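To make the two stages concrete, here is a minimal, framework-free sketch of the prefill/decode split. The `model.forward` name and its KV-cache interface are assumptions made for illustration, not a real library API.

```python
# Minimal sketch of the prefill/decode split (hypothetical model interface).

def generate(model, prompt_ids, max_new_tokens):
    # Prefill: process all prompt tokens at once, building the KV cache.
    logits, kv_cache = model.forward(prompt_ids, kv_cache=None)
    next_id = int(logits[-1].argmax())          # greedy pick of the first new token
    output = [next_id]

    # Decode: one token at a time, reusing the cached K/V of everything before it.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.forward([next_id], kv_cache=kv_cache)
        next_id = int(logits[-1].argmax())
        output.append(next_id)
    return output
```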
Demonstration of the prefilling stage (a) and decoding stage (b)

Illustration of the memory variation through time (latency) during inference

Performance metrics during inference include:
- TTFT (Time-To-First-Token): Generation time for the first token, mainly measuring Prefill stage performance
- TPOT (Time-Per-Output-Token): Time to generate each subsequent token, mainly measuring Decode stage performance (see the measurement sketch below)
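As a quick illustration of how these two metrics are usually derived from token timestamps, here is a small sketch; `stream_tokens` is a hypothetical generator that yields tokens as the server produces them.

```python
import time

def measure(stream_tokens):
    """Compute TTFT and average TPOT from a token stream (hypothetical API)."""
    start = time.perf_counter()
    timestamps = []
    for _ in stream_tokens:                     # consume tokens as they arrive
        timestamps.append(time.perf_counter())

    ttft = timestamps[0] - start                # time to first token (Prefill)
    tpot = ((timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)
            if len(timestamps) > 1 else 0.0)    # average time per output token (Decode)
    return ttft, tpot
```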
Let’s walk through the inference process with a concrete example. Query: “What are some recent practices related to Deepseek R1?”
Stage | Module | Processing | Result |
---|---|---|---|
Prefill | Self-Attention | Words such as “recent” and “practices” receive high attention scores with “Deepseek R1”, producing an n×n attention matrix | Output the first token with the highest probability |
 | FFN | Looks up pretrained knowledge about DeepSeek stored in the model’s parameters | |
Decode | Self-Attention | The newly generated token’s q attends to the cached k/v of the input, recognizing that the question asks about practices and that practice-related knowledge is needed | Output subsequent tokens with the highest probability |
 | FFN | Looks up practice-related knowledge stored in the model’s parameters | |
In summary, the key overheads during inference are:
- Computational cost:
  - Prefill must process the entire input, and computation grows quadratically with input length
  - Model parameter counts keep increasing
  - Decode leaves the GPU’s compute underutilized (it is bound by memory bandwidth rather than arithmetic)
- Storage cost:
  - The KV Cache grows linearly with sequence length: the K/V of every historical token must be kept in memory (see the size estimate after this list)
  - FFN weights are the bulk of the storage overhead, estimated to account for over 70% of parameters
- IO cost: decode produces only one token per step, yet the full model must be run, and its weights re-read, for every step
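To give a feel for the storage cost, the sketch below estimates KV Cache size from the usual formula 2 × layers × kv_heads × head_dim × seq_len × batch × bytes. The example dimensions are roughly LLaMA-2-7B-like and are assumptions chosen for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # K and V are each [batch, n_kv_heads, seq_len, head_dim] per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative, roughly LLaMA-2-7B-like shapes (MHA, fp16):
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                      seq_len=4096, batch=1)
print(f"{size / 2**30:.1f} GiB")   # ~2.0 GiB for a single 4K-token sequence
```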
Technology Overview
The industry has proposed optimization solutions from different stages to address these challenges:

Data Level
Input Compression
Compress input text and prune low-quality information while minimizing performance impact to improve efficiency. Specific works include:
- LLMLingua Series: trains small models to prune and compress the original prompt. The approach casts compression as a token-classification problem: for each token, decide whether to keep or discard it. The classifier is trained on a dataset distilled from large models.
Other compression works in RAG scenarios:
- RECOMP: compresses retrieved documents into textual summaries, improving language model performance while reducing computational cost.
- xRAG: projects document embeddings directly into the LLM’s representation space through modal fusion.
- Conventional RAG with reranking: suited to information-dense scenarios; selecting only highly relevant content avoids “garbage in, garbage out”.
Figure: impact of irrelevant passages on RAG effectiveness (https://arxiv.org/pdf/2410.05983)
Output Planning
Improve parallelism of long text output while minimizing performance impact to enhance efficiency. Specific works include:
- SOT (Skeleton-of-Thought): first understands the intent and generates an outline of the answer, then expands each outline point in parallel, significantly improving over purely sequential decoding. It handles poorly the cases where points depend on one another, so the authors additionally train a router model to detect such dependencies and fall back to sequential decoding when they exist.
- SGD: goes a step beyond SOT by modeling the dependencies between points as a DAG (dependent points are connected by edges; independent points stand alone), further improving quality.
Model Level
Model Structure Optimization
Attention Optimization: The core is reducing the KV Cache and the amount of attention kernel computation.
To quote Su Shen’s (Su Jianlin’s) explanation of why reducing the KV Cache matters so much: LLM inference is generally performed on GPUs, and a single GPU’s memory is limited. Part of it stores the model parameters and the activations of the forward pass, which depends on the model size and is fixed once the model is chosen. Another part stores the KV Cache, which depends not only on the model size but also on the input length, and therefore grows dynamically during inference. Once the context is long enough, the KV Cache dominates and may exceed the memory of a single card or even a single machine (8 cards).

The rule of thumb for deploying models on GPUs is: if it fits on one card, don’t use multiple cards; if it fits on one machine, don’t use multiple machines. The reason is that intra-card bandwidth > inter-card bandwidth > inter-machine bandwidth, and because of the “bucket effect”, the more devices a deployment spans, the more it is dragged down by the slowest inter-device link. In fact, even though the bandwidth between SRAM and HBM on a single H100 has reached about 3 TB/s, this is still the bottleneck for short-context inference, to say nothing of the slower inter-card and inter-machine communication. The fundamental purpose of reducing the KV Cache, then, is to serve longer contexts on fewer devices, which means faster inference and lower cost.
- MHA (Multi-Head Attention): The most basic multi-head attention mechanism, where each attention head has independent Q, K, V parameter matrices. Although it can fully capture information from different feature dimensions, it needs to store KV Cache for all heads, resulting in large storage overhead.
- MQA (Multi-Query Attention): When computing attention, shared K/V is broadcast to each attention head, then each head still uses its independent Q with shared K/V to compute attention scores. KV Cache size is reduced to 1/h of original (h is number of heads). However, sharing KV may limit the model’s ability to capture diverse features.
- GQA (Grouped-Query Attention): A compromise between MQA and MHA, dividing attention heads into groups with heads in the same group sharing K/V. Compared to MQA it maintains better model quality, while still significantly reducing the KV Cache relative to MHA. Representative works: DeepSeek V1, LLaMa 3.1, Qwen2. (See the sketch after this list.)
- MLA (Multi-head Latent Attention): A new attention mechanism introduced by DeepSeek V2. By compressing keys and values into a low-dimensional latent space via learned latent variables, it dramatically shrinks the KV cache; compared to standard attention it reportedly reduces the KV cache by about 93.3%.
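The difference between MHA, MQA, and GQA is easiest to see in the K/V head counts. Below is a small numpy sketch of grouped-query attention in which several query heads share one K/V head; the dimensions and head counts are arbitrary illustrative choices, and the causal mask is omitted for brevity.

```python
import numpy as np

def gqa_attention(x, Wq, Wk, Wv, n_q_heads=8, n_kv_heads=2):
    """Grouped-query attention sketch: n_q_heads query heads share n_kv_heads K/V heads."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads                        # query heads per K/V head

    q = (x @ Wq).reshape(seq, n_q_heads, d_head)           # [seq, 8, d_head]
    k = (x @ Wk).reshape(seq, n_kv_heads, d_head)          # [seq, 2, d_head] -> smaller KV cache
    v = (x @ Wv).reshape(seq, n_kv_heads, d_head)

    outs = []
    for h in range(n_q_heads):
        kv = h // group                                    # which shared K/V head this query uses
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        outs.append(weights @ v[:, kv])
    return np.concatenate(outs, axis=-1)                   # [seq, d_model]

d_model, seq = 64, 5
x = np.random.randn(seq, d_model)
Wq = np.random.randn(d_model, d_model)
Wk = np.random.randn(d_model, d_model // 4)                # 2 of 8 heads => 4x smaller K/V
Wv = np.random.randn(d_model, d_model // 4)
print(gqa_attention(x, Wq, Wk, Wv).shape)                  # (5, 64)
```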
MOE (Mixture-of-Expert): An advanced neural architecture that replaces the traditional monolithic FFN with a dynamic routing system. The architecture consists of:
- A gate network that learns to route tokens to the most relevant expert networks
- Multiple specialized expert networks (typically 8-32) that process different aspects of the input
- A load balancing mechanism to ensure efficient utilization of experts
This design offers several advantages:
- Increased model capacity without proportional increase in computation cost
- Dynamic specialization where different experts can focus on different types of tokens/tasks
- Improved inference efficiency through selective expert activation
The approach originated in the 1991 paper “Adaptive Mixtures of Local Experts” and has seen renewed interest in modern LLMs. Since Mistral’s Mixtral 8x7B, MoE has become a mainstream trend in 2024, with DeepSeek’s implementations showcasing increasingly sophisticated architectures across their model versions. Key challenges in MoE implementation include:
- Balanced token routing to prevent expert overload
- Ensuring expert specialization without knowledge conflicts
- Maintaining routing efficiency at inference time (a minimal routing sketch follows this list)
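Here is a minimal sketch of the token-to-expert routing described above, with a softmax gate and top-k selection. The expert count, top-k value, and dimensions are illustrative assumptions; real systems add load-balancing losses, capacity limits, and batched expert execution.

```python
import numpy as np

def moe_forward(x, gate_W, experts, top_k=2):
    """Route each token to its top-k experts and mix their outputs by gate weight."""
    logits = x @ gate_W                                    # [tokens, n_experts]
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]                # indices of the top-k experts
        weights = probs[t, top] / probs[t, top].sum()      # renormalize over selected experts
        for w, e in zip(weights, top):
            W1, W2 = experts[e]
            out[t] += w * (np.maximum(x[t] @ W1, 0) @ W2)  # tiny ReLU FFN per expert
    return out

d, n_experts, tokens = 16, 4, 3
experts = [(np.random.randn(d, 4 * d), np.random.randn(4 * d, d)) for _ in range(n_experts)]
x = np.random.randn(tokens, d)
print(moe_forward(x, np.random.randn(d, n_experts), experts).shape)   # (3, 16)
```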
Model Compression
Model compression aims to reduce resource overhead by reducing model complexity (including parameter count, computation, and memory usage) while maintaining model performance as much as possible. Some representative measures include:

Quantization: Quantization reduces the numerical precision of model weights and activations. Most models are trained in 32-bit (single precision) or 16-bit (half precision) floating point, so each parameter or activation element occupies 32 or 16 bits of memory; we can instead represent weights with fewer bits. Common approaches are post-training quantization, which converts precision directly after training (simple to apply, but may cost some accuracy), and quantization-aware training, which simulates quantization during training so the model adapts to low precision in advance and loses less accuracy. The benefits are a much smaller model and faster computation. In the prefill stage the focus is on quantizing activations; in the decode stage the focus is on quantizing weights (prefill is compute-bound, so lower-precision activations speed up the matrix multiplications, while decode is memory-bandwidth-bound, so lower-precision weights reduce the data read each step).
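As a concrete example of post-training quantization, here is a sketch of symmetric per-output-channel INT8 weight quantization and the corresponding dequantization. This is a didactic illustration under simplified assumptions, not any particular library’s implementation.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-output-channel INT8 quantization of a weight matrix [out, in]."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0 + 1e-8   # one scale per output channel
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.randn(256, 512).astype(np.float32)
q, scale = quantize_int8(W)
err = np.abs(W - dequantize(q, scale)).mean()
print(q.nbytes / W.nbytes, err)    # INT8 weights use 1/4 the storage of FP32; small mean error
```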
Knowledge Distillation: a compression method that transfers knowledge from a large teacher model to a small student model. A common approach is post-training distillation, where the teacher generates data that is used to train the student; this can shrink the model to 40-60% of its original size while retaining most of its performance.
Pruning: compresses the model by removing unimportant connections or structures. Common methods include structured pruning (removing entire layers or heads) and unstructured pruning (removing individual weights); parameters can typically be cut by 30-50% with minimal performance loss.
Low-rank Decomposition: decomposes large weight matrices into products of smaller matrices to reduce parameter count and computation. Common methods include SVD-based factorization and LoRA. LoRA adds trainable low-rank matrices alongside the original weights to achieve parameter-efficient fine-tuning and has become one of the mainstream techniques in this family.
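The low-rank idea fits in a few lines: instead of touching the full weight W, LoRA learns a small rank-r update A·B. The sketch below only illustrates the forward pass and the parameter saving; dimensions and rank are arbitrary illustrative choices.

```python
import numpy as np

d_in, d_out, r = 4096, 4096, 8                 # illustrative dimensions and rank

W = np.random.randn(d_out, d_in)               # frozen pretrained weight
A = np.random.randn(d_out, r) * 0.01           # trainable low-rank factor
B = np.zeros((r, d_in))                        # zero-initialized so W is unchanged at the start

def lora_forward(x):
    # Effective weight is W + A @ B, but A @ B is never materialized.
    return x @ W.T + (x @ B.T) @ A.T

print(lora_forward(np.random.randn(2, d_in)).shape)   # (2, 4096)
print((A.size + B.size) / W.size)                     # ~0.004: the adapter is <0.5% of W's parameters
```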
Decoding Optimization
The most time-consuming part of inference stage is the autoregressive generation process. Decoding optimization aims to accelerate token generation through parallelization or prediction. Some representative measures include:
Speculative Decoding: a decoding technique for autoregressive large models that improves decoding efficiency without degrading output quality. The core idea is to let a smaller draft model cheaply propose several subsequent tokens and then have the large model verify these proposals in parallel, so that the large model can accept multiple tokens in roughly the time normally needed for a single inference step.
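A minimal greedy sketch of the draft-and-verify loop: the draft model proposes k tokens, the target model scores the whole proposal in one pass, and tokens are accepted up to the first disagreement. Real implementations use rejection sampling to preserve the target distribution exactly; `draft_next` and `target_argmax` are hypothetical stand-ins for the two models.

```python
def speculative_step(prefix, draft_next, target_argmax, k=4):
    """One draft-and-verify step (greedy variant, hypothetical model interfaces).

    draft_next(seq)    -> next token id predicted by the small draft model
    target_argmax(seq) -> list preds where preds[j] is the target model's greedy
                          next-token choice given seq[:j + 1]
    """
    # 1) Draft model cheaply proposes k tokens autoregressively.
    proposal, seq = [], list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        proposal.append(t)
        seq.append(t)

    # 2) Target model verifies all k proposals in a single parallel forward pass.
    target_preds = target_argmax(prefix + proposal)

    # 3) Accept proposals until the first disagreement, then take the target's own token.
    accepted = []
    for i, tok in enumerate(proposal):
        target_choice = target_preds[len(prefix) + i - 1]   # target's pick given prefix + proposal[:i]
        if target_choice == tok:
            accepted.append(tok)
        else:
            accepted.append(target_choice)
            break
    return prefix + accepted
```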
Skip-token Decoding: representative work SGLang. The core idea is to skip predictable parts of the output by exploiting deterministic structure in the text: instead of generating token by token, the decoder jumps directly to the next position that actually requires inference, greatly improving generation efficiency.
Constrained Decoding: representative work Outlines. Outlines converts a JSON Schema into a regular expression, builds a finite state machine (FSM) from that expression, and then filters candidate tokens in real time during generation so that the output always conforms to the predefined format. This approach is particularly suitable for scenarios requiring strict output-format control.
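The FSM-based filtering idea can be shown with a hand-built automaton for a tiny pattern (one or more digits followed by a closing brace). This is only a toy that demonstrates the mask-then-sample step, not how Outlines actually compiles a schema or tokenizes text.

```python
# Toy FSM for the pattern [0-9]+} : state 0 (need a digit), 1 (digit or '}'), 2 (done).
TRANSITIONS = {
    (0, "digit"): 1,
    (1, "digit"): 1,
    (1, "}"): 2,
}

def token_class(tok):
    return "digit" if tok.isdigit() else tok

def allowed_tokens(state, vocab):
    """Return the subset of the vocabulary that keeps the FSM in a valid state."""
    return [t for t in vocab if (state, token_class(t)) in TRANSITIONS]

vocab = ["1", "2", "7", "}", "{", "hello", ","]
state, generated = 0, []
while state != 2:
    mask = allowed_tokens(state, vocab)     # in a real decoder: zero out logits of all other tokens
    tok = mask[-1]                          # stand-in for sampling from the masked distribution
    generated.append(tok)
    state = TRANSITIONS[(state, token_class(tok))]
print("".join(generated))                   # e.g. "7}" -- always matches the pattern
```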
Structured Decoding: representative work Guidance. Its innovation is to split a generation task into fixed structure and dynamic content and handle them with different strategies: by identifying the template parts of the text, it greatly reduces the amount of content that must actually be generated, improving overall efficiency.
System Level
KV Cache
KV Cache optimization is critical for LLM inference efficiency as it represents a significant memory bottleneck. The optimization focuses on two key aspects:
Sparsification Compression:
- Approach: strategically select and retain only the most informative tokens while pruning less important ones
- Implementation: uses attention scores to identify key tokens
- Example:
  - Original sequence (dense): [T1, T2, T3, ..., T128K] (full 128K-token sequence)
  - After sparsification (sparse): [T1, T512, T1024, ..., T128K] (only ~1K tokens retained)
- Effects: retains tokens with high attention scores, maintains semantic coherence, and reduces the memory footprint by >90% (see the selection sketch after this list)
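A sketch of attention-score-based token selection: keep the tokens whose accumulated attention score is highest, plus the most recent ones, in the spirit of heavy-hitter methods such as H2O. The scoring rule and budgets here are simplified assumptions.

```python
import numpy as np

def select_kv_tokens(attn_scores, keep_budget, keep_recent=4):
    """Pick which cached tokens to keep.

    attn_scores: [queries, keys] attention weights accumulated over recent steps.
    Returns sorted indices of the tokens to retain in the KV cache.
    """
    importance = attn_scores.sum(axis=0)                      # how much each key was attended to
    recent = set(range(len(importance) - keep_recent, len(importance)))
    ranked = [i for i in np.argsort(importance)[::-1] if i not in recent]
    kept = set(ranked[: keep_budget - keep_recent]) | recent
    return sorted(kept)

scores = np.random.rand(8, 64)                    # 8 recent queries over a 64-token cache
print(select_kv_tokens(scores, keep_budget=16))   # 16 indices: 12 "heavy hitters" + 4 most recent
```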
Quantization Compression:
- Approach: reduce the numerical precision of stored K/V values
- Implementation methods:
  - Post-training quantization (e.g., FP16 → INT8)
  - Dynamic quantization during inference
  - Mixed-precision storage strategies
- Benefits (a small quantize/dequantize sketch follows this list):
  - 2-4x memory reduction with minimal accuracy impact
  - Improved cache hit rates
  - Better hardware utilization
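For the quantization side, here is a sketch of per-token, per-head INT8 quantization of cached K values (the same applies to V), with the scales stored in a smaller dtype alongside the INT8 payload. The grouping axis and shapes are illustrative assumptions.

```python
import numpy as np

def quantize_kv(k, axis=-1):
    """Quantize cached K (or V) of shape [tokens, heads, head_dim] to INT8 per token/head."""
    scale = np.abs(k).max(axis=axis, keepdims=True) / 127.0 + 1e-8
    q = np.clip(np.round(k / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

k = np.random.randn(4096, 32, 128).astype(np.float32)       # 4K tokens, 32 heads, head_dim 128
q, scale = quantize_kv(k)
ratio = (q.nbytes + scale.nbytes) / k.nbytes
print(f"stored at {ratio:.2f}x of FP32 size")                # ~0.25x, i.e. roughly 4x smaller
```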
These optimization strategies can be combined and tuned based on specific requirements:
- Long-context scenarios benefit more from sparsification
- Latency-sensitive applications might prefer quantization
- Production systems often use a hybrid approach
The optimization process typically follows this workflow:

PagedAttention
Traditional inference frameworks adopt a static memory allocation strategy: they pre-allocate fixed-size memory blocks according to batch_size × max_seq_len, resulting in low memory utilization. PagedAttention instead allocates KV memory dynamically in small blocks as sequences grow (traditional static allocation vs. PagedAttention dynamic allocation; see the sketch below).
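To contrast the two allocation strategies, here is a toy sketch of block-based (paged) KV allocation: memory is split into fixed-size blocks, and each sequence holds a block table mapping its logical positions to physical blocks, so memory grows on demand instead of being reserved up front. This is a didactic toy under assumed names and sizes, not vLLM’s actual implementation.

```python
BLOCK_SIZE = 16           # tokens per physical KV block (illustrative)

class PagedKVAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> list of physical block ids

    def append_token(self, seq_id, num_tokens_so_far):
        """Allocate a new physical block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:      # current blocks are full
            table.append(self.free_blocks.pop())
        return table

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_blocks=64)
for t in range(40):                                  # a 40-token sequence
    table = alloc.append_token("req-1", t)
print(len(table))        # 3 blocks (48 token slots) instead of a max_seq_len-sized reservation
```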

Prefix Cache
Limitation of plain PagedAttention: the KV Cache can only be reused within a single request and cannot be shared across requests, which is inefficient in multi-round dialogue scenarios. Prefix caching addresses this as follows:
- Cache retention: after a request completes, its KV Cache is retained in GPU memory and the corresponding token sequence is recorded in a RadixTree
- Prefix matching: a new request, e.g. “Hello, please introduce Beijing’s Forbidden City”, is matched against the RadixTree; it hits the cached KV of the earlier prompt “Hello, please introduce Beijing”, so only the new suffix (“’s Forbidden City”) has to be computed
- Cache management: an LRU (Least Recently Used) policy is applied; when GPU memory runs low, the least recently used KV Cache entries are identified and released first
This optimization is particularly effective for multi-round dialogue and can significantly improve response speed and resource utilization. A minimal sketch of the RadixTree-style prefix matching follows.
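In the toy sketch below, a simple character-level trie stands in for the RadixTree and returns how much of a new request is already covered by a cached prompt; a real system works on token IDs and stores KV block references at the nodes.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.has_cache = False      # marks that KV for the prefix ending here is cached

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prompt):
        node = self.root
        for ch in prompt:
            node = node.children.setdefault(ch, TrieNode())
            node.has_cache = True

    def longest_cached_prefix(self, prompt):
        node, matched = self.root, 0
        for i, ch in enumerate(prompt):
            if ch not in node.children or not node.children[ch].has_cache:
                break
            node = node.children[ch]
            matched = i + 1
        return prompt[:matched]

cache = PrefixCache()
cache.insert("Hello, please introduce Beijing")
hit = cache.longest_cached_prefix("Hello, please introduce Beijing's Forbidden City")
print(hit)    # "Hello, please introduce Beijing" -> only the remaining suffix is recomputed
```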
Others
There are many other optimization approaches at system level, such as tensor parallelism, pipeline parallelism, distributed inference etc. Since these optimization solutions involve relatively complex system architecture design, we won’t discuss them in detail here. Interested readers can refer to relevant technical documentation for deeper understanding.
Conclusion
Through this discussion of LLM inference acceleration technologies, we can see:
- Technical architecture and challenges
  - The inference process is divided into two main stages: Prefill and Decode
  - Core challenges:
    - Computational cost: input scale, parameter count, autoregressive generation
    - Storage cost: KV Cache, FFN parameter storage
    - IO cost: frequent model calls
- Optimization solutions (a multi-level technology stack)
  - Data-level optimization:
    - Input compression: LLMLingua, RECOMP, xRAG, etc.
    - Output planning: SOT, SGD, and other parallel-output techniques
  - Model-level optimization:
    - Structure optimization: attention mechanisms (MHA, MQA, GQA, MLA), MoE architecture
    - Model compression: quantization, knowledge distillation, pruning, low-rank decomposition
    - Decoding optimization: speculative decoding, skip-token decoding, constrained decoding, etc.
  - System-level optimization:
    - KV Cache optimization: sparsification compression, quantization compression
    - PagedAttention: dynamic memory management
    - Prefix Cache: cross-request KV reuse

As LLMs continue to expand into new application areas, inference acceleration technologies keep evolving, from early isolated optimizations to today's multi-level, comprehensive optimization stacks. These techniques allow LLMs to serve practical applications far more efficiently and lay an important foundation for the broader adoption of AI. With further advances in hardware and algorithms, we can expect more breakthrough inference acceleration solutions.
References:
- A Survey on Efficient Inference for Large Language Models (very helpful, recommended)
- Model Compression and Efficient Inference for Large Language Models: A Survey
- Mastering LLM Techniques: Inference Optimization
- AI Large Model Inference Process and Optimization Technologies
- How to make LLMs go fast
- Yidoo Blog
- Detailed Discussion on DeepSeek MoE Related Technical Development
- The Ultimate Trade-off Between Cache and Effect: From MHA, MQA, GQA to MLA
- A Survey on Model Compression for Large Language Models
Feel free to share articles from this site; please credit the author and the source, SimonAKing.