Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding

As agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Laboratory, Columbia University and Together AI has found a way to bake 3x throughput gains directly into a model's weights.

Unlike speculative decoding, which requires a separate drafting model, this approach requires no additional infrastructure — just a single special token added to the model's existing architecture.

The limits of next-token prediction

Next-token prediction — generating text one token per forward pass — creates a throughput ceiling that becomes painfully expensive when models need to produce thousands of tokens. This bottleneck is especially problematic in reasoning models, which frequently generate thousands of “chain of thought” tokens before producing the final response, leading to a slow and expensive user experience.

Multi-token prediction (MTP) offers an alternative training paradigm that allows a language model to produce multiple tokens simultaneously in a single forward pass. For example, the model can be trained to predict a block of tokens all at once instead of just the immediate next token.
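The sequential-step savings are easy to quantify. The toy functions below (illustrative names, not the paper's implementation) compare how many forward passes each paradigm needs to emit a fixed number of tokens:

```python
# Toy comparison of decoding passes: next-token prediction (NTP) emits one
# token per forward pass, while multi-token prediction (MTP) emits a block
# of tokens per pass. Names and block size are illustrative assumptions.

def ntp_passes(num_tokens: int) -> int:
    """One forward pass per generated token."""
    return num_tokens

def mtp_passes(num_tokens: int, block_size: int) -> int:
    """One forward pass per block of tokens (ceiling division)."""
    return -(-num_tokens // block_size)

# Generating a 1,000-token reasoning trace:
print(ntp_passes(1000))     # 1000 sequential passes
print(mtp_passes(1000, 4))  # 250 passes - a 4x cut in sequential steps
```

The saving is in sequential depth, not total compute: each MTP pass still does a full forward computation, but far fewer passes must run one after another, which is exactly what matters for single-user latency.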

John Kirchenbauer, a doctoral candidate in computer science at the University of Maryland and co-author of the paper, told VentureBeat that as we move toward agentic workflows, the focus is shifting from overall throughput to single-user speed. "Today, with ultra-long thinking traces being the norm and agentic outer loops multiplying out those costs even further, latency is becoming as equally important a dimension of overall serving efficiency as gross tokens per second per hardware unit (tps/GPU)," Kirchenbauer said. He said that while standard batched next-token prediction is already optimal for overall throughput, the new approach "strive[s] to saturate the GPU with just a single user's query to decrease latency for that single user."

Other methods exist, but they come with drawbacks. "It's worth noting that speculative decoding, and diffusion LLMs as an efficiency focused alternative to next token prediction (NTP), are both latency focused acceleration techniques," Kirchenbauer said. But speculative decoding requires deploying and managing an auxiliary "drafting" model, which spends more absolute compute to draft and verify. MTP, on the other hand, "leverages a similar sort of tradeoff, it's just simpler to serve and scientifically interesting in its own right."

Current MTP paradigms have limitations, however. The standard objective for training a language model for MTP involves comparing its predictions against ground-truth text from a dataset. The pitfall is that this standard training teaches the model to predict the probability of each token at each position independently, rather than modeling the joint relationship among the tokens in a sequence.

If a model tries to predict multiple tokens at once using this standard method, two major problems occur. The first is grammatical mismatch. For example, if a model predicts two words following the prefix "The zookeeper fed the," it might sample each position independently and produce a mismatched phrase like "panda meat" or "lion bamboo" instead of a coherent pairing like "panda bamboo" or "lion meat."
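The mismatch problem can be made concrete with toy probabilities (the numbers below are illustrative, not from the paper). A joint distribution only supports coherent pairs, but multiplying per-position marginals, as an independent MTP head effectively does, leaks probability onto incoherent combinations:

```python
# Why independent per-position sampling produces mismatched phrases.
# The joint distribution puts zero mass on "lion bamboo", but the
# product of marginals does not. Probabilities are illustrative.

joint = {("panda", "bamboo"): 0.5, ("lion", "meat"): 0.5}

# Per-position marginals, as a naive MTP head would model them.
p_first = {"panda": 0.5, "lion": 0.5}
p_second = {"bamboo": 0.5, "meat": 0.5}

# Independent sampling assigns real probability to an incoherent pair:
p_independent = p_first["lion"] * p_second["bamboo"]     # 0.25
p_joint = joint.get(("lion", "bamboo"), 0.0)             # 0.0

print(p_independent, p_joint)
```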

The second issue is degenerate repetition. Because text far into the future is essentially unpredictable, a model trained to predict a token 100 positions ahead against a standard dataset will default to "the," since it is the most common word in English. This results in the model emitting nonsense like "…the the the…" for far-future positions.
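This collapse follows directly from the loss: when a position is unpredictable, the error-minimizing constant guess is simply the mode of the token distribution. A tiny demonstration on an illustrative corpus:

```python
from collections import Counter

# Toy demonstration of the degenerate-repetition failure mode: when a
# far-future position carries no predictable signal, the prediction that
# minimizes error is the corpus's most frequent token. Corpus is made up.

corpus = "the cat sat on the mat and the dog slept by the door".split()

def best_constant_prediction(tokens):
    """The single token that minimizes expected error when the target
    position is unpredictable: the mode of the unigram distribution."""
    return Counter(tokens).most_common(1)[0][0]

print(best_constant_prediction(corpus))  # 'the'
```

Repeated across every far-future position, that mode prediction is exactly the "…the the the…" degeneracy the article describes.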

Multi-token prediction via self-distillation

To solve the issues of generating multiple tokens, the researchers propose a novel training technique that uses a student-teacher scheme. A student model, which is the model learning to predict multiple tokens, generates a deterministic multi-token block. A teacher model, acting as a strong standard next-token prediction language model, evaluates that block. The teacher acts as a critic, calculating how likely and coherent the student's proposed sequence is. If the student proposes a mismatched phrase like "lion bamboo," the teacher assigns it a high loss, teaching the student to avoid that construction.
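One way to picture the teacher-as-critic step: the student proposes a block, and the training signal is the teacher's negative log-likelihood of that block, scored autoregressively. The "teacher" below is a hard-coded lookup table and the function names are hypothetical; this sketches the scoring idea, not the paper's implementation:

```python
import math

# Sketch of the teacher-as-critic loss: score the student's proposed
# block under the teacher's autoregressive distribution. The tiny
# "teacher" is a hand-built table with illustrative probabilities.

def teacher_logprob(context: tuple, token: str) -> float:
    table = {
        (("The", "zookeeper", "fed", "the"), "panda"): 0.5,
        (("The", "zookeeper", "fed", "the", "panda"), "bamboo"): 0.9,
        (("The", "zookeeper", "fed", "the", "panda"), "meat"): 0.1,
    }
    return math.log(table.get((context, token), 1e-9))

def block_loss(prefix: tuple, block: tuple) -> float:
    """Negative log-likelihood of the student's block under the teacher."""
    loss, ctx = 0.0, prefix
    for tok in block:
        loss -= teacher_logprob(ctx, tok)
        ctx = ctx + (tok,)
    return loss

prefix = ("The", "zookeeper", "fed", "the")
coherent = block_loss(prefix, ("panda", "bamboo"))
mismatched = block_loss(prefix, ("panda", "meat"))
print(coherent < mismatched)  # True: the coherent block gets lower loss
```

Because the teacher scores each token in the context of the tokens before it, mismatched pairings accumulate a high loss, which is the signal that steers the student away from constructions like "lion bamboo."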

The paradigm is inspired by on-policy reinforcement learning because the student model is not simply memorizing static text. It generates a full rollout (a sequence of actions, in RL parlance) in parallel in a single forward pass and receives a reward based on how good the teacher thinks it is. Unlike static supervised methods where training pairs are fixed in advance, the feedback here is dynamic, generated from the student's own outputs in real time. The strong teacher also verifies the coherence of the tokens, which prevents the student model from learning degenerate outputs like repeated words.

For developers, the beauty of this approach lies in its simplicity. "There are truly no modifications to the architecture except for the addition of a special token," Kirchenbauer said. By co-opting an unused slot in a model's existing embedding matrix to act as an <MTP> mask token, the technique converts sequential operations into parallel ones. "Any standard next token prediction language model can be adapted in this way… the internal implementation — MoE, windowed attention, SSM layers, etc. — are left untouched and present no barrier to adaptation."
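A rough sketch of how the mask-token trick turns sequential decoding into parallel slot-filling (the "model" here is a stub and the helper names are hypothetical; the paper repurposes an unused slot in the embedding matrix for the real <MTP> token):

```python
# Sketch of the single-special-token adaptation: append k copies of an
# <MTP> mask token to the prompt, so one forward pass can fill all k
# future positions in parallel. The prediction source is a stub.

MTP = "<MTP>"

def build_mtp_input(prompt_tokens: list, k: int) -> list:
    """Append k mask tokens whose slots the model fills in one pass."""
    return prompt_tokens + [MTP] * k

def fill_masks(tokens: list, stub_predictions: list) -> list:
    """Replace each mask slot with the (stubbed) parallel prediction."""
    preds = iter(stub_predictions)
    return [next(preds) if t == MTP else t for t in tokens]

inp = build_mtp_input(["The", "answer", "is"], k=2)
print(inp)                                # [..., '<MTP>', '<MTP>']
print(fill_masks(inp, ["forty", "two"]))  # ['The', 'answer', 'is', 'forty', 'two']
```

Because the adaptation lives entirely in the input sequence and one embedding row, the model's internals (attention variant, MoE routing, SSM layers) never need to change.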

For engineering teams, this means the adaptation can be applied to models already in production without rebuilding pipelines.

Generating multiple tokens at the same time can still hurt the accuracy of the response at inference time. To maximize generation speed without sacrificing the quality of the output, the authors introduce an adaptive decoding strategy called ConfAdapt.

ConfAdapt applies a confidence threshold, such as 90%, at each step. The model generates a block of tokens but keeps only those that meet or exceed the threshold. When the upcoming text is highly predictable or structural, the model's confidence is high, so it accepts and outputs a large chunk of tokens at once, saving significant compute on easy spans. The costlier single-token passes are then reserved for harder tokens that genuinely need them.
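The acceptance logic can be sketched as a simple prefix filter. The threshold value and the keep-the-confident-prefix rule below are assumptions for illustration, not the paper's exact algorithm:

```python
# Sketch of ConfAdapt-style block acceptance: keep the leading tokens of
# a predicted block whose confidence clears a threshold, and fall back to
# a standard single-token step at the first uncertain position.

def accept_block(tokens, confidences, threshold=0.9):
    """Return the accepted prefix of a predicted block."""
    accepted = []
    for tok, conf in zip(tokens, confidences):
        if conf < threshold:
            break
        accepted.append(tok)
    # Always make progress: if nothing cleared the bar, accept one
    # token, equivalent to an ordinary next-token step.
    return accepted or tokens[:1]

# Predictable text: the whole block clears the threshold.
print(accept_block(["the", "capital", "of"], [0.99, 0.95, 0.93]))
# Uncertain continuation: only the confident prefix is kept.
print(accept_block(["France", "is", "Paris"], [0.97, 0.6, 0.4]))
```

Raising the threshold trades speed for quality in the direction the article describes: a stricter bar accepts fewer tokens per pass but keeps output closer to pure next-token decoding.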

Putting multi-token prediction to the test

To see how the training paradigm performed in practice, the researchers applied their method to popular open-weight instruction-tuned models. They tested the strong general-purpose model Llama-3.1-8B-Magpie and the smaller, efficient Qwen3-4B-Instruct-2507, which is often chosen for cost-sensitive enterprise deployments. Both models were tuned on MetaMathQA, a dataset of synthetic grade school math problems that rely heavily on reasoning traces.

The experiments revealed a clear sweet spot between speed and accuracy. Using the ConfAdapt strategy, the Llama-3.1-8B model achieved a 3x speedup with less than a 3% drop in accuracy on math benchmarks. The Qwen3-4B model achieved the same 3x speedup with a slightly higher 7% drop in accuracy. More aggressive settings could hit 5x speedups, though they came with steeper accuracy penalties.

How this translates to real-world tasks depends on predictability. "As the ConfAdapt approach naturally tailors the acceleration to the inherent entropy in the domain, when the model 'knows' exactly what comes next it can emit it in a single pass," he noted, leading to massive acceleration on predictable tasks, while using more steps for uncertain outputs.

The speedups also transferred across domains that were not included in the multi-token prediction training phase. This included tasks within the same domain as the training data, like math and reasoning, as well as open-ended tasks such as creative writing and summarization.

Despite this transfer learning, enterprises deploying these models for specialized tasks shouldn't rely on it entirely. "Our recommendation would be to tune/adapt the model for MTP using samples from the special industrial domain," Kirchenbauer said. "The best performance is likely achieved if the MTP adaptation is performed using prompts from the deployment domain."

Serving compatibility and the road ahead

The research team released their trained models on Hugging Face and will soon release the code for their MTP framework. Infrastructure teams integrating these models into vLLM or SGLang will need to account for changes in how batching and KV caching are handled, but that's a one-time engineering investment, not an ongoing burden. Kirchenbauer sees "no clear barriers to integration" and confirmed the team is "working with some systems experts to identify the shortest path to integration."

Kirchenbauer's advice for teams wanting to test the released models: start with toy prompts like counting or repeating a phrase to see ConfAdapt's gains in action, then adapt the model using samples from your specific deployment domain for best results. "Overall we do expect that a production-ready implementation of our approach could simplify the lifecycle of building and deploying low-latency agentic models," Kirchenbauer concluded. "While existing acceleration techniques for NTP models focus almost solely on inference harnesses and logic, our approach just bakes some of the complexity into the model itself making it largely complementary to existing work."


