The tech industry spent the last two years convinced that the only way to make AI faster was to reinvent the computer chip. Startups like Groq and Cerebras raised billions of dollars to build massive, specialized hardware designed to solve the data bottlenecks that slow down models like ChatGPT. The narrative was simple: standard graphics chips from Nvidia were fine for training AI, but they were too slow for the split-second responses needed in the real world. This belief turned the hunt for custom silicon into a digital gold rush.
Xiaomi has just made a strong case that this theory is wrong. On Monday morning, the Chinese electronics giant released a new serving mode for its flagship model, MiMo-V2.5-Pro-UltraSpeed. It did more than break a speed record. It raised the ceiling on what was thought possible with standard, off-the-shelf hardware. The system reached 1,200 tokens per second on a trillion-parameter model. For context, a token is roughly three-quarters of a word, so the model generates about 900 words every second.
Put in perspective, that is roughly 15 to 20 times faster than the versions of GPT and Claude that most people use today. Xiaomi did this using a standard 8-GPU node, the same kind of hardware you can rent from any major cloud provider. This suggests that the secret to the next generation of AI speed is not a better factory for chips. It is a smarter way to use the chips we already have.
To understand why this matters, we have to look at how humans experience AI speed. When you ask ChatGPT or Claude a question, the text usually streams onto the screen at roughly 60 to 80 tokens per second, which is about 45 to 60 words per second and faster than anyone can read. While this feels fast to a person reading a single response, it is far too slow for complex industrial tasks. High-speed AI is the invisible backbone for things like real-time translation, instant fraud detection in banking, and autonomous agents that must make thousands of decisions per minute.
Historically, the fastest speeds came from custom hardware. Cerebras made headlines by hitting nearly 1,000 tokens per second on a Meta model, but that required a chip the size of a dinner plate. Xiaomi reached that same threshold—and then passed it—on a model that is more than twice as large.

| Model | Tokens per Second | Hardware Type |
|---|---:|---|
| MiMo-V2.5-Pro-UltraSpeed | 1,200 | Standard GPUs |
| Gemini Flash | 192 | Google TPU (Custom) |
| Claude Haiku | 98 | Standard Cloud GPUs |
| Claude Opus 4.6 | 71 | Standard Cloud GPUs |
| GPT-5.5 | 68 | Standard Cloud GPUs |

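
The ratios implied by the table can be checked with a few lines of arithmetic. This is a minimal sketch in Python: the throughput figures are the table's own, the 0.75 words-per-token factor is the article's rule of thumb, and the dictionary and function names are made up for illustration.

```python
# Throughput figures copied from the table (tokens per second).
THROUGHPUT_TPS = {
    "MiMo-V2.5-Pro-UltraSpeed": 1200,
    "Gemini Flash": 192,
    "Claude Haiku": 98,
    "Claude Opus 4.6": 71,
    "GPT-5.5": 68,
}

# Rule of thumb from the article, not an exact constant.
WORDS_PER_TOKEN = 0.75


def speedup_over(baseline: str, fast: str = "MiMo-V2.5-Pro-UltraSpeed") -> float:
    """How many times faster `fast` is than `baseline`, by raw tokens per second."""
    return THROUGHPUT_TPS[fast] / THROUGHPUT_TPS[baseline]


def words_per_second(model: str) -> float:
    """Approximate words per second for a model listed in the table."""
    return THROUGHPUT_TPS[model] * WORDS_PER_TOKEN


if __name__ == "__main__":
    for name in THROUGHPUT_TPS:
        print(
            f"{name:<26} {words_per_second(name):>6.0f} words/s   "
            f"MiMo is {speedup_over(name):.1f}x as fast"
        )
```

Raw tokens per second is only one axis of comparison. The models in the table differ in size and quality, so the ratios describe speed and nothing else.
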
Under the hood, Xiaomi used a technique called FP4 quantization on the model's expert layers. To explain this in simple terms, imagine that a model with a trillion parameters is a massive library. To produce each token, the computer has to haul the relevant books out of memory and read them. This takes a lot of memory and time. Quantization is a way of shrinking those books so they take up less space.
Many companies try to shrink the entire library, but this often makes the AI less intelligent and more prone to errors. Xiaomi was surgical. They kept the core logic of the model at high resolution but compressed the specialized expert layers (the specific departments of the library) down to 4-bit precision. Assuming those layers were previously stored at 8 bits, this halves the amount of data the chip has to move for them. The result is a model that keeps its high IQ while its largest layers stream through the computer's memory about twice as fast.
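
The idea of squeezing weights onto a 4-bit grid can be sketched in plain Python. This is an illustration under stated assumptions, not Xiaomi's method: the article does not say which FP4 variant is used, so the sketch assumes the common E2M1 layout (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 plus a sign bit), and the block size of 32 weights per shared scale factor is a hypothetical choice.

```python
import random

# Non-negative magnitudes representable in the E2M1 4-bit float layout.
# Assumption: the article does not specify which FP4 variant Xiaomi uses.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # hypothetical number of weights sharing one scale factor


def quantize_block(weights):
    """Map a block of floats to (scale, signed values on the FP4 grid)."""
    max_abs = max(abs(w) for w in weights)
    # Scale so the largest weight lands on the largest grid value (6.0).
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for w in weights:
        target = abs(w) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if w < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the shared scale and grid values."""
    return [scale * c for c in codes]


def weight_bytes(num_params, bits_per_param):
    """Bytes needed to store the weights, ignoring the per-block scales."""
    return num_params * bits_per_param / 8


if __name__ == "__main__":
    random.seed(0)
    block = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.5f}  worst-case error={worst:.5f}")
    for bits in (16, 8, 4):
        gigabytes = weight_bytes(10**12, bits) / 1e9
        print(f"{bits:2d}-bit: {gigabytes:,.0f} GB per trillion parameters")
```

Because the grid is coarse, each restored weight can be off by up to one sixth of the block's largest magnitude. That rounding error is the accuracy cost the library analogy refers to, and it is why compressing only some layers is the more cautious choice.
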
There is also a second trick called DFlash speculative decoding. In a typical AI conversation, the model is like a writer who has to think about every single word before typing it. Speculative decoding introduces a tireless intern who tries to guess the next few words. The main model then checks the whole guess in a single pass. If the intern is right, the model accepts the whole block of text at once. If the intern is wrong, the model throws away everything from the first mistake onward and writes the correct word itself. Xiaomi’s DFlash is so efficient that it proposes eight tokens at a time and usually gets six of them right. This allows the model to leap forward in chunks rather than crawling one word at a time.
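
The draft-and-verify loop can be shown with toy stand-ins for the two models. This is a sketch of generic greedy speculative decoding, not of DFlash itself, whose internals the article does not describe. Both "models" are hypothetical arithmetic functions, and the draft is rigged to miss every seventh position so that it gets about six of its eight guesses right.

```python
VOCAB = 1000


def target_next(context):
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (context[-1] * 31 + len(context) * 17 + 7) % VOCAB


def draft_next(context):
    """Toy stand-in for the intern: agrees with the target except when
    the context length is a multiple of 7 (a hypothetical error pattern)."""
    guess = target_next(context)
    return (guess + 1) % VOCAB if len(context) % 7 == 0 else guess


def greedy_decode(prompt, num_tokens):
    """Baseline: the target model writes one token per step."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(prompt, num_tokens, draft_len=8):
    """Return (tokens, verification_steps) using draft-then-verify."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < num_tokens:
        # 1. The draft proposes draft_len tokens, one after another.
        proposal, ctx = [], list(out)
        for _ in range(draft_len):
            token = draft_next(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target checks the proposal. A real system scores every
        #    position in one batched pass; this loop only simulates that.
        accepted, ctx = [], list(out)
        for token in proposal:
            correct = target_next(ctx)
            if token != correct:
                accepted.append(correct)  # target's token replaces the miss
                break
            accepted.append(token)
            ctx.append(token)
        else:
            accepted.append(target_next(ctx))  # bonus token: all guesses held
        out.extend(accepted)
        steps += 1
    return out[: len(prompt) + num_tokens], steps


if __name__ == "__main__":
    tokens, steps = speculative_decode([1], 70)
    print("same text as the baseline:", tokens == greedy_decode([1], 70))
    print("tokens per verification step:", 70 / steps)
```

The design point worth noticing is that the output is identical to what the large model would have written alone. The speedup comes from needing fewer passes of the large model, not from accepting lower-quality text.
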
Software efficiency is often about removing the empty spaces in a process. Xiaomi paired their model with a new inference engine called TileRT. In most AI systems, there is a tiny delay every time the software tells the hardware to perform a new calculation. These gaps are measured in microseconds, but they add up when you are performing billions of calculations.
TileRT keeps the entire compute process on the GPU at all times. It eliminates the "start and stop" nature of traditional AI processing. This streamlined approach ensures that the graphics chips are never sitting idle, waiting for the next instruction. This combination of compressed data, educated guessing, and a gapless pipeline is what allows a standard server to perform like a multimillion-dollar custom supercomputer.
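
A back-of-the-envelope model shows why microsecond gaps matter. The numbers below are hypothetical and were picked only to make the arithmetic easy to follow. They are not measurements of TileRT or of any real GPU, and the function names are illustrative.

```python
def step_time_us(num_ops, compute_us_per_op, gap_us_per_op):
    """Time for one decoding step when every operation pays a fixed idle gap."""
    return num_ops * (compute_us_per_op + gap_us_per_op)


def steps_per_second(step_us):
    """Convert a per-step time in microseconds into steps per second."""
    return 1_000_000 / step_us


if __name__ == "__main__":
    # Hypothetical workload: 500 small operations per decoding step,
    # 3 microseconds of useful work each, 5 microseconds of idle gap each.
    with_gaps = step_time_us(num_ops=500, compute_us_per_op=3, gap_us_per_op=5)
    gapless = step_time_us(num_ops=500, compute_us_per_op=3, gap_us_per_op=0)
    print(f"with gaps: {steps_per_second(with_gaps):.0f} steps/s")
    print(f"gapless:   {steps_per_second(gapless):.0f} steps/s")
```

In this made-up case the idle gaps cost more time than the useful work, so removing them more than doubles the step rate. The real ratio depends on the model and the hardware.
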
For the average user, these speed records might seem like abstract corporate competition. However, the impact on consumer tech is tangible. When AI is this fast, it changes from a chatbot you talk to into a tool that works for you in the background.
Consider a real-time language translation app. Current speeds often have a noticeable lag that makes natural conversation difficult. At more than 1,000 tokens per second, an AI could take a full spoken sentence, translate it into three different languages, and check the grammar of all three in about the time it takes you to blink. This removes the awkward pauses from cross-border business meetings and conversations while traveling.
On the market side, this could disrupt the cost of AI. Xiaomi is pricing this UltraSpeed trial at three times its standard rate while providing roughly ten times the output speed. How good a deal that is depends on how the rate is metered. If developers pay for time on the hardware, each dollar buys about three times as much output. If they pay per token, they are paying a premium for speed rather than saving money. Over time, lower costs for developers usually lead to cheaper or more capable apps for the end user.
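
Whether three times the price for ten times the output works out cheaper depends on what the rate is charged against, which the article does not specify. The sketch below compares the two obvious readings. The function name and the "token" and "time" labels are illustrative, not Xiaomi's pricing terms.

```python
def relative_cost_per_token(price_multiplier, speed_multiplier, billed_by):
    """Cost per generated token relative to the standard tier (1.0 = same).

    billed_by="token": the rate is charged per token, so speed is irrelevant.
    billed_by="time":  the rate is charged per unit of serving time, so a
                       faster model spreads the cost over more tokens.
    """
    if billed_by == "token":
        return price_multiplier
    if billed_by == "time":
        return price_multiplier / speed_multiplier
    raise ValueError("billed_by must be 'token' or 'time'")


if __name__ == "__main__":
    for mode in ("token", "time"):
        ratio = relative_cost_per_token(3, 10, mode)
        print(f"billed per {mode}: {ratio:.1f}x the standard cost per token")
```
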
Xiaomi’s success suggests that the hardware shortage of the last few years might have been a software problem in disguise. As companies realize they can get massive performance gains through better software engineering, the pressure to buy the most expensive, specialized chips may begin to fade. We are moving toward a period where the efficiency of the math matters as much as the power of the silicon.
You should expect to see a wave of real-time AI features hitting your devices by the end of this year. These will not be just faster chatbots. Look for features that require the AI to think through dozens of possibilities at once, such as advanced coding assistants that write entire programs in seconds or gaming characters that have unscripted, instant conversations. The bottleneck is no longer how fast the computer can think. It is how fast we can give it something useful to do.
Sources:
Xiaomi MiMo Developer Documentation (April 2026)
Artificial Analysis LLM Leaderboard (June 2026)
TileRT Technical Whitepaper (May 2026)
Cerebras and Groq Performance Benchmarks (2025)


