What is Alibaba's Qwen3-Next-80B-A3B?
Is Open-weight AI catching up to closed-source?
Hey Everyone,
This will be a bit of a stub article. I wanted to acknowledge Qwen’s new release, but I don’t have much to say about it. The big ones I’m waiting for now are Gemini 3 and DeepSeek-R2.
The pace of iteration of open-weight models by Alibaba Cloud’s Qwen division is unusually fast and has been for quite some time. It’s hard to keep up.
On September 11th, 2025 they released a model with a weird name “Qwen3-Next-80B-A3B”.
Try it now: https://chat.qwen.ai
HuggingFace: https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
ModelScope: https://modelscope.cn/collections/Qwen3-Next-c314f23bd0264a
Alibaba Cloud API: https://alibabacloud.com/help/en/model-studio/models#c5414da58bjgj
The reason I mention it is also the Qwen3-Next series are fairly good.
The Case for an Interference Efficiency Bump
These models use a Mixture of Experts (MoE) architecture with a total of 80 billion parameters, but only 3 billion are activated per token during inference, enabling significant efficiency gains: up to 10x faster inference speeds (especially for contexts longer than 32K tokens) and 10x lower training costs compared to dense models like Qwen3-32B, while maintaining or exceeding performance on downstream tasks.
According to their team:
80B params, but only 3B activated per token → 10x cheaper training,
10x faster inference than Qwen3-32B.(esp. @ 32K+ context!)
Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed & recall
Ultra-sparse MoE: 512 experts, 10 routed + 1 shared
Multi-Token Prediction → turbo-charged speculative decoding
Beats Qwen3-32B in perf, rivals Qwen3-235B in reasoning & long-context
Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship.
Qwen3-Next-80B-A3B-Thinking outperforms Gemini-2.5-Flash-Thinking.
Comments
Read comments on LinkedIn here.
Comments on Reddit here.
Comments on X here.
Comments on YouTube here.
Pretraining Efficiency and Inference Speed
It’s a bit hard to figure out which comments are legit though.
Prefill Stage: At 4K context length, throughput is nearly 7x higher than Qwen3-32B. Beyond 32K, it’s over 10x faster.
It feels like the Qwen team is getting a bit better at product-marketing though and there is some tangible progress here.
Decode Stage: At 4K context, throughput is nearly 4x higher. Even beyond 32K, it still maintains over 10x speed advantage.
What I think this signals is the pace of iteration and progress being made by Alibaba as a whole in AI. Qwen iterates faster than Google on new models and improvements. That’s not an easy thing to do, given Google DeepMind’s incredible scope of AI products and LLM progress in the 2024-2025 period.
The Alibaba stock is up 86% in 2025 on U.S. markets, which is a lot even with China stimulus.
One of the researchers at WhatsAI had this to say:
Is it Accessible?
Open-source on Hugging Face (Apache 2.0 license), with easy deployment via vLLM, SGLANG, or APIs from providers like OpenRouter and Together AI.
Community buzz is positive, with rapid adoption for local runs and fine-tuning. A "Thinking" API variant was highlighted for advanced apps like reranking or creative generation.
For evaluation, tools like EvalScope show it handles high-concurrency workloads well (e.g., 32 parallel sequences at 1024-token prompts).
As usual there wasn’t much coverage of this model launch in U.S. media publications! Though some European outlets covered it.
Gated DeltaNet
Linear attention is highly efficient in long context processing, but its recall ability is limited. Standard attention has high computational overhead and low inference efficiency. Using either of them alone has limitations.
To address this, the Qwen team introduced Gated DeltaNet, which outperforms the commonly used sliding window attention and Mamba2 in context learning ability. When adopting a 3:1 hybrid strategy (75% of the layers use Gated DeltaNet and 25% of the layers retain standard attention), it balances performance and efficiency.
Meanwhile, in the retained standard attention layers, they further introduced multiple optimization designs:
1. Continue the output gating mechanism of the previous work to alleviate the low-rank problem in attention;
2. Expand the dimension of a single attention head from 128 to 256;
3. Only add rotational position encoding to the first 25% of the dimensions of the attention head to enhance the long-sequence extrapolation ability. - Source.
There were a few notable mentions on Twitter/X:
Emad Mostaque, co-founder of the UK-based start-up Stability AI, said on X that Alibaba’s new model outperformed “pretty much any model from last year” despite an estimated training cost of less than US$500,000.
What other advantages besides cost could open-weight LLMs have? They could speed up China’s ability to build AI agents and AI products.
Multi-Token Prediction Mechanism
Qwen3-Next introduced the native Multi-Token Prediction (MTP) mechanism. It not only obtained the MTP module with a high acceptance rate of Speculative Decoding but also improved the overall performance of the model backbone.
Ten Times Faster but Ten Times Cheaper
Qwen3-Next uses a uniformly sampled subset of the Qwen3 36T pre-training corpus, containing only 15T tokens.
Artificial Analysis, a leading AI benchmarking firm, said Qwen3-Next-80B-A3B surpassed the latest versions of both DeepSeek R1 and Alibaba-backed start-up Moonshot AI’s Kimi-K2, which we have covered.
So it’s essentially world class and certainly doesn’t feel a year behind the best closed-source models of 2025.
Curiously of all companies, Nvidia wrote about it.
What surprised me the most is how much better and more efficient it was compared to previously models that were just months before by Qwen:
“Qwen3-Next-80B-A3B-Instruct significantly outperforms Qwen3-30B-A3B-Instruct-2507 and Qwen3-32B-Non-thinking, and achieves results nearly matching our flagship Qwen3-235B-A22B-Instruct-2507.”
I found this somewhat perplexity: (this is why I mention this model at all)
Because there is so little serious coverage and real analysis, it’s exceptionally hard to find info on these models that isn’t generic or sounds like PR copypasta. There are only a few dozen people who even talk about the more technical aspects.
See Benchmarks and Specs on HuggingFace.
Performance
So we’re looking at: “Qwen3-Next-80B-A3B” (far-right) compared to others:
It’s clearly optimized for Agentic AI. That’s the main value prop for winning in Open-weight LLMs. To be continued….














