    Technology

    Alibaba’s new open source Qwen3-235B-A22B-2507 beats Kimi-2 and offers low compute version

By TechAiVerse · July 27, 2025 · 9 Mins Read

    Chinese e-commerce giant Alibaba has made waves globally in the tech and business communities with its family of “Qwen” gen AI large language models (LLMs), beginning with the launch of the Tongyi Qianwen chatbot in April 2023, through the release of Qwen 3 in April 2025.

    Not only are its models powerful, scoring high on third-party benchmark tests for math, science, reasoning and writing tasks; for the most part, they've also been released under permissive open-source licensing terms, allowing organizations and enterprises to download, customize, run and generally use them for a variety of purposes, including commercial ones. Think of them as an alternative to DeepSeek.

    This week, Alibaba’s Qwen Team released the latest updates to its Qwen family, and they’re already attracting attention from AI power users in the West for their top performance. In one case, they edged out the new Kimi-2 model from rival Chinese AI startup Moonshot, released in mid-July 2025.




    The new Qwen3-235B-A22B-2507-Instruct model — released on AI code sharing community Hugging Face alongside a “floating point 8” or FP8 version, which we’ll cover more in-depth below — improves on the original Qwen 3 in reasoning tasks, factual accuracy and multilingual understanding. It also outperforms Claude Opus 4’s “non-thinking” version.

    The new Qwen3 model update also delivers better coding results, alignment with user preferences and long-context handling, according to its creators. But that’s not all…

    Read on for what it offers enterprise users and technical decision-makers.

    FP8 version lets enterprises run Qwen 3 with far less memory and compute

    The “FP8” version’s 8-bit floating point compresses the model’s numerical operations to use less memory and processing power — without noticeably affecting its performance.

    In practice, this means organizations can run a model with Qwen3’s capabilities on smaller, less expensive hardware, or more efficiently in the cloud. The result is faster response times, lower energy costs and the ability to scale deployments without needing massive infrastructure.

    This makes the FP8 model especially attractive for production environments with tight latency or cost constraints. Teams can scale Qwen3’s capabilities to single-node GPU instances or local development machines, avoiding the need for massive multi-GPU clusters. It also lowers the barrier to private fine-tuning and on-premises deployments, where infrastructure resources are finite and total cost of ownership matters.
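The intuition behind those savings is simple arithmetic: weight storage scales with bytes per parameter, so halving the precision roughly halves the raw weight footprint. A back-of-the-envelope sketch (the parameter count comes from the model name; real deployments add KV cache, activations and framework overhead on top of the raw weights):

```python
# Rough weight-memory estimate for a 235B-parameter model at two precisions.
# Actual GPU memory use is higher than this: KV cache, activations, and
# framework overhead all add to the raw weight footprint.

GIB = 1024**3

def weight_footprint_gib(n_params: float, bytes_per_param: float) -> float:
    """Return raw weight storage in GiB for a given precision."""
    return n_params * bytes_per_param / GIB

N_PARAMS = 235e9  # total parameters in Qwen3-235B-A22B

bf16 = weight_footprint_gib(N_PARAMS, 2)  # BF16: 2 bytes per parameter
fp8 = weight_footprint_gib(N_PARAMS, 1)   # FP8: 1 byte per parameter

print(f"BF16 weights: ~{bf16:.0f} GiB")  # roughly 438 GiB
print(f"FP8 weights:  ~{fp8:.0f} GiB")   # roughly 219 GiB
```

The numbers line up with the checkpoint sizes cited below (~500 GB for BF16, "well over 200 GB" for FP8) once extra tensors and GB-vs-GiB rounding are accounted for.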

    Even though Qwen’s team didn’t release official calculations, comparisons to similar FP8 quantized deployments suggest the efficiency savings are substantial. Here’s a practical illustration (updated and corrected on 07/23/2025 at 4:04 pm ET — this piece originally included an inaccurate chart based on a miscalculation. I apologize for the errors and thank readers for contacting me about them.):

    • GPU memory use*: BF16 / BF16-equivalent build — ≈ 640 GB total (8 × H100-80 GB, TP-8). FP8 quantized build — ≈ 320 GB total on the recommended 4 × H100-80 GB (TP-4); lowest-footprint community run: ~143 GB across 2 × H100 with Ollama off-loading.
    • Single-query inference speed†: BF16 — ~74 tokens/s (batch = 1, context = 2K, 8 × H20-96 GB, TP-8). FP8 — ~72 tokens/s (same settings, 4 × H20-96 GB, TP-4).
    • Power / energy: BF16 — a full node of eight H100s draws ~4–4.5 kW under load (550–600 W per card, plus host)‡. FP8 — needs half the cards and moves half the data; Nvidia’s Hopper FP8 case studies report ≈ 35–40% lower TCO and energy at comparable throughput.
    • GPUs needed (practical): BF16 — 8 × H100-80 GB (TP-8), or 8 × A100-80 GB for parity. FP8 — 4 × H100-80 GB (TP-4); 2 × H100 is possible with aggressive off-loading, at the cost of latency.

    *Disk footprint for the checkpoints: BF16 weights are ~500 GB; the FP8 checkpoint is “well over 200 GB,” so the absolute memory savings on GPU come mostly from needing fewer cards, not from weights alone.

    †Speed figures are from the Qwen3 official SGLang benchmarks (batch 1). Throughput scales almost linearly with batch size: Baseten measured ~45 tokens/s per user at batch 32 and ~1.4 k tokens/s aggregate on the same four-GPU FP8 setup.

    ‡No vendor supplies exact wall-power numbers for Qwen, so we approximate using H100 board specs and NVIDIA Hopper FP8 energy-saving data.

    No more ‘hybrid reasoning’… instead, Qwen will release separate reasoning and instruct models

    Perhaps most interesting, Qwen announced it will no longer be pursuing a “hybrid” reasoning approach, which it introduced with Qwen 3 in April. It seemed to be inspired by an approach pioneered by sovereign AI collective Nous Research.

    This allowed users to toggle on a “reasoning” mode, letting the AI model engage in its own self-checking and produce chains-of-thought (CoT) before responding.

    In a way, it was designed to mimic the reasoning capabilities of powerful proprietary models such as OpenAI’s “o” series (o1, o3, o4-mini, o4-mini-high), which also produce “chains-of-thought.”

    However, unlike those rival models which always engage in such “reasoning” for every prompt, Qwen 3 can have the reasoning mode manually switched on or off with a “Thinking Mode” button on the Qwen website chatbot. Or, users can type “/think” before their prompt on a local or privately run model inference.

    The idea was to give users control to engage the slower and more token-intensive thinking mode for more difficult prompts and tasks, and use a non-thinking mode for simpler prompts. But again, this put the onus on the user to decide. While flexible, it also introduced design complexity and inconsistent behavior in some cases.
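At the prompt level, the local toggle is just a text convention: prepending "/think" to the user message, as described above. A hypothetical helper sketching that pattern (the message structure is the generic chat schema, not an official Qwen API):

```python
# Sketch of the prompt-level "/think" toggle for locally run Qwen 3 inference.
# The "/think" prefix is the convention mentioned in the text; the dict shape
# is the generic chat-message format, not an official Qwen API.

def build_user_message(prompt: str, thinking: bool = False) -> dict:
    """Wrap a prompt as a chat message, optionally enabling reasoning mode."""
    content = f"/think {prompt}" if thinking else prompt
    return {"role": "user", "content": content}

# Simple prompt: fast, non-thinking mode is enough.
quick = build_user_message("Translate 'hello' to French.")

# Hard prompt: opt in to the slower, token-intensive reasoning mode.
hard = build_user_message("Prove that sqrt(2) is irrational.", thinking=True)

print(quick["content"])
print(hard["content"])
```

This per-request choice is exactly the burden the 2507 update removes by splitting instruct and thinking variants into separate models.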

    As Qwen wrote on X:

    “After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible.”

    With the 2507 update — an instruct or non-reasoning model, for now — Alibaba is no longer straddling both approaches in a single model. Instead, separate model variants will be trained for instruction and reasoning tasks, respectively.

    The result is a model that adheres more closely to user instructions, generates more predictable responses and, as benchmark data shows, improves significantly across multiple evaluation domains.

    Performance benchmarks and use cases

    Compared to its predecessor, the Qwen3-235B-A22B-Instruct-2507 model delivers measurable improvements:

    • MMLU-Pro scores rise from 75.2 to 83.0, a notable gain in general knowledge performance.
    • GPQA and SuperGPQA benchmarks improve by 15–20 percentage points, reflecting stronger factual accuracy.
    • Reasoning tasks such as AIME25 and ARC-AGI show more than double the previous performance.
    • Code generation improves, with LiveCodeBench scores increasing from 32.9 to 51.8.
    • Multilingual support expands, aided by improved coverage of long-tail languages and better alignment across dialects.

    The model maintains a mixture-of-experts (MoE) architecture, activating 8 out of 128 experts during inference, with a total of 235 billion parameters — 22 billion of which are active at any time.
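The economics of that MoE design fall out of the ratios: only a small fraction of experts, and therefore of parameters, fire on any given token. A quick sketch of the arithmetic, using the figures above:

```python
# MoE activation ratios for Qwen3-235B-A22B, from the figures in the text:
# 8 of 128 experts and 22B of 235B parameters are active per token.

TOTAL_EXPERTS = 128
ACTIVE_EXPERTS = 8
TOTAL_PARAMS = 235e9
ACTIVE_PARAMS = 22e9

expert_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS  # 6.25% of experts fire
param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS     # ~9.4% of parameters fire

# Per-token compute tracks the active parameters, so the model does roughly
# 22B-parameter-class work per token while storing 235B parameters' worth
# of knowledge.
print(f"Experts active per token:    {expert_fraction:.2%}")
print(f"Parameters active per token: {param_fraction:.2%}")
```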

    As mentioned, the FP8 version introduces fine-grained quantization for better inference speed and reduced memory usage.
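"Fine-grained" here means scales are computed per small block of weights rather than once per tensor, so a single outlier can't blow up the precision of everything around it. A toy illustration of the block-wise idea — using int8 as a stand-in, since Python has no native FP8 type, and this is the general technique, not Qwen's actual kernels:

```python
# Toy block-wise (fine-grained) quantization: one scale per small block of
# weights instead of one scale for the whole tensor. int8 stands in for FP8;
# this illustrates the concept, not Qwen's actual quantization kernels.

def quantize_blockwise(weights: list[float], block: int = 4):
    """Quantize to int8 codes with a separate scale per block."""
    codes, scales = [], []
    for i in range(0, len(weights), block):
        chunk = weights[i:i + block]
        scale = max(abs(w) for w in chunk) / 127 or 1.0  # avoid zero scale
        scales.append(scale)
        codes.extend(round(w / scale) for w in chunk)
    return codes, scales

def dequantize_blockwise(codes: list[int], scales: list[float], block: int = 4):
    """Reconstruct approximate weights from codes and per-block scales."""
    return [c * scales[i // block] for i, c in enumerate(codes)]

# Small weights in the first block, large outliers in the second: per-block
# scales keep the small weights precise despite the outliers next door.
w = [0.01, -0.02, 0.015, 0.03, 5.0, -4.2, 3.3, 0.9]
codes, scales = quantize_blockwise(w)
restored = dequantize_blockwise(codes, scales)
max_err = max(abs(a - b) for a, b in zip(w, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

With one scale per tensor instead, the 5.0 outlier would force a coarse step size onto the 0.01-scale weights and wipe them out; per-block scales are what make aggressive low-bit formats usable.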

    Enterprise-ready by design

    Unlike many open-source LLMs, which are often released under restrictive research-only licenses or require API access for commercial use, Qwen3 is squarely aimed at enterprise deployment.

    Because it carries a permissive Apache 2.0 license, enterprises can use it freely for commercial applications. They may also:

    • Deploy models locally or through OpenAI-compatible APIs using vLLM and SGLang;
    • Fine-tune models privately using LoRA or QLoRA without exposing proprietary data;
    • Log and inspect all prompts and outputs on-premises for compliance and auditing;
    • Scale from prototype to production using dense variants (from 0.6B to 32B) or MoE checkpoints.
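Because vLLM and SGLang expose OpenAI-compatible endpoints, moving an existing client onto a self-hosted Qwen3 is mostly a matter of pointing the base URL at the local server. A sketch of the request payload (the localhost port is an assumption for illustration; the model id is the Hugging Face repo name, and any OpenAI-style client would POST this body to the chat-completions route):

```python
# Sketch of an OpenAI-compatible chat request for a locally served Qwen3.
# The base URL/port is an assumption; vLLM and SGLang both accept this
# request shape at their /v1/chat/completions endpoint.
import json

BASE_URL = "http://localhost:8000/v1"  # assumed local vLLM/SGLang server
MODEL = "Qwen/Qwen3-235B-A22B-Instruct-2507"  # Hugging Face model id

def chat_request(prompt: str, temperature: float = 0.7) -> dict:
    """Build the JSON body an OpenAI-style client would POST."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

body = chat_request("Summarize the Apache 2.0 license in one sentence.")
print(json.dumps(body, indent=2))
# To send: POST {BASE_URL}/chat/completions with this JSON body.
```

Because the wire format matches OpenAI's, swapping a proprietary backend for a private Qwen3 deployment doesn't require rewriting application code — only reconfiguring the client's base URL and model name.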

    Alibaba’s team also introduced Qwen-Agent, a lightweight framework that abstracts tool invocation logic for users building agentic systems.

    Benchmarks like TAU-Retail and BFCL-v3 suggest the instruction model can competently execute multi-step decision tasks — typically the domain of purpose-built agents.

    Community and industry reactions

    The release has already been well received by AI power users.

    Paul Couvert, AI educator and founder of private LLM chatbot host Blue Shell AI, posted on X a comparison chart showing Qwen3-235B-A22B-Instruct-2507 outperforming Claude Opus 4 and Kimi K2 on benchmarks like GPQA, AIME25 and Arena-Hard v2, calling it “even more powerful than Kimi K2… and even better than Claude Opus 4.”

    AI influencer NIK (@ns123abc) commented on its rapid impact: “Qwen-3-235B made Kimi K2 irrelevant after only one week, despite being one quarter the size, and you’re laughing.”

    Meanwhile, Jeff Boudier, head of product at Hugging Face, highlighted the deployment benefits: “Qwen silently released a massive improvement to Qwen3… it tops best open (Kimi K2, a 4x larger model) and closed (Claude Opus 4) LLMs on benchmarks.”

    He praised the availability of an FP8 checkpoint for faster inference, 1-click deployment on Azure ML and support for local use via MLX on Mac or INT4 builds from Intel.

    The overall tone from developers has been enthusiastic, as the model’s balance of performance, licensing and deployability appeals to both hobbyists and professionals.

    What’s next for the Qwen team?

    Alibaba is already laying the groundwork for future updates. A separate reasoning-focused model is in the pipeline, and the Qwen roadmap points toward increasingly agentic systems capable of long-horizon task planning.

    Multimodal support, as seen in Qwen2.5-Omni and Qwen-VL models, is also expected to expand further.

    And already, rumors and rumblings have begun as Qwen team members tease yet another update to their model family, with their web properties revealing URL strings for a new Qwen3-Coder-480B-A35B-Instruct model, likely a 480-billion parameter MoE with a token context of 1 million.

    What Qwen3-235B-A22B-Instruct-2507 ultimately signals is not just another leap in benchmark performance, but a maturation of open models as viable alternatives to proprietary systems.

    The flexibility of deployment, strong general performance and enterprise-friendly licensing give the model a unique edge in a crowded field.

    For teams looking to integrate advanced instruction-following models into their AI stack — without the limitations of vendor lock-in or usage-based fees — Qwen3 is a serious contender.

