Qwen3.5-Omni: Alibaba Surpasses Gemini-3.1 Pro on 215 Audio-Video Tasks
Launch Context
March 30, 2026—Alibaba Cloud’s Tongyi (通义) team releases Qwen3.5-Omni.
This is the first genuinely unified full-modality model in the Tongyi lineup: rather than bolting together separate models for speech recognition, image understanding, and video analysis, it unifies all modality representation spaces at the architectural level from the start.
Core Numbers
On a benchmark set of 215 audio-video understanding, recognition, and interaction tasks:
| Model | Tasks Evaluated | Avg. Delta (percentage points) |
|---|---|---|
| Gemini-3.1 Pro | 215 | baseline |
| Qwen3.5-Omni | 215 | +2.3 |
Here, “surpassing Gemini-3.1 Pro” means that Qwen3.5-Omni scores, on average, 2.3 percentage points higher across all 215 tasks.
This is not a single cherry-picked benchmark; it is a large-scale, systematic evaluation.
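For concreteness, the average delta is simply the unweighted mean of per-task score differences. The sketch below uses made-up numbers for three tasks purely to illustrate the arithmetic; none of these are reported scores.

```python
# Hypothetical per-task scores (illustration only, not reported results).
baseline_scores = {"task_001": 71.2, "task_002": 64.8, "task_003": 88.0}   # e.g. Gemini-3.1 Pro
candidate_scores = {"task_001": 74.0, "task_002": 66.5, "task_003": 90.1}  # e.g. Qwen3.5-Omni

# Per-task delta in percentage points, then the unweighted mean over all tasks.
deltas = [candidate_scores[t] - baseline_scores[t] for t in baseline_scores]
avg_delta_pp = sum(deltas) / len(deltas)
print(f"Average delta: {avg_delta_pp:+.1f} pp over {len(deltas)} tasks")
```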
Significance of Full-Modality Unification
Previous multimodal models were mostly “text model + vision module” patchwork. Qwen3.5-Omni’s architecture instead places text, images, audio, and video in the same embedding space from the initial design (a minimal sketch of the idea follows the list below).
Benefits:
- No information loss during modality switching
- More natural cross-modal reasoning (e.g., “generate background music matching the conversational tone in this image”)
- Lower inference latency than patchwork approaches
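The sketch below shows the shared-embedding idea in PyTorch: each modality is projected into one common token space so that a single backbone can attend across all of them. The layer names, feature dimensions, and fusion step are illustrative assumptions, not Qwen3.5-Omni’s actual (undisclosed) design.

```python
import torch
import torch.nn as nn

class UnifiedMultimodalEncoder(nn.Module):
    """Toy illustration: per-modality projections map into one shared embedding
    space, so downstream layers see a single token sequence regardless of modality."""

    def __init__(self, d_model: int = 1024):
        super().__init__()
        # Hypothetical per-modality projections; real models use full encoders here.
        self.text_proj = nn.Linear(768, d_model)    # from a text embedding table
        self.image_proj = nn.Linear(1152, d_model)  # from a vision encoder's patch features
        self.audio_proj = nn.Linear(512, d_model)   # from an audio feature extractor
        self.video_proj = nn.Linear(1152, d_model)  # video frames reuse image-like features

    def forward(self, text, image, audio, video):
        # Each input: (batch, seq_len_modality, feature_dim). After projection,
        # all tokens share the same width and can be concatenated into one sequence.
        return torch.cat([
            self.text_proj(text),
            self.image_proj(image),
            self.audio_proj(audio),
            self.video_proj(video),
        ], dim=1)

enc = UnifiedMultimodalEncoder()
fused = enc(
    torch.randn(1, 16, 768),    # 16 text tokens
    torch.randn(1, 256, 1152),  # 256 image patches
    torch.randn(1, 100, 512),   # 100 audio frames
    torch.randn(1, 64, 1152),   # 64 video patch tokens
)
print(fused.shape)  # torch.Size([1, 436, 1024]) — one unified sequence for the backbone
```

Because the backbone attends over one fused sequence, a prompt such as “generate background music matching the conversational tone in this image” can reason over image and text tokens jointly, without handing data between separate models.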
Open Source Strategy
Qwen3.5-Omni will open-source its model weights, but whether the release will be fully or only partially open had not been disclosed at the time of writing.
Based on the Qwen series’ past pattern, expect an open-source base version plus an API-access version.
Practical Impact for Developers
Among domestic (Chinese) multimodal models, Qwen3.5-Omni currently posts the most impressive benchmark numbers. If it is open-sourced, it becomes a more accessible choice than Gemini for domestic developers: latency, compliance requirements, and documentation are all easier to work with.
API pricing has not yet been announced, but based on Alibaba’s consistent pricing strategy, expect it to be cheaper than comparable OpenAI or Anthropic services.
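Current Qwen models are served through DashScope’s OpenAI-compatible mode, so if Qwen3.5-Omni follows the same pattern, a request might look like the sketch below. The model identifier and the multimodal payload shown are assumptions; nothing has been published for this model yet.

```python
from openai import OpenAI

# Assumption: Qwen3.5-Omni is exposed via DashScope's OpenAI-compatible endpoint,
# as existing Qwen models are. The model name below is hypothetical.
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen3.5-omni",  # hypothetical identifier, not yet announced
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Describe the conversational tone in this image."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```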