Qwen3.5-Omni: Alibaba Surpasses Gemini-3.1 Pro on 215 Audio-Video Tasks

Launch Context

March 30, 2026—Alibaba Cloud’s Tongyi (通义) team releases Qwen3.5-Omni.

This is the first genuinely full-modality unified model in the Tongyi lineup: rather than bolting together separate models for speech recognition, image understanding, and video analysis, it unifies the representation spaces of all modalities at the architectural level from the start.

Core Numbers

On a benchmark set of 215 audio-video understanding, recognition, and interaction tasks:

Model            Tasks   Avg. delta vs. baseline
Gemini-3.1 Pro   215     baseline
Qwen3.5-Omni     215     +2.3 pts

“Surpassing Gemini-3.1 Pro” means Qwen3.5-Omni scores 2.3 percentage points higher on average across all 215 tasks.

This is not a single cherry-picked benchmark but a large-scale, systematic evaluation.

Significance of Full-Modality Unification

Previous multimodal models were mostly "text model + vision module" patchwork. Qwen3.5-Omni's architecture instead places text, images, audio, and video into the same embedding space from the initial design.

Benefits:

  • No information loss during modality switching
  • More natural cross-modal reasoning (e.g., “generate background music matching the conversational tone in this image”)
  • Lower inference latency than patchwork approaches
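The shared-embedding idea behind these benefits can be sketched conceptually. The code below is an illustrative toy, not Alibaba's actual architecture: random projection matrices stand in for learned per-modality encoders, and all names and dimensions are assumptions. The point it demonstrates is that once every modality lands in one common vector space, cross-modal comparison needs no conversion step.

```python
import numpy as np

DIM = 512  # shared embedding dimension (illustrative choice)

rng = np.random.default_rng(0)

# Modality-specific projections standing in for learned encoders.
# In a real unified model these are deep networks trained jointly so
# related content lands nearby regardless of its original modality.
PROJ = {
    "text":  rng.standard_normal((768, DIM)),
    "image": rng.standard_normal((1024, DIM)),
    "audio": rng.standard_normal((256, DIM)),
}

def embed(modality: str, features: np.ndarray) -> np.ndarray:
    """Project raw modality features into the shared space, L2-normalized."""
    v = features @ PROJ[modality]
    return v / np.linalg.norm(v)

def cross_modal_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity works uniformly because both vectors live in one space."""
    return float(a @ b)

text_vec = embed("text", rng.standard_normal(768))
image_vec = embed("image", rng.standard_normal(1024))
audio_vec = embed("audio", rng.standard_normal(256))

# All embeddings share one dimensionality, so any pair is directly comparable
# without a lossy modality-switching step in between.
assert text_vec.shape == image_vec.shape == audio_vec.shape == (DIM,)
print(cross_modal_similarity(text_vec, audio_vec))
```

In a patchwork system, comparing an image to an audio clip would require routing both through text descriptions first; in a unified space the comparison is a single dot product, which is also where the latency advantage comes from.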

Open Source Strategy

Qwen3.5-Omni's model weights will be open-sourced, but whether the release will be fully or only partially open had not been disclosed at the time of writing.

Based on Qwen series patterns: expect an open-source base version + API access version.

Practical Impact for Developers

Among domestic (Chinese) multimodal models, Qwen3.5-Omni currently posts the most impressive benchmark numbers. If open-sourced, it would be a more accessible choice than Gemini for domestic developers, with lower latency, easier compliance, and more familiar documentation.

API pricing has not yet been announced, but based on Alibaba's consistent strategy, expect it to be cheaper than equivalent OpenAI or Anthropic offerings.