
Microsoft's unified multimodal model integrates speech, vision, and text processing in a single 5.6B-parameter architecture. It accepts text, image, and audio inputs and generates text outputs, with a 128K-token context window. Its compact size targets edge and mobile deployment.
Released: February 27, 2025
Parameters: 5.6B
Context: 128K
Pricing: Open Source
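
To give a rough sense of what a 5.6B-parameter model means for edge and mobile deployment, the sketch below estimates the storage needed for the weights alone at a few common numeric precisions. The precision list is an assumption made for illustration and does not come from the listing. The figures also ignore activations, the KV cache for the 128K context, and runtime overhead, so real memory use will be higher.

```python
# Back-of-the-envelope weight-memory estimate; not a measurement of the actual model.
PARAMS = 5.6e9  # parameter count stated in the listing

# Assumed weight formats and their size in bytes per parameter (illustrative only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


estimates = {name: weight_memory_gb(PARAMS, size) for name, size in BYTES_PER_PARAM.items()}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB")
```

Under these assumptions, 16-bit weights come to roughly 11.2 GB and 4-bit weights to roughly 2.8 GB, which is why quantization typically matters when targeting phones or other memory-constrained devices.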