Xiaomi has developed a 4.7 billion-parameter AI model that combines visual perception, speech, and control for robots.
Xiaomi Enters the Robotics Market
The Chinese smartphone and smart-home giant Xiaomi has taken a new step: developing its own artificial-intelligence model for robots. The company introduced Xiaomi‑Robotics‑0, an open-source system that combines visual recognition, language understanding, and real-time action control. The model has 4.7 billion parameters and has already set several records both in simulation and on real hardware.
How the Model Works
A robot typically follows a “perception → decision → action” cycle. Xiaomi‑Robotics‑0 balances broad situational awareness with precise motor control thanks to its Mixture‑of‑Transformers (MoT) architecture.
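The cycle above can be sketched as a simple control loop. This is a minimal illustration with made-up function names, not Xiaomi's actual API:

```python
# Hypothetical sketch of the perception -> decision -> action cycle.
# All function names and data shapes here are illustrative.

def perceive(camera_frame):
    """Extract a summary of the current observation (stubbed)."""
    return {"objects": ["towel"], "frame": camera_frame}

def decide(perception, instruction):
    """Map the perception plus a language instruction to a plan."""
    return [f"grasp {obj}" for obj in perception["objects"] if obj in instruction]

def act(plan):
    """Turn the plan into motor commands (stubbed)."""
    return [f"motor:{step}" for step in plan]

# One pass through the cycle
commands = act(decide(perceive("frame_001"), "please fold the towel"))
```

In a real system each stage is a learned model; the point of the MoT design is that perception and control are handled by different experts inside one model rather than by separate pipelines.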
1. Visual‑Language Model (VLM) – the system’s “brain.”
* Trained to interpret commands, even vague ones (“please fold the towel”).
* Understands spatial relationships based on high‑quality images.
* Tasks: object detection, answering visual questions, and logical reasoning.
2. Action Expert – a motion generator.
* Built on a diffusion transformer (DiT).
* Rather than generating one action at a time, it constructs a whole sequence of actions through flow matching, ensuring smoothness and accuracy.
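The two-expert split can be sketched as follows: a stand-in "VLM" condenses the instruction into a conditioning target, and a flow-matching "action expert" moves a noisy action sample along a straight-line velocity field toward it. Everything here is a toy assumption, not Xiaomi's code:

```python
# Toy sketch: VLM -> conditioning target, DiT -> flow-matching denoiser.
import random

def vlm_features(instruction):
    # Stand-in for the VLM: deterministically map the instruction
    # to a 3-dimensional target action.
    random.seed(instruction)
    return [random.uniform(-1, 1) for _ in range(3)]

def action_expert(target, steps=10):
    # Flow matching, Euler-integrated: start from a noisy sample x0 and
    # follow the straight-line velocity v = target - x0 for t in [0, 1].
    random.seed(0)
    x0 = [random.uniform(-1, 1) for _ in range(3)]
    x = list(x0)
    dt = 1.0 / steps
    for _ in range(steps):
        # In the real model a network predicts v; here we use the exact field.
        v = [ti - x0i for ti, x0i in zip(target, x0)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

target = vlm_features("fold the towel")
chunk = action_expert(target)  # ends (up to float error) at the target
```

The smoothness claim corresponds to the integration: the action trajectory is a continuous path from noise to the target, not a sequence of independent samples.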
Training Without Losing Comprehension
Typical VLMs lose some perception skills when trained on physical tasks. Xiaomi addressed this by simultaneously training the model on multimodal data (images + text) and action data. The training process consists of several stages:
1. Action Proposal – the VLM learns to predict distributions over possible actions directly from images, aligning its internal representations with real-world operations.
2. After that, the VLM's weights are frozen, and the DiT is trained separately to generate precise action sequences from noise, conditioning on the VLM's key features rather than on language tokens.
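The two-stage schedule above amounts to changing which parameter groups receive gradient updates. A minimal sketch, with assumed group names:

```python
# Illustrative two-stage freeze schedule (group names are assumptions):
# stage 1 trains everything jointly on multimodal + action data;
# stage 2 freezes the VLM and trains only the DiT action expert.

def updated_groups(params, frozen):
    """Return the parameter groups that would receive gradient updates."""
    return [name for name in params if name not in frozen]

params = ["vlm", "action_head", "dit"]

stage1 = updated_groups(params, frozen=set())      # joint training
stage2 = updated_groups(params, frozen={"vlm"})    # VLM frozen, DiT trained
```

Freezing the VLM in stage 2 is what prevents action training from overwriting the perception and language skills learned in stage 1.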
Minimizing Latency
To eliminate pauses between model predictions and actual robot movements, asynchronous output is used: AI computations and robot actions are separated. This allows robots to move continuously even when additional calculations are needed.
* Clean Action Prefix – a method that reuses previously predicted actions at the start of each new chunk, ensuring smooth transitions without jerks.
* Attention masking focuses on the current visual stream, ignoring past states, making the robot more responsive to sudden environmental changes.
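The asynchronous scheme can be sketched as a queue the robot keeps draining while the model (conceptually in parallel) prepares the next chunk. The overlap between chunks stands in for the clean action prefix; all names and sizes are illustrative:

```python
# Minimal sketch of asynchronous chunked execution (illustrative names).
from collections import deque

def predict_chunk(start, size=4):
    # Stand-in for a model call: actions are just increasing step ids.
    return [f"a{start + i}" for i in range(size)]

queue = deque(predict_chunk(0))
executed = []
overlap = 2  # how many already-predicted actions the next chunk reuses

while len(executed) < 8:
    executed.append(queue.popleft())   # robot keeps moving every tick
    if len(queue) <= overlap:          # refill before the queue drains
        # Clean action prefix: carry over the remaining predicted actions
        # so consecutive chunks overlap instead of restarting from scratch.
        prefix = list(queue)
        queue = deque(prefix + predict_chunk(len(executed) + len(prefix)))
```

Because the refill happens while actions still remain in the queue, execution never stalls waiting on the model, which is the latency property the article describes.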
Results
In simulation environments LIBERO, CALVIN, and SimplerEnv, Xiaomi‑Robotics‑0 outperformed about 30 competitors. On a real two‑arm robot, the model successfully handled complex tasks such as towel folding and disassembling a construction set. The robot demonstrated stable hand–eye coordination, manipulating objects effectively across various scenarios.
Thus, Xiaomi not only expanded its product portfolio but also laid the groundwork for further research in the field of “physical intelligence” for robots.