Ai2 Introduces MolmoAct: A Groundbreaking AI Model Designed for 3D Spatial Reasoning


The Allen Institute for AI (Ai2) has unveiled MolmoAct 7B, a groundbreaking advancement in embodied artificial intelligence that bridges the gap between cutting-edge AI models and real-world applications. This first-of-its-kind model combines spatial planning and visual reasoning to enable safer, more adaptable robot control. Unlike traditional systems that rely on language-to-movement translation, MolmoAct processes its environment visually, comprehends spatial relationships, and plans movements accordingly. By generating visual reasoning tokens, it transforms 2D image inputs into actionable 3D spatial plans, empowering robots to navigate physical spaces with unprecedented intelligence and precision.

While spatial reasoning has been explored in robotics before, most existing systems depend on closed, end-to-end architectures trained on proprietary datasets. These models are often costly to scale, difficult to reproduce, and operate as “black boxes,” making them challenging to interpret or adapt. In contrast, MolmoAct takes a radically different approach. It is trained entirely on open data, designed for transparency, and optimized for real-world generalization. Its step-by-step reasoning process allows users to preview a robot’s planned actions and make intuitive adjustments in real time, ensuring adaptability in dynamic environments.

“Embodied AI requires a new foundation rooted in reasoning, transparency, and openness,” said Ali Farhadi, CEO of Ai2. “With MolmoAct, we’re not just releasing a model; we’re paving the way for a new era of AI that brings advanced reasoning capabilities into the physical world. This is a significant step toward creating AI systems that can reason and interact with their surroundings in ways that align with human cognition—safely and effectively.”

A New Paradigm: Action Reasoning Models (ARMs)

MolmoAct represents the debut of a novel class of AI models known as Action Reasoning Models (ARMs). These models interpret high-level natural language instructions and break them down into a sequence of logical, spatially grounded actions. Traditional robotics models typically treat tasks as singular, opaque processes. ARMs, however, deconstruct complex instructions into transparent, step-by-step decision-making chains:

  1. 3D-Aware Perception: Grounding the robot’s understanding of its environment using depth and spatial context.
  2. Visual Waypoint Planning: Outlining a detailed task trajectory within the image space.
  3. Action Decoding: Converting the plan into precise, robot-specific control commands.

This layered reasoning enables MolmoAct to execute intricate tasks like sorting a pile of trash by interpreting the command as a structured series of sub-tasks: recognizing objects, grouping them by category, grasping them individually, and repeating the process until completion.
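The sketch below, in plain Python, is a purely illustrative rendering of that three-stage flow. The class and function names are hypothetical stand-ins rather than Ai2's actual interfaces, and each stage is stubbed so the example stays self-contained.

```python
# Minimal sketch of the three-stage Action Reasoning Model (ARM) flow described
# above. All names here are hypothetical placeholders, not Ai2's API; the point
# is only to show an instruction decomposed into perception, waypoint planning,
# and action decoding.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpatialScene:
    """Stage 1 output: objects grounded with estimated 3D positions."""
    objects: List[Tuple[str, Tuple[float, float, float]]]  # (label, xyz in metres)


@dataclass
class Waypoint:
    """Stage 2 output: a visual waypoint in image space (pixel coordinates)."""
    u: int
    v: int
    note: str


def perceive(image_description: str) -> SpatialScene:
    # Stage 1: 3D-aware perception (stubbed with fixed objects).
    return SpatialScene(objects=[("bottle", (0.40, 0.10, 0.02)), ("can", (0.55, -0.20, 0.03))])


def plan_waypoints(scene: SpatialScene, instruction: str) -> List[Waypoint]:
    # Stage 2: outline a task trajectory in the image plane (stubbed).
    return [Waypoint(320, 240, f"reach toward {label}") for label, _ in scene.objects]


def decode_actions(waypoints: List[Waypoint]) -> List[str]:
    # Stage 3: convert the visual plan into robot-specific control commands (stubbed).
    return [f"move_arm(px={w.u}, py={w.v})  # {w.note}" for w in waypoints]


if __name__ == "__main__":
    scene = perceive("tabletop with mixed recyclables")
    plan = plan_waypoints(scene, "sort the pile of trash by category")
    for command in decode_actions(plan):
        print(command)
```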

Built for Generalization and Scalability

MolmoAct 7B, the inaugural model in its family, was trained on a meticulously curated dataset comprising approximately 12,000 “robot episodes” from real-world environments such as kitchens, bedrooms, and living rooms. These episodes were transformed into reasoning sequences that demonstrate how high-level instructions translate into goal-directed actions. For instance, the dataset includes videos of robots arranging pillows on a couch, organizing laundry in a bedroom, and performing other household tasks.
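As a rough illustration of what such a reasoning sequence might contain, the snippet below sketches one hypothetical episode record. The field names are invented for clarity; the released datasets define the actual schema.

```python
# Hypothetical sketch of a single "robot episode" after it has been converted
# into a reasoning sequence. Field names are illustrative only.
episode = {
    "instruction": "arrange the pillows on the couch",
    "frames": ["frame_0001.jpg", "frame_0002.jpg"],   # 2D camera observations
    "reasoning": [
        {"step": "perception", "summary": "two pillows on the floor, couch to the left"},
        {"step": "waypoints", "summary": "trajectory from pillow A to the couch seat"},
        {"step": "actions", "summary": "grasp, lift, place, release"},
    ],
    "robot_actions": [[0.31, -0.12, 0.05, 1.0]],       # low-level control targets
}

print(episode["instruction"], "->", len(episode["reasoning"]), "reasoning steps")
```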

MolmoAct achieved strong results despite a lean training regimen. The model was pretrained on roughly 18 million samples using 256 NVIDIA H100 GPUs for about 24 hours, then fine-tuned on 64 GPUs for an additional two hours. By comparison, many commercial models demand hundreds of millions of samples and far more computational resources. Even so, MolmoAct excels on key benchmarks, achieving a 71.9% success rate on SimPLER, a testament to high-quality data and thoughtful design over sheer volume and compute.
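For context, those figures imply a modest compute budget, as the back-of-the-envelope calculation below shows.

```python
# Back-of-the-envelope compute estimate from the figures quoted above
# (256 H100 GPUs for ~24 h of pretraining, 64 GPUs for ~2 h of fine-tuning).
pretrain_gpu_hours = 256 * 24    # 6,144 GPU-hours
finetune_gpu_hours = 64 * 2      # 128 GPU-hours
total = pretrain_gpu_hours + finetune_gpu_hours
print(f"Approximate total: {total:,} H100 GPU-hours")  # ~6,272 GPU-hours
```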

Transparent and Adaptable AI

One of MolmoAct’s standout features is its transparency. Unlike conventional robotics models, which function as opaque systems, MolmoAct provides users with clear insights into its planned movements. Motion trajectories are overlaid on camera images, allowing users to preview actions before execution. Moreover, these plans can be adjusted through natural language commands or simple touchscreen sketches, offering granular control and enhancing safety in real-world settings like homes, hospitals, and warehouses.
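A minimal sketch of that kind of preview, assuming nothing beyond NumPy and Matplotlib, is shown below. The frame and the waypoints are synthetic placeholders, not output from MolmoAct.

```python
# Illustrative sketch (not Ai2's tooling) of "motion trajectories overlaid on
# camera images": draw hypothetical planned waypoints on a frame so a user can
# preview the motion before execution.
import numpy as np
import matplotlib.pyplot as plt

frame = np.zeros((480, 640, 3), dtype=np.uint8)               # stand-in for a camera image
waypoints = [(120, 400), (260, 330), (400, 280), (520, 250)]  # hypothetical plan (u, v)

plt.imshow(frame)
us, vs = zip(*waypoints)
plt.plot(us, vs, marker="o", linewidth=2, label="planned trajectory")
plt.legend()
plt.title("Previewing a planned motion before execution")
plt.savefig("trajectory_preview.png")
```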

True to Ai2’s commitment to openness, MolmoAct is fully open-source and reproducible. The institute has released all necessary components to build, run, and extend the model, including training pipelines, pre- and post-training datasets, model checkpoints, and evaluation benchmarks. This comprehensive release ensures that researchers and developers worldwide can leverage MolmoAct to advance the field of embodied AI.

Setting a New Standard for Embodied AI

MolmoAct sets a new standard for what embodied AI should be: safe, interpretable, adaptable, and open. By enabling robots to reason in 3D space and interact intelligently with their surroundings, MolmoAct lays the groundwork for more capable and collaborative AI systems. Ai2 plans to continue testing the model across both simulated and real-world environments, refining its capabilities and expanding its applications.

As part of its mission to democratize AI research, Ai2 has made MolmoAct freely available to the public. Researchers and developers can download the model, training checkpoints, and evaluation tools from Ai2’s Hugging Face repository. With MolmoAct, Ai2 invites the global AI community to join in shaping the future of embodied intelligence.
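For readers who want to try it, the snippet below shows one plausible way to fetch the released files with the huggingface_hub client. The repository id is a placeholder, so check Ai2's Hugging Face organization (huggingface.co/allenai) for the exact model and dataset names.

```python
# One way to fetch the released artifacts, assuming the huggingface_hub client
# is installed. The repo_id below is a placeholder, not a confirmed identifier.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="allenai/MolmoAct-7B")  # placeholder repo id
print("Model files downloaded to:", local_dir)
```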

In unveiling MolmoAct, Ai2 has not only introduced a transformative AI model but also signaled a shift toward a more transparent, inclusive, and impactful approach to robotics and embodied AI. This innovation marks a pivotal step toward AI systems that can truly understand and navigate the physical world alongside humans.

About Ai2

Ai2 is a Seattle-based non-profit AI research institute with the mission of building breakthrough AI to solve the world’s biggest problems. Founded in 2014 by the late Paul G. Allen, Ai2 develops foundational AI research and innovative new applications that deliver real-world impact through large-scale open models, open data, robotics, conservation platforms, and more. Ai2 champions true openness through initiatives like OLMo, the world’s first truly open language model framework, Molmo, a family of open state-of-the-art multimodal AI models, and Tulu, the first application of fully open post-training recipes to the largest open-weight models. These solutions empower researchers, engineers, and tech leaders to participate in the creation of state-of-the-art AI and to directly benefit from the many ways it can advance critical fields like medicine, scientific research, climate science, and conservation efforts. For more information, visit allenai.org.

