The most capable open-source AI model with vision capabilities to date could inspire more developers, researchers, and startups to develop AI agents that can perform useful tasks on your computer.
The Allen Institute for AI (Ai2) today announced Molmo, an open multimodal language model that can interpret images as well as converse through a chat interface. That means it can make sense of a computer screen, potentially helping AI agents perform tasks such as browsing the web, navigating file directories, and writing documents.
“This release makes it possible for more people to adopt multimodal models,” said Ali Farhadi, CEO of Ai2, a Seattle-based nonprofit research institute, and a computer scientist at the University of Washington. “This is going to enable the next generation of apps.”
So-called AI agents are widely touted as the next big thing in AI, with OpenAI, Google, and others racing to develop them. The term has become a buzzword, but the grand vision is for AI to go well beyond chatting and reliably carry out complex, sophisticated actions on a computer when asked, a capability that has yet to be realized at any scale.
Several powerful AI models already have vision capabilities, including OpenAI’s GPT-4, Anthropic’s Claude, and Google DeepMind’s Gemini. These models can be used to power some experimental AI agents, but their inner workings are hidden from view, and they are accessible only through paid application programming interfaces (APIs).
Meta has released a family of AI models called Llama under a license that restricts commercial use, but has not yet made a multimodal version available to developers. Meta is set to announce several new products at its Connect event today, possibly including new Llama AI models.
“Having an open source multimodal model means that any startup or researcher with an idea can make it happen,” said Ofir Press, a postdoctoral researcher at Princeton University who studies AI agents.
Press says that because Molmo is open source, developers will be able to fine-tune agents built on it for specific tasks, such as working with spreadsheets, by providing additional training data. While models such as GPT-4 can be tweaked only to a limited extent through an API, a fully open model can be modified far more extensively. “With an open-source model like this, you have a lot more options,” Press says.
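To illustrate the difference, here is a minimal, hypothetical sketch of what working with open weights can look like. The checkpoint name and the fine-tuning notes are illustrative assumptions based on typical Hugging Face usage, not a documented Ai2 workflow.

```python
# Hypothetical sketch: loading an open-weight multimodal model locally.
# The checkpoint identifier below is an assumption for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name; check Ai2's release

# With open weights, the full model is downloaded and loaded into local memory,
# so every parameter is available to a standard training loop.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# From here a developer could attach any standard fine-tuning recipe, such as a
# Hugging Face Trainer loop or a parameter-efficient LoRA adapter, trained on
# task-specific image-text pairs (for example, spreadsheet screenshots paired
# with the actions an agent should take). None of that is possible when a
# model's weights sit behind a paid API.
```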
Ai2 is releasing Molmo in several sizes today, including a 72-billion-parameter model and a 1-billion-parameter model small enough to run on a mobile device. A model’s parameter count refers to the number of units it contains for storing and manipulating data and roughly corresponds to its capabilities; a billion parameters stored at 16-bit precision occupy only about 2 GB of memory, within reach of a modern phone.
Ai2 says that despite its relatively small size, Molmo was carefully trained on high-quality data, making it as capable as much larger commercial models. The new model is also fully open source and, unlike Meta’s Llama, carries no restrictions on its use. Ai2 has also published the training data used to create the model, giving researchers more detail about how it works.
Publishing powerful models is not without risks: such models could easily be adapted for malicious purposes. For example, we might one day see the emergence of malicious AI agents designed to automate the hacking of computer systems.
Ai2’s Farhadi argues that Molmo’s efficiency and portability will enable developers to build more powerful software agents that run natively on smartphones and other portable devices: “Models with a billion parameters are now performing on par with models at least 10 times larger,” he says.
But building useful AI agents will likely require more than more efficient multimodal models. A key challenge is making the models work more reliably, which will probably require further advances in their reasoning abilities. OpenAI is attempting to address this with its latest model, o1, which demonstrates step-by-step reasoning. The next step may be to give multimodal models similar reasoning capabilities.
For now, the release of Molmo means that AI agents are more accessible than ever before, and they may soon be useful outside of the giant corporations that dominate the AI world.