Meta AI today announced the release of ImageBind, a new AI model that learns from and connects information across six types of data: text, images and video, audio, depth, thermal imagery, and motion-sensor (IMU) readings. ImageBind creates a single shared representation space for these modalities without requiring training data that pairs every possible combination of them.
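To make the shared space concrete, here is a minimal sketch of querying it, modeled on the open-source code Meta released alongside the model (the facebookresearch/ImageBind repository). The file paths and prompts below are placeholders, and the import layout may differ between versions of the repo.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# One encoder per modality, all projecting into a single embedding space.
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Placeholder inputs -- substitute your own files.
text_list = ["a dog barking", "a car engine", "birdsong"]
image_paths = ["dog.jpg", "car.jpg", "bird.jpg"]
audio_paths = ["dog.wav", "car.wav", "bird.wav"]

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Because all modalities share one space, cross-modal comparison is a dot product.
vision_x_audio = embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T
print(torch.softmax(vision_x_audio, dim=-1))  # which sound matches which image?
```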
ImageBind leverages large vision-language models and extends their zero-shot learning abilities to new modalities using naturally paired data. For example, video-audio pairs teach the model the relationship between visual and auditory data.
In the accompanying paper, the researchers show that image-paired data alone suffices to align all six modalities, allowing the model to link content across modalities it has never observed together, such as audio and text. This also means other AI models can take on new modalities without resource-intensive retraining.
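The binding itself is done with a contrastive objective: each modality's encoder learns to place its output near the embedding of the paired image and away from the embeddings of other images in the batch. The snippet below is a simplified, illustrative InfoNCE-style loss, not Meta's exact training code; the embedding dimension, temperature, and batching are assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_loss(image_emb: torch.Tensor, other_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Pull each (image, other-modality) pair together; push mismatched
    pairs in the batch apart. Both inputs are (batch, dim) embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.T / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric: image -> other and other -> image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy batch: 8 image embeddings paired with 8 audio embeddings (assumed dim 1024).
loss = infonce_loss(torch.randn(8, 1024), torch.randn(8, 1024))
print(loss.item())
```

Because every non-image modality is aligned to images with this same objective, modalities that never co-occur in training, such as audio and text, end up near each other transitively, which is the emergent alignment the paper reports.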
ImageBind exhibits strong scaling behavior: its abilities improve significantly in larger versions. Emergent capabilities include predicting which audio clip matches an image and estimating the depth of a scene from a photo. On tasks such as classification and zero-shot retrieval, ImageBind outperforms previous models specialized for a single modality like audio or depth, with gains reaching roughly 40% in some experiments.
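Emergent zero-shot classification follows directly from the shared space: embed the input (say, an audio clip) and a set of text prompts, then pick the prompt whose embedding is most similar, even though audio and text were never paired during training. A minimal sketch, with random vectors standing in for real ImageBind embeddings:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(query_emb: torch.Tensor, prompt_embs: torch.Tensor) -> int:
    """Return the index of the text-prompt embedding most similar to the query.

    query_emb:   (dim,) embedding of, e.g., an audio clip.
    prompt_embs: (num_classes, dim) embeddings of prompts like "the sound of rain".
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), prompt_embs, dim=-1)
    return int(sims.argmax())

# Random stand-ins for real embeddings; assumed dimension 1024.
audio_emb = torch.randn(1024)
prompt_embs = torch.randn(5, 1024)  # five candidate class prompts
print(zero_shot_classify(audio_emb, prompt_embs))
```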
The research opens creative opportunities such as adding sound to videos and images, generating video from images and audio, and segmenting images based on audio. Adding further modalities could produce even richer multimodal AI. However, the field still needs to better understand how these models behave at scale, how to evaluate them, and how to turn their capabilities into applications.
Today's release marks progress in Meta's pursuit of multimodal AI that learns from all available data. Meta AI has invested heavily in vision-related models, with innovations like ImageBind, DINOv2, and SAM supporting applications in AR/VR and the company's ambition to build the metaverse. A demo site for ImageBind is available for researchers to explore its capabilities.
Models like ImageBind inch us closer to human-level intelligence. They demonstrate that, with enough data and computing power, machines can develop capacities approaching the intertwined, multisensory understanding found in human cognition. While still narrow in scope, ImageBind and related work at Meta point to the possibilities ahead in artificial general intelligence, for better and for worse. Researchers worldwide must grapple with managing the risks of advanced AI if progress continues apace. For now, Meta's releases contribute another piece to the puzzle of what tomorrow's most powerful technologies may make possible.