A new benchmark for evaluating multimodal systems based on real-world video, audio, and text data
From the Turing test to ImageNet, benchmarks have played a key role in shaping artificial intelligence (AI) by helping define research goals and allowing researchers to measure progress toward those goals. Incredible advances over the last 10 years, such as AlexNet in computer vision and AlphaFold in protein folding, are closely tied to the use of benchmark datasets, which allow researchers to rank model design and training options and iteratively refine their models. As we work toward the goal of building artificial general intelligence (AGI), developing robust and effective benchmarks that stretch the capabilities of AI models is as important as developing the models themselves.
Perception – the process of experiencing the world through the senses – is an important part of intelligence. Designing agents with a human-level perceptual understanding of the world is a central but challenging task, one that will become increasingly important in robotics, self-driving cars, personal assistants, medical imaging, and more. So today, we introduce the Perception Test, a multimodal benchmark that uses real-world videos to help evaluate a model's perceptual capabilities.
Developing a perception benchmark
Many perception-related benchmarks are currently used in AI research, such as Kinetics for video action recognition, AudioSet for audio event classification, MOT for object tracking, or VQA for image question-answering. These benchmarks have driven tremendous advances in how AI model architectures and training methods are built and developed, but each targets only limited aspects of perception: image benchmarks exclude temporal aspects; visual question-answering focuses on high-level semantic scene understanding; object tracking tasks generally capture the low-level appearance of individual objects, such as color or texture. And very few benchmarks define tasks over both audio and visual modalities.
Multimodal models such as Perceiver, Flamingo, or BEiT-3 aim to be more general perception models. But their evaluations are based on multiple specialized datasets because no dedicated benchmark is available. This process is slow, expensive, and provides incomplete coverage of general perceptual abilities such as memory, making it difficult for researchers to compare methods.
To address several of these issues, we created a dataset of purpose-built videos of real-world activities labeled according to six different types of tasks:
- Object Tracking: A box is provided around an object early in the video; the model must return a complete track of the object throughout the whole video (including through occlusions).
- Point Tracking: A point is selected early in the video; the model must track the point throughout the video (including through occlusions).
- Temporal Action Localization: The model must temporally localize and classify a predefined set of actions.
- Temporal Sound Localization: The model must temporally localize and classify a predefined set of sounds.
- Multiple-Choice Video Question-Answering: Text questions about the video, each with three answer options to choose from.
- Grounded Video Question-Answering: Given a text question about the video, the model must return one or more object tracks as the answer.
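The two tracking tasks above ask the model to output a box (or point) per frame, which is typically scored by comparing predicted and ground-truth locations frame by frame. As a minimal sketch, here is a standard intersection-over-union (IoU) computation averaged over a track; the function names and the exact metric are illustrative assumptions, not the benchmark's official scoring code:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Overlap area is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_track_iou(predicted_track, ground_truth_track):
    """Average per-frame IoU over a track (lists of per-frame boxes)."""
    scores = [iou(p, g) for p, g in zip(predicted_track, ground_truth_track)]
    return sum(scores) / len(scores)
```

A perfect track scores 1.0; a track that never overlaps the ground truth scores 0.0.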
We drew inspiration from the way children's perception is assessed in developmental psychology, as well as from synthetic datasets such as CATER and CLEVRER, and created 37 video scripts, each with different variations, to ensure a balanced dataset. Each variation was filmed by at least a dozen crowd-sourced participants (similar to previous work on Charades and Something-Something), with more than 100 participants in total, resulting in 11,609 videos averaging 23 seconds in length.
The videos show simple games or everyday activities that allow us to define tasks that require the following skills to solve:
- Knowledge of semantics: Testing aspects such as task completion and recognition of objects, actions, or sounds.
- Understanding of physics: Collisions, motion, occlusions, spatial relations.
- Temporal reasoning or memory: Temporal ordering of events, counting over time, identifying changes in a scene.
- Abstraction abilities: Shape matching, same/different notions, pattern discovery.
Crowd-sourced participants labeled the videos with spatial and temporal annotations (object bounding box tracks, point tracks, action segments, sound segments). Our research team designed the questions for the multiple-choice and grounded video question-answering tasks per script type to ensure a good variety of tested skills – for example, questions probing the ability to reason counterfactually or to provide explanations for a given situation. The corresponding answers for each video were again provided by crowd-sourced participants.
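To make the annotation types above concrete, here is a small illustrative record combining an object track, an action segment, and a sound segment. The field names and values are hypothetical assumptions for illustration only, not the benchmark's actual schema:

```python
import json

# Hypothetical annotation record (illustrative; field names are assumptions,
# not the benchmark's actual schema).
record = json.loads("""
{
  "video_id": "example_0001",
  "object_tracks": [
    {"label": "cup",
     "frame_ids": [0, 1],
     "boxes": [[10, 20, 50, 60], [12, 22, 52, 62]]}
  ],
  "action_segments": [
    {"label": "put something into something", "start_s": 1.5, "end_s": 4.0}
  ],
  "sound_segments": [
    {"label": "knock", "start_s": 2.0, "end_s": 2.4}
  ]
}
""")

# Each temporal annotation pairs a label with a start/end time in seconds.
for segment in record["action_segments"]:
    print(f'{segment["label"]}: {segment["start_s"]}-{segment["end_s"]}s')
```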
Evaluating multimodal systems with the Perception Test
We assume that models are pre-trained on external datasets and tasks. The Perception Test includes a small fine-tuning set (20%) that model creators can optionally use to convey the nature of the tasks to their models. The remaining data (80%) consists of a public validation split and a held-out test split where performance can only be evaluated via our evaluation server.
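One simple way to reproduce a fixed fine-tuning/evaluation partition like the 20%/80% split above is to hash each video ID deterministically, so every researcher derives the same assignment. This is only an illustrative sketch; the benchmark's actual splits are fixed by the authors, and the function below is an assumption:

```python
import hashlib

def assign_split(video_id, finetune_fraction=0.2):
    """Deterministically assign a video ID to the fine-tuning or evaluation
    split by hashing it (illustrative; the real splits are predefined)."""
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    # Map the first hash byte to [0, 1] and threshold it.
    fraction = digest[0] / 255.0
    return "finetune" if fraction < finetune_fraction else "evaluation"
```

Because the assignment depends only on the ID, it is stable across runs and machines.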
Here we show a diagram of the evaluation setup: the inputs are a video and audio sequence, plus a task specification. The task can be given in high-level text form, as for visual question-answering, or as low-level input, such as the coordinates of an object's bounding box for the object tracking task.
Evaluation results are reported across several dimensions: we measure abilities across the six computational tasks, and for the visual question-answering tasks we also provide a mapping from the questions to the types of situations shown in the videos and the types of reasoning required to answer them, enabling more detailed analysis (see our paper for details). An ideal model would maximize scores across all radar plots and all dimensions. This detailed assessment of a model's skills allows us to narrow down areas for improvement.
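The per-task results on such radar plots can be summarized as a single headline number, for instance by an unweighted mean across tasks. The aggregation rule and example scores below are assumptions for illustration, not the benchmark's official metric:

```python
def overall_score(per_task_scores):
    """Unweighted mean across tasks (illustrative aggregation, not the
    benchmark's official metric)."""
    return sum(per_task_scores.values()) / len(per_task_scores)

# Hypothetical per-task scores for the six computational tasks.
scores = {
    "object_tracking": 0.6,
    "point_tracking": 0.5,
    "action_localization": 0.4,
    "sound_localization": 0.45,
    "mc_video_qa": 0.7,
    "grounded_video_qa": 0.35,
}
```

Keeping the per-task breakdown alongside any aggregate is what makes weaknesses (here, grounded video QA) visible rather than averaged away.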
Ensuring the diversity of participants and scenes shown in the videos was a critical consideration when developing the benchmark. To achieve this, we selected participants from different countries, of different ethnicities and genders, and aimed for diverse representation within each type of video script.
Learn more about Perception Test
The Perception Test benchmark is publicly available here, and further details are available in our paper. A leaderboard and a challenge server will also be available soon.
On October 23, 2022, we are hosting a workshop about general perception models at the European Conference on Computer Vision (ECCV 2022) in Tel Aviv, where we will discuss our approach, and how to design and evaluate general perception models, with other leading experts in the field.
We hope that the Perception Test will inspire and guide further research toward general perception models. Moving forward, we hope to collaborate with the multimodal research community to introduce additional annotations, tasks, metrics, or even new languages to the benchmark.
Get in touch by emailing firstname.lastname@example.org if you’re interested in contributing!