The non-profit organization MLCommons recently announced the creation of an AI Safety (AIS) working group focused on developing standardized benchmarks to evaluate key aspects of AI system safety.
The AIS working group aims to create an open platform that pools AI safety tests from multiple contributors. The goal is to support the definition of benchmarks that draw from this test pool to produce overall safety scores for systems, similar to rankings like automotive safety ratings.
The initial priority will be advancing the technology for rigorous, reliable AI safety testing. The working group intends to leverage the expertise of its members and the broader AI research community to guide the evolution of benchmarking methodologies.
Specifically, the working group has the following four major tasks:
- Tests: Curate a pool of safety tests from diverse sources, including facilitating the development of better tests and testing methodologies.
- Benchmarks: Define benchmarks for specific AI use-cases, each of which uses a subset of the tests and summarizes the results in a way that enables decision making by non-experts.
- Platform: Develop a community platform for safety testing of AI systems that supports registration of tests, definition of benchmarks, testing of AI systems, management of test results, and viewing of benchmark scores.
- Governance: Define a set of principles and policies and initiate a broad multi-stakeholder process to ensure trustworthy decision making.
"The open nature of these collaborative benchmarks creates real incentives for researchers to align on common goals for AI safety," said Joaquin Vanschoren, a machine learning professor at Eindhoven University of Technology and AIS participant. "As testing matures, we believe standardized benchmarks will become integral to responsible AI development."
Founding participants include Anthropic, Coactive AI, Eindhoven University of Technology, Google, Inflection, Intel, Meta, Microsoft, NVIDIA, OpenAI, Qualcomm, Stanford, the University of Chicago, and others.
In a blog post, Google shared various ways in which it would be supporting MLCommons efforts to develop AI safety benchmarks:
- Testing platform: Joining with other companies in providing funding to support the development of a testing platform.
- Technical expertise and resources: Providing technical expertise and resources, such as the Monk Skin Tone Examples Dataset, to help ensure that the benchmarks are well-designed and effective.
- Datasets: Contributing an internal dataset for multilingual representational bias, as well as already externalized tests for stereotyping harms, such as SeeGULL and SPICE. Additionally, sharing datasets that focus on collecting human annotations responsibly and inclusively, like DICES and SRP.
In the near-term, MLCommons says the focus will be developing benchmarks to evaluate safety for large language models (LLMs). This builds on the work of Stanford's Center for Research on Foundation Models, including its Holistic Evaluation of Language Models (HELM) framework.
Stanford professor and HELM leader Percy Liang noted, "I'm excited to work with MLCommons to leverage HELM for rigorous LLM safety evaluation, which I've been focused on for years but has become extremely critical given recent AI advancements."
In addition to utilizing HELM, several companies plan to contribute internal safety tests to expand the open pool of evaluations. Over time, the goal is applying these benchmarks more broadly as methodologies mature.
The working group noted standardized safety benchmarks can eventually inform responsible AI development, aligning with ethical frameworks like the EU's upcoming AI Act.
"We believe benchmarks will prove vital for realizing benefits of AI while managing risks," said David Kanter, MLCommons Executive Director. "In collaborating across our community, we aim to build robust benchmarks beginning with open-source models, later expanding to commercial applications."
The concerted efforts of industry leaders, academia, and AI practitioners underline the collective endeavor to ensure responsible AI development. As AI technologies continue to evolve, such collaborative initiatives will be instrumental in navigating the path to AI safety and setting standards for the future.