Two of San Francisco’s leading players in artificial intelligence have challenged the public to come up with questions that can test the capabilities of large language models (LLMs) like Google Gemini and OpenAI’s o1. Scale AI, which specializes in preparing the vast troves of data on which LLMs are trained, teamed up with the Center for AI Safety (CAIS) to launch the initiative, Humanity’s Last Exam.
Featuring prizes of US$5,000 (£3,800) for those who come up with the top 50 questions selected for the test, Scale and CAIS say the goal is to test how close we are to achieving “expert-level AI systems” using the “largest, broadest coalition of experts in history.”
Why do this? The leading LLMs are already acing many established tests of intelligence, mathematics and law, but it’s hard to be sure how meaningful this is. In many cases, they may have effectively pre-learned the answers, since they are trained on gargantuan quantities of data, including a significant percentage of everything on the internet.