
Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence

  • Norbert Tihanyi
  • Tamas Bisztray
  • Richard A. Dubniczky
  • Rebeka Toth
  • Bertalan Borsos
  • Bilel Cherif
  • Ridhi Jain
  • Lajos Muzsai
  • Mohamed Amine Ferrag
  • Ryan Marinelli
  • Lucas C. Cordeiro
  • Merouane Debbah
  • Vasileios Mavroeidis
  • Audun Jøsang
  • University of Oslo
  • Eötvös Loránd University
  • Guelma University
  • University of Manchester

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

6 Scopus citations

Abstract

As machine intelligence evolves, the need to test and compare the problem-solving abilities of different AI models grows. However, current benchmarks are often simplistic, allowing models to perform uniformly well and making it difficult to distinguish their capabilities. Additionally, benchmarks typically rely on static question-answer pairs that the models might memorize or guess. To address these limitations, we introduce Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models using dynamic question templates and improved metrics across multiple disciplines such as mathematics, cryptography, cybersecurity, and computer science. The accompanying dataset, DIA-Bench, contains a diverse collection of challenge templates with mutable parameters presented in various formats, including text, PDFs, compiled binaries, visual puzzles, and CTF-style cybersecurity challenges. Our framework introduces four new metrics to assess a model's reliability and confidence across multiple attempts. These metrics revealed that even simple questions are frequently answered incorrectly when posed in varying forms, highlighting significant gaps in models' reliability. Notably, API models like GPT-4o often overestimated their mathematical capabilities, while ChatGPT-4o demonstrated better performance due to effective tool usage. In self-assessment, OpenAI's o1-mini proved to have the best judgement on what tasks it should attempt to solve. We evaluated 25 state-of-the-art LLMs using DIA-Bench, showing that current models struggle with complex tasks and often display unexpectedly low confidence, even with simpler questions. The DIA framework sets a new standard for assessing not only problem-solving, but also a model's adaptive intelligence and ability to assess its limitations. The dataset is publicly available on the project's page: https://github.com/DIA-Bench.
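
The abstract does not spell out the template format or name the four metrics, so the following is only a minimal sketch of the general idea: a question template with mutable parameters is instantiated several times, and a solver is scored across all instances rather than on one fixed question. Every name here (`Template`, `modexp_template`, `evaluate`, the `all_correct` score) is hypothetical and is not taken from DIA-Bench.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A generated instance: (question text, ground-truth answer).
Instance = Tuple[str, int]


@dataclass
class Template:
    """Hypothetical dynamic template: each call yields a fresh instance."""
    name: str
    generate: Callable[[random.Random], Instance]


def modexp_template(rng: random.Random) -> Instance:
    # Mutable parameters are drawn at generation time, so a memorized
    # answer to one instance does not transfer to the next one.
    base = rng.randint(2, 50)
    exp = rng.randint(10, 99)
    mod = rng.randint(101, 999)
    return f"Compute {base}^{exp} mod {mod}.", pow(base, exp, mod)


MODEXP = Template(name="modular-exponentiation", generate=modexp_template)


def make_instances(template: Template, k: int, seed: int) -> List[Instance]:
    # Seeding keeps an evaluation run reproducible.
    rng = random.Random(seed)
    return [template.generate(rng) for _ in range(k)]


def evaluate(template: Template, solver: Callable[[str], int],
             k: int = 5, seed: int = 0) -> Dict[str, object]:
    # Illustrative scores only (an assumption, not the paper's metrics):
    # plain accuracy, plus a stricter flag that is True only when every
    # one of the k variants was answered correctly.
    instances = make_instances(template, k, seed)
    correct = [solver(question) == answer for question, answer in instances]
    return {"accuracy": sum(correct) / k, "all_correct": all(correct)}


def perfect_solver(question: str) -> int:
    # Stand-in for a model that uses a tool to compute the answer.
    m = re.fullmatch(r"Compute (\d+)\^(\d+) mod (\d+)\.", question)
    if m is None:
        raise ValueError("unrecognized question format")
    base, exp, mod = (int(g) for g in m.groups())
    return pow(base, exp, mod)


def constant_solver(question: str) -> int:
    # Stand-in for a model that returns a fixed answer; -1 is never a
    # valid result of a modular reduction, so it is always wrong.
    return -1
```

Calling `evaluate(MODEXP, perfect_solver)` and `evaluate(MODEXP, constant_solver)` contrasts a solver that handles every variant with one that does not. The stricter `all_correct` flag is meant to illustrate why scoring across multiple attempts can expose unreliability that a single static question would hide.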

Original language: British English
Title of host publication: Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
Editors: Wei Ding, Chang-Tien Lu, Fusheng Wang, Liping Di, Kesheng Wu, Jun Huan, Raghu Nambiar, Jundong Li, Filip Ilievski, Ricardo Baeza-Yates, Xiaohua Hu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3313-3321
Number of pages: 9
ISBN (Electronic): 9798350362480
DOIs
State: Published - 2024
Event: 2024 IEEE International Conference on Big Data, BigData 2024 - Washington, United States
Duration: 15 Dec 2024 to 18 Dec 2024

Publication series

Name: Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024

Conference

Conference: 2024 IEEE International Conference on Big Data, BigData 2024
Country/Territory: United States
City: Washington
Period: 15/12/24 to 18/12/24

Keywords

  • Artificial Intelligence
  • Dynamic Benchmarking
  • Large Language Models
  • Performance Metrics
  • Reliability
