Evaluating the quality of medical content on YouTube using large language models

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

YouTube has become a dominant source of medical information for health-related decision-making. Yet many videos on the platform contain inaccurate or biased information. Although expert reviews could help mitigate this problem, the vast number of daily uploads makes that solution impractical. In this study, we explored the potential of Large Language Models (LLMs) to assess the quality of medical content on YouTube. We collected a set of videos previously evaluated by experts and prompted twenty models to rate their quality using the DISCERN instrument. We then analyzed the inter-rater agreement between the language models’ and the experts’ ratings using Brennan–Prediger’s (BP) Kappa. We found that LLMs exhibited a wide range of inter-rater agreement with the experts (ranging from −1.10 to 0.82). All models tended to give higher scores than the human experts. Agreement on individual questions tended to be lower, with some questions showing significant disagreement between models and experts. Including scoring guidelines in the prompt improved model performance. We conclude that some LLMs are capable of evaluating the quality of medical videos. Whether used as stand-alone expert systems or embedded into traditional recommender systems, these models can mitigate the quality issue of health-related online videos.
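As a minimal sketch of the agreement measure named in the abstract, the Brennan–Prediger coefficient corrects observed agreement by a fixed chance term of 1/q, where q is the number of rating categories (DISCERN items use a 1–5 scale). The ratings below are invented for illustration and are not data from the study.

```python
# Brennan-Prediger (BP) kappa between two raters on a q-category scale.
# BP = (p_o - 1/q) / (1 - 1/q), where p_o is the observed agreement rate.

def brennan_prediger_kappa(ratings_a, ratings_b, q=5):
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    # Observed proportion of exact agreement between the two raters
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)
    p_e = 1.0 / q  # chance agreement assuming uniform category use
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical DISCERN item ratings for five videos (expert vs. LLM)
expert = [4, 3, 5, 2, 4]
model  = [4, 4, 5, 2, 3]
print(round(brennan_prediger_kappa(expert, model), 3))  # → 0.5
```

Unlike Cohen's kappa, the chance term here does not depend on the raters' marginal distributions, which makes BP kappa more stable when raters (such as LLMs that skew toward high scores) use the categories unevenly.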

Original language: British English
Article number: 9906
Journal: Scientific Reports
Volume: 15
Issue number: 1
DOIs
State: Published - Dec 2025

Keywords

  • Content quality
  • LLMs
  • Medical content
  • YouTube
