"Can we trust them?" An expert evaluation of large language models to provide sleep and jet lag recommendations for athletes
Background: With the increasing use of artificial intelligence in healthcare and sports science, large language models (LLMs) are being explored as tools for delivering personalized, evidence-based guidance to athletes.
Objective: This study evaluated the capabilities of LLMs (ChatGPT-3.5, ChatGPT-4, and Google Bard) to deliver evidence-based advice on sleep and jet lag for athletes.
Methods: Conducted in two phases between January and June 2024, the study first identified ten frequently asked questions on these topics with input from experts and LLMs. In the second phase, 20 experts (mean age 43.9 ± 9.0 years; ten females, ten males) assessed LLM responses using Google Forms surveys administered at two intervals (T1 and T2). Inter-rater reliability was evaluated using Fleiss' Kappa, intra-rater agreement using the Jaccard Similarity Index (JSI), and content validity through the content validity ratio (CVR). Differences among LLMs were analyzed using Friedman and Chi-square tests.
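For reference, two of the agreement and validity measures named in the Methods have simple closed forms. The sketch below (not taken from the study; the ratings are hypothetical) shows the standard definitions of the Jaccard Similarity Index and Lawshe's content validity ratio, the latter being the CVR variant most commonly used in expert-panel validation.

```python
# Illustrative sketch of two metrics from the Methods; all inputs hypothetical.

def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard Similarity Index: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # two empty rating sets agree perfectly by convention
    return len(a & b) / len(a | b)

def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR: (n_e − N/2) / (N/2), where n_e experts rate an item valid."""
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical intra-rater check: the items one expert rated "appropriate"
# at T1 and T2, with six items in common.
t1 = {1, 2, 3, 4, 5, 6, 7, 8}
t2 = {1, 2, 3, 4, 5, 6, 9, 10}
print(jaccard_similarity(t1, t2))        # 6 / 10 = 0.6

# Hypothetical panel of 20 experts, 17 of whom rate an item as valid.
print(content_validity_ratio(17, 20))    # (17 − 10) / 10 = 0.7
```

A CVR above roughly 0.5 (for a 20-member panel) is conventionally treated as acceptable, which is why values such as 0.67–0.68 in the Results clear the validity threshold while 0 does not.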
Results: Experts' response rates were high (100% at T1 and 95% at T2). Inter-rater reliability was minimal (Fleiss' Kappa: 0.21-0.39), while intra-rater agreement was high, with 53% of experts achieving a JSI ≥ 0.75. ChatGPT-4 had the highest CVR for sleep (0.67) and was the only model with a valid CVR for jet lag (0.68). Google Bard showed the lowest CVR for jet lag (0), with significant differences compared to ChatGPT-3.5 (p = 0.0073) and ChatGPT-4 (p < 0.0001). Reasons for inappropriate responses varied significantly for jet lag (p < 0.0001), with Google Bard criticized for insufficient information and frequent errors. Overall, ChatGPT-4 outperformed the other models.
Conclusions: This study highlights the potential of LLMs, particularly ChatGPT-4, to provide evidence-based advice on sleep but underscores the need for improved accuracy and validation for jet lag recommendations.
Key Points:
ChatGPT-4 consistently delivered more accurate and professional answers, particularly in addressing sleep-related questions, outperforming both ChatGPT-3.5 and Google Bard.
While large language models (LLMs) performed well on sleep-related queries, they struggled with more complex topics like jet lag, highlighting the need for further development and refinement.
LLMs can enhance information accessibility and decision-making in sports science but must be used as part of an expert-driven process to ensure accuracy and consistency.
© Copyright 2026 Sports Medicine. Springer. All rights reserved.
| Keywords: | |
|---|---|
| Classifications: | Life sciences and sports medicine; Natural sciences and technology; Education and research |
| Tagging: | artificial intelligence; trust |
| Published in: | Sports Medicine |
| Language: | English |
| Published: | 2026 |
| Volume: | 56 |
| Issue: | 1 |
| Pages: | 257-270 |
| Document type: | Article |
| Level: | high |