AI in Clinical Medicine, ISSN 2819-7437 online, Open Access
Article copyright, the authors; Journal compilation copyright, AI Clin Med and Elmer Press Inc
Journal website https://aicm.elmerpub.com

Original Article

Volume 1, 2025, e12


Evaluating ChatGPT-5o as a Clinical Decision-Support Tool in Inflammatory Bowel Disease: A Pilot Study of Guideline Adherence and Clinical Agreement

Figures

Figure 1. Bar chart of Cohen’s Kappa values by category and comparison.
Figure 2. Cohen’s Kappa agreement levels across categories. The color scale represents the level of agreement: dark blue (1.0), perfect agreement (e.g., anti-IL-23, antibiotics, diagnostic workup, symptom management, surgical consultation, continuous monitoring); medium blue (≈ 0.6 - 0.8), substantial agreement; light blue (≈ 0.3 - 0.5), moderate to fair agreement, indicating some variability in alignment.

Tables

Table 1. Interpretation of Agreement Levels

| Agreement level | Interpretation |
| --- | --- |
| Perfect (Kappa = 1.000) | Complete alignment, no variation between recommendations. |
| Substantial (Kappa ≈ 0.6 - 0.8) | High agreement with some minor differences, reflecting consistent recommendations. |
| Moderate to fair (Kappa ≈ 0.3 - 0.5) | Notable variability, indicating different interpretations or thresholds in decision-making. |
| Non-significant | Agreement level is not statistically meaningful; recommendations may vary substantially. |
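As a reminder of what the Kappa values in Table 1 measure, Cohen’s Kappa compares the observed agreement between two raters, p_o, with the agreement expected by chance, p_e, as (p_o - p_e) / (1 - p_e). The following is a minimal sketch in Python, not the authors' analysis code. The function name `cohen_kappa` and the toy recommendation lists are hypothetical and serve only to illustrate the calculation; they are not data from the study.

```python
import math
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labelling the same cases (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of cases.")
    n = len(rater_a)

    # Observed agreement: share of cases where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of each rater's label frequency.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    # If both raters always used one identical label, Kappa is undefined (0 / 0).
    if math.isclose(p_e, 1.0):
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Toy data (illustrative only): 1 = therapy recommended, 0 = not recommended.
clinician = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
model = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohen_kappa(clinician, model)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 8 of 10 cases and chance agreement is 0.5, so Kappa is approximately 0.6, which Table 1 would describe as substantial agreement.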

 

Table 2. Summary of Cohen’s Kappa Agreement for Each Category

| Category | Human vs. guidelines Kappa | Human vs. ChatGPT Kappa | ChatGPT vs. guidelines Kappa | Agreement level | Significance |
| --- | --- | --- | --- | --- | --- |
| 5-ASA | 0.689 | 0.689 | 0.604 | Substantial | Yes |
| Steroids | 0.406 | 0.578 | 0.771 | Moderate to substantial | Yes |
| Anti-TNF | 0.289 | 0.345 | 0.685 | Fair to substantial | Yes |
| Anti-integrins | 0.486 | 0.336 | 0.771 | Fair to substantial | Yes |
| Anti-IL-23 | 1.000 | 1.000 | 1.000 | Perfect | Yes |
| Thiopurines | 0.441 | 0.313 | 0.771 | Fair to substantial | No |
| Antibiotics | 0.642 | 1.000 | 1.000 | Substantial to perfect | Yes |
| Diagnostic workup | 1.000 | 1.000 | 1.000 | Perfect | Yes |
| Symptom management | - | 1.000 | 1.000 | Perfect | Yes |
| Surgical consult | - | 1.000 | 1.000 | Perfect | Yes |
| Continuous monitoring | - | 1.000 | 1.000 | Perfect | Yes |

5-ASA: 5-aminosalicylic acid; IL: interleukin; TNF: tumor necrosis factor.
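The "Agreement level" column in Table 2 appears consistent with reading each row's lowest and highest Kappa against the conventional Landis and Koch bands (0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, 0.81 to 1.00 almost perfect), with 1.000 labelled "Perfect". This section does not state which scale the authors applied, so the cut-offs below are an assumption, and the helper names `band` and `agreement_range` are hypothetical. The sketch only shows how such a range label could be derived from the three Kappa values in a row.

```python
def band(kappa):
    """Map one Kappa value to a verbal band (assumed Landis and Koch cut-offs)."""
    if kappa >= 1.0:
        return "Perfect"
    if kappa > 0.80:
        return "Almost perfect"
    if kappa > 0.60:
        return "Substantial"
    if kappa > 0.40:
        return "Moderate"
    if kappa > 0.20:
        return "Fair"
    if kappa > 0.0:
        return "Slight"
    return "Poor"


def agreement_range(kappas):
    """Describe a row by the bands of its lowest and highest available Kappa."""
    available = [k for k in kappas if k is not None]  # None marks a "-" cell
    if not available:
        raise ValueError("At least one Kappa value is required.")
    low, high = band(min(available)), band(max(available))
    return low if low == high else f"{low} to {high.lower()}"


# Rows taken from Table 2: (Human vs. guidelines, Human vs. ChatGPT, ChatGPT vs. guidelines).
rows = {
    "5-ASA": (0.689, 0.689, 0.604),
    "Steroids": (0.406, 0.578, 0.771),
    "Anti-TNF": (0.289, 0.345, 0.685),
    "Antibiotics": (0.642, 1.000, 1.000),
    "Symptom management": (None, 1.000, 1.000),
}
for category, kappas in rows.items():
    print(f"{category}: {agreement_range(kappas)}")
```

Under these assumed cut-offs, the labels produced for the five rows shown match the "Agreement level" entries in Table 2.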

 

Table 3. Explanation of Cohen’s Kappa Agreement for Each Category

| Category | Human vs. guidelines Kappa | Human vs. guidelines P-value | Human vs. ChatGPT Kappa | Human vs. ChatGPT P-value | ChatGPT vs. guidelines Kappa | ChatGPT vs. guidelines P-value |
| --- | --- | --- | --- | --- | --- | --- |
| 5-ASA | 0.689 | 0.002 | 0.689 | 0.002 | 0.604 | 0.008 |
| Steroids | 0.406 | 0.028 | 0.578 | 0.005 | 0.771 | 0.001 |
| Anti-TNF | 0.289 | 0.073 | 0.345 | 0.047 | 0.685 | 0.003 |
| Anti-integrins | 0.486 | 0.013 | 0.336 | 0.05 | 0.771 | 0.001 |
| Anti-IL-23 | 1 | 0 | 1 | 0 | 1 | 0 |
| Thiopurines | 0.441 | 0.054 | 0.313 | 0.161 | 0.771 | 0.001 |
| Antibiotics | 0.642 | 0.003 | 1 | 0 | 1 | 0 |
| Diagnostic workup | 1 | 0 | 1 | 0 | 1 | 0 |
| Symptom management | - | - | 1 | 0 | 1 | 0 |
| Surgical consult | - | - | 1 | 0 | 1 | 0 |
| Continuous monitoring | - | - | 1 | 0 | 1 | 0 |

5-ASA: 5-aminosalicylic acid; IL: interleukin; TNF: tumor necrosis factor.