MIRAGE

A Benchmark for Medical Information Retrieval-Augmented Generation Evaluation
| Rank | Model | Affiliation (Reference) | LLM | Corpus | Retriever | MMLU-Med (%) | MedQA-US (%) | MedMCQA (%) | PubMedQA* (%) | BioASQ-Y/N (%) | Average |
|------|-------|-------------------------|-----|--------|-----------|--------------|--------------|-------------|---------------|----------------|---------|
| 1 | GPT-4 (MedRAG) | UVa & NIH (Xiong et al., 2024) | GPT-4-32k-0613 | MedCorp | RRF-4 | 87.24 | 82.80 | 66.65 | 70.60 | 92.56 | 79.97 |
| 2 | Llama-3 (MedRAG) | UVa & NIH (Xiong et al., 2024) | Llama-3-70B | MedCorp | RRF-4 | 85.58 | 76.90 | 68.87 | 71.60 | 89.97 | 78.59 |
| 3 | Llama-3 (CoT) | Meta (Meta et al., 2024) | Llama-3-70B | -- | -- | 85.77 | 80.91 | 70.93 | 59.00 | 83.01 | 75.92 |
| 4 | GPT-4 (CoT) | OpenAI (OpenAI et al., 2023) | GPT-4-32k-0613 | -- | -- | 89.44 | 83.97 | 69.88 | 39.60 | 84.30 | 73.44 |
| 5 | GPT-3.5 (MedRAG) | UVa & NIH (Xiong et al., 2024) | GPT-3.5-16k-0613 | MedCorp | RRF-4 | 75.48 | 66.61 | 58.04 | 67.40 | 90.29 | 71.57 |
| 6 | Gemini-1.0-Pro (MedRAG) | UVa & NIH (Xiong et al., 2024) | Gemini-1.0-Pro | MedCorp | RRF-4 | 73.65 | 61.90 | 59.65 | 74.60 | 86.89 | 71.34 |
| 7 | Mixtral (MedRAG) | UVa & NIH (Xiong et al., 2024) | Mixtral-8x7B | MedCorp | RRF-4 | 75.85 | 60.02 | 56.42 | 67.60 | 87.54 | 69.48 |
| 8 | Gemini-1.0-Pro (CoT) | Google (Google et al., 2024) | Gemini-1.0-Pro | -- | -- | 72.54 | 60.49 | 55.44 | 46.40 | 76.86 | 62.35 |
| 9 | Mixtral (CoT) | Mistral AI (Jiang et al., 2024) | Mixtral-8x7B | -- | -- | 74.01 | 64.10 | 56.28 | 35.20 | 77.51 | 61.42 |
| 10 | GPT-3.5 (CoT) | OpenAI (Brown et al., 2020) | GPT-3.5-16k-0613 | -- | -- | 72.91 | 65.04 | 55.25 | 36.00 | 74.27 | 60.69 |
| 11 | MEDITRON (MedRAG) | UVa & NIH (Xiong et al., 2024) | MEDITRON-70B | MedCorp | RRF-4 | 65.38 | 49.57 | 52.67 | 56.40 | 76.86 | 60.18 |
| 12 | MEDITRON (CoT) | EPFL (Chen et al., 2023) | MEDITRON-70B | -- | -- | 64.92 | 51.69 | 46.74 | 53.40 | 68.45 | 57.04 |
| 13 | Llama-2 (MedRAG) | UVa & NIH (Xiong et al., 2024) | Llama-2-70B | MedCorp | RRF-4 | 54.55 | 44.93 | 43.08 | 50.40 | 73.95 | 53.38 |
| 14 | PMC-LLaMA (MedRAG) | UVa & NIH (Xiong et al., 2024) | PMC-LLaMA-13B | MedCorp | RRF-4 | 52.53 | 42.58 | 48.29 | 56.00 | 65.21 | 52.92 |
| 15 | PMC-LLaMA (CoT) | SJTU (Wu et al., 2023) | PMC-LLaMA-13B | -- | -- | 52.16 | 44.38 | 46.55 | 55.80 | 63.11 | 52.40 |
| 16 | Llama-2 (CoT) | Meta (Touvron et al., 2023) | Llama-2-70B | -- | -- | 57.39 | 47.84 | 42.60 | 42.20 | 61.17 | 50.24 |

Rows are ranked by the Average column, the mean of the five task accuracies. "--" marks chain-of-thought (CoT) baselines that use no retrieval corpus or retriever.