MIRAGE

A Benchmark for Medical Information Retrieval-Augmented Generation Evaluation
| System | LLM | Corpus | Retriever | MMLU-Med (%) | MedQA-US (%) | MedMCQA (%) | PubMedQA* (%) | BioASQ-Y/N (%) | Average |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4 + MedRAG, UVa & NIH (Xiong et al., 2024) | GPT-4-32k-0613 (OpenAI et al., 2023) | MedCorp | RRF-4 | 87.24 | 82.80 | 66.65 | 70.60 | 92.56 | 79.97 |
| Llama-3 + MedRAG, UVa & NIH (Xiong et al., 2024) | Llama-3-70B (Dubey et al., 2024) | MedCorp | RRF-4 | 85.58 | 76.90 | 68.87 | 71.60 | 89.97 | 78.59 |
| Llama-3 + CoT, UVa & NIH (Xiong et al., 2024) | Llama-3-70B (Dubey et al., 2024) | -- | -- | 85.77 | 80.91 | 70.93 | 59.00 | 83.01 | 75.92 |
| SimRAG 8B, Emory & Amazon (Xu et al., 2024) | Llama-3-8B (Dubey et al., 2024) | MedCorp | RRF-4 | 75.57 | 62.92 | 67.51 | 80.00 | 91.75 | 75.55 |
| SimRAG 27B, Emory & Amazon (Xu et al., 2024) | Gemma-2-27B (Team et al., 2024) | MedCorp | RRF-4 | 81.63 | 63.63 | 64.16 | 73.60 | 92.07 | 75.02 |
| GPT-4 + CoT, UVa & NIH (Xiong et al., 2024) | GPT-4-32k-0613 (OpenAI et al., 2023) | -- | -- | 89.44 | 83.97 | 69.88 | 39.60 | 84.30 | 73.44 |
| GPT-3.5 + MedRAG, UVa & NIH (Xiong et al., 2024) | GPT-3.5-16k-0613 (Brown et al., 2020) | MedCorp | RRF-4 | 75.48 | 66.61 | 58.04 | 67.40 | 90.29 | 71.57 |
| Gemini-1.0-Pro + MedRAG, UVa & NIH (Xiong et al., 2024) | Gemini-1.0-Pro (Google et al., 2024) | MedCorp | RRF-4 | 73.65 | 61.90 | 59.65 | 74.60 | 86.89 | 71.34 |
| Mixtral + MedRAG, UVa & NIH (Xiong et al., 2024) | Mixtral-8x7B (Jiang et al., 2024) | MedCorp | RRF-4 | 75.85 | 60.02 | 56.42 | 67.60 | 87.54 | 69.48 |
| Orca-3 + MedRAG, Microsoft (Mitra et al., 2024) | Orca-3-7B (Mitra et al., 2024) | MedCorp | RRF-4 | 71.17 | 51.85 | 57.95 | 58.20 | 82.20 | 64.27 |
| Gemini-1.0-Pro + CoT, UVa & NIH (Xiong et al., 2024) | Gemini-1.0-Pro (Google et al., 2024) | -- | -- | 72.54 | 60.49 | 55.44 | 46.40 | 76.86 | 62.35 |
| Mixtral + CoT, UVa & NIH (Xiong et al., 2024) | Mixtral-8x7B (Jiang et al., 2024) | -- | -- | 74.01 | 64.10 | 56.28 | 35.20 | 77.51 | 61.42 |
| GPT-3.5 + CoT, UVa & NIH (Xiong et al., 2024) | GPT-3.5-16k-0613 (Brown et al., 2020) | -- | -- | 72.91 | 65.04 | 55.25 | 36.00 | 74.27 | 60.69 |
| MEDITRON + MedRAG, UVa & NIH (Xiong et al., 2024) | MEDITRON-70B (Chen et al., 2023) | MedCorp | RRF-4 | 65.38 | 49.57 | 52.67 | 56.40 | 76.86 | 60.18 |
| Orca-3 + CoT, Microsoft (Mitra et al., 2024) | Orca-3-7B (Mitra et al., 2024) | -- | -- | 71.35 | 55.38 | 51.33 | 27.80 | 75.24 | 56.22 |
| MEDITRON + CoT, UVa & NIH (Xiong et al., 2024) | MEDITRON-70B (Chen et al., 2023) | -- | -- | 64.92 | 51.69 | 46.74 | 53.40 | 68.45 | 57.04 |
| Llama-2 + MedRAG, UVa & NIH (Xiong et al., 2024) | Llama-2-70B (Touvron et al., 2023) | MedCorp | RRF-4 | 54.55 | 44.93 | 43.08 | 50.40 | 73.95 | 53.38 |
| PMC-LLaMA + MedRAG, UVa & NIH (Xiong et al., 2024) | PMC-LLaMA-13B (Wu et al., 2023) | MedCorp | RRF-4 | 52.53 | 42.58 | 48.29 | 56.00 | 65.21 | 52.92 |
| PMC-LLaMA + CoT, UVa & NIH (Xiong et al., 2024) | PMC-LLaMA-13B (Wu et al., 2023) | -- | -- | 52.16 | 44.38 | 46.55 | 55.80 | 63.11 | 52.40 |
| Orca-2.5 + CoT, Microsoft (Mitra et al., 2024) | Orca-2.5-7B (Mitra et al., 2024) | -- | -- | 63.91 | 51.37 | 43.65 | 29.60 | 71.04 | 51.92 |
| Llama-2 + CoT, UVa & NIH (Xiong et al., 2024) | Llama-2-70B (Touvron et al., 2023) | -- | -- | 57.39 | 47.84 | 42.60 | 42.20 | 61.17 | 50.24 |
| Orca-2.5 + MedRAG, Microsoft (Mitra et al., 2024) | Orca-2.5-7B (Mitra et al., 2024) | MedCorp | RRF-4 | 53.72 | 37.08 | 39.23 | 19.00 | 69.09 | 43.62 |
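
The Average column appears to be the unweighted mean of the five task accuracies. The Python sketch below checks that reading on four rows copied from the leaderboard and orders systems by their reported average. The helper names (`unweighted_average`, `rank_by_average`, `rag_gain`) are illustrative assumptions and not part of any MIRAGE or MedRAG tooling. Recomputed means can differ from the reported ones by 0.01, presumably because the per-task scores shown here are already rounded.

```python
from statistics import mean

# Per-task scores in leaderboard order:
# MMLU-Med, MedQA-US, MedMCQA, PubMedQA*, BioASQ-Y/N, followed by the reported Average.
ROWS = {
    "GPT-4 + MedRAG": ([87.24, 82.80, 66.65, 70.60, 92.56], 79.97),
    "GPT-4 + CoT": ([89.44, 83.97, 69.88, 39.60, 84.30], 73.44),
    "Llama-3 + MedRAG": ([85.58, 76.90, 68.87, 71.60, 89.97], 78.59),
    "Llama-3 + CoT": ([85.77, 80.91, 70.93, 59.00, 83.01], 75.92),
}


def unweighted_average(scores):
    """Mean of the task accuracies, rounded to two decimals (assumed definition)."""
    return round(mean(scores), 2)


def rank_by_average(rows):
    """System names sorted by reported average, best first."""
    return sorted(rows, key=lambda name: rows[name][1], reverse=True)


def rag_gain(model, rows=ROWS):
    """Reported average with MedRAG minus reported average with CoT only."""
    return round(rows[f"{model} + MedRAG"][1] - rows[f"{model} + CoT"][1], 2)


if __name__ == "__main__":
    for name in rank_by_average(ROWS):
        scores, reported = ROWS[name]
        print(f"{name}: reported {reported:.2f}, recomputed {unweighted_average(scores):.2f}")
    for model in ("GPT-4", "Llama-3"):
        print(f"{model}: MedRAG gain over CoT = {rag_gain(model):+.2f} points")
```

The same functions work on any other row pair from the table, as long as both the "+ MedRAG" and "+ CoT" entries for a model are added to `ROWS`.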