Rank | System | LLM | Corpus | Retriever | MMLU-Med (%) | MedQA-US (%) | MedMCQA (%) | PubMedQA* (%) | BioASQ-Y/N (%) | Average (%) |
---|---|---|---|---|---|---|---|---|---|---|
0 | GPT-4 + MedRAG, UVa & NIH (Xiong et al., 2024) | GPT-4-32k-0613 (OpenAI et al., 2023) | MedCorp | RRF-4 | 87.24 | 82.80 | 66.65 | 70.60 | 92.56 | 79.97 |
0 | Llama-3 + MedRAG, UVa & NIH (Xiong et al., 2024) | Llama-3-70B (Dubey et al., 2024) | MedCorp | RRF-4 | 85.58 | 76.90 | 68.87 | 71.60 | 89.97 | 78.59 |
0 | Llama-3 + CoT, UVa & NIH (Xiong et al., 2024) | Llama-3-70B (Dubey et al., 2024) | -- | -- | 85.77 | 80.91 | 70.93 | 59.00 | 83.01 | 75.92 |
0 | SimRAG 8B, Emory & Amazon (Xu et al., 2024) | Llama-3-8B (Dubey et al., 2024) | MedCorp | RRF-4 | 75.57 | 62.92 | 67.51 | 80.00 | 91.75 | 75.55 |
0 | SimRAG 27B, Emory & Amazon (Xu et al., 2024) | Gemma-2-27B (Team et al., 2024) | MedCorp | RRF-4 | 81.63 | 63.63 | 64.16 | 73.60 | 92.07 | 75.02 |
0 | GPT-4 + CoT, UVa & NIH (Xiong et al., 2024) | GPT-4-32k-0613 (OpenAI et al., 2023) | -- | -- | 89.44 | 83.97 | 69.88 | 39.60 | 84.30 | 73.44 |
0 | GPT-3.5 + MedRAG, UVa & NIH (Xiong et al., 2024) | GPT-3.5-16k-0613 (Brown et al., 2020) | MedCorp | RRF-4 | 75.48 | 66.61 | 58.04 | 67.40 | 90.29 | 71.57 |
0 | Gemini-1.0-Pro + MedRAG, UVa & NIH (Xiong et al., 2024) | Gemini-1.0-Pro (Google et al., 2024) | MedCorp | RRF-4 | 73.65 | 61.90 | 59.65 | 74.60 | 86.89 | 71.34 |
0 | Mixtral + MedRAG, UVa & NIH (Xiong et al., 2024) | Mixtral-8x7B (Jiang et al., 2024) | MedCorp | RRF-4 | 75.85 | 60.02 | 56.42 | 67.60 | 87.54 | 69.48 |
0 | Orca-3 + MedRAG, Microsoft (Mitra et al., 2024) | Orca-3-7B (Mitra et al., 2024) | MedCorp | RRF-4 | 71.17 | 51.85 | 57.95 | 58.20 | 82.20 | 64.27 |
0 | Gemini-1.0-Pro + CoT, UVa & NIH (Xiong et al., 2024) | Gemini-1.0-Pro (Google et al., 2024) | -- | -- | 72.54 | 60.49 | 55.44 | 46.40 | 76.86 | 62.35 |
0 | Mixtral + CoT, UVa & NIH (Xiong et al., 2024) | Mixtral-8x7B (Jiang et al., 2024) | -- | -- | 74.01 | 64.10 | 56.28 | 35.20 | 77.51 | 61.42 |
0 | GPT-3.5 + CoT, UVa & NIH (Xiong et al., 2024) | GPT-3.5-16k-0613 (Brown et al., 2020) | -- | -- | 72.91 | 65.04 | 55.25 | 36.00 | 74.27 | 60.69 |
0 | MEDITRON + MedRAG, UVa & NIH (Xiong et al., 2024) | MEDITRON-70B (Chen et al., 2023) | MedCorp | RRF-4 | 65.38 | 49.57 | 52.67 | 56.40 | 76.86 | 60.18 |
0 | Orca-3 + CoT, Microsoft (Mitra et al., 2024) | Orca-3-7B (Mitra et al., 2024) | -- | -- | 71.35 | 55.38 | 51.33 | 27.80 | 75.24 | 56.22 |
0 | MEDITRON + CoT, UVa & NIH (Xiong et al., 2024) | MEDITRON-70B (Chen et al., 2023) | -- | -- | 64.92 | 51.69 | 46.74 | 53.40 | 68.45 | 57.04 |
0 | Llama-2 + MedRAG, UVa & NIH (Xiong et al., 2024) | Llama-2-70B (Touvron et al., 2023) | MedCorp | RRF-4 | 54.55 | 44.93 | 43.08 | 50.40 | 73.95 | 53.38 |
0 | PMC-LLaMA + MedRAG, UVa & NIH (Xiong et al., 2024) | PMC-LLaMA-13B (Wu et al., 2023) | MedCorp | RRF-4 | 52.53 | 42.58 | 48.29 | 56.00 | 65.21 | 52.92 |
0 | PMC-LLaMA + CoT, UVa & NIH (Xiong et al., 2024) | PMC-LLaMA-13B (Wu et al., 2023) | -- | -- | 52.16 | 44.38 | 46.55 | 55.80 | 63.11 | 52.40 |
0 | Orca-2.5 + CoT, Microsoft (Mitra et al., 2024) | Orca-2.5-7B (Mitra et al., 2024) | -- | -- | 63.91 | 51.37 | 43.65 | 29.60 | 71.04 | 51.92 |
0 | Llama-2 + CoT, UVa & NIH (Xiong et al., 2024) | Llama-2-70B (Touvron et al., 2023) | -- | -- | 57.39 | 47.84 | 42.60 | 42.20 | 61.17 | 50.24 |
0 | Orca-2.5 + MedRAG, Microsoft (Mitra et al., 2024) | Orca-2.5-7B (Mitra et al., 2024) | MedCorp | RRF-4 | 53.72 | 37.08 | 39.23 | 19.00 | 69.09 | 43.62 |
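
The Average column appears to be the unweighted mean of the five benchmark accuracies in each row. That reading is an assumption: the table does not state it, but it reproduces the listed values. A minimal Python sketch, using the scores of the GPT-4 + MedRAG row and a hypothetical helper named `macro_average`:

```python
from statistics import mean

# Scores copied from the GPT-4 + MedRAG row of the table above.
scores = {
    "MMLU-Med": 87.24,
    "MedQA-US": 82.80,
    "MedMCQA": 66.65,
    "PubMedQA*": 70.60,
    "BioASQ-Y/N": 92.56,
}


def macro_average(per_task: dict[str, float]) -> float:
    """Unweighted mean over tasks, rounded to two decimals.

    Assumption: every benchmark counts equally, regardless of how many
    questions it contains. `macro_average` is an illustrative name, not
    something taken from the leaderboard.
    """
    return round(mean(per_task.values()), 2)


print(macro_average(scores))  # matches the 79.97 listed in the Average column
```

Recomputing from the rounded per-task scores gives a value 0.01 away from the listed average in a few rows (for example Llama-3 + MedRAG), which would be consistent with the listed averages having been computed from unrounded scores.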