Benchmarking Retrieval-Augmented Generation for Medicine

Guangzhi Xiong♣†, Qiao Jin♡†, Zhiyong Lu♡§, Aidong Zhang♣§
♣University of Virginia, ♡National Library of Medicine, National Institutes of Health
†Equal Contribution, §Co-correspondence

Abstract

While large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still face challenges with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG setting for various medical purposes.

To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of different corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work.

Overall, MedRAG improves the accuracy of six different LLMs by up to 18% over chain-of-thought prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our results show that the combination of various medical corpora and retrievers achieves the best performance. In addition, we discovered a log-linear scaling property and the "lost-in-the-middle" effects in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.

Overview

MIRAGE

MIRAGE is our proposed benchmark for Medical Information Retrieval-Augmented Generation Evaluation, which includes 7,663 questions from five commonly used QA datasets in biomedicine. It adopts four key evaluation settings: 1) zero-shot learning, 2) multi-choice evaluation, 3) retrieval-augmented generation, and 4) question-only retrieval.
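To make the multi-choice evaluation concrete, below is a minimal sketch of a zero-shot accuracy loop over MIRAGE-style data. The benchmark.json layout, field names, and the answer_fn callback are illustrative assumptions; consult the MIRAGE repository for the actual schema.

# Minimal sketch of zero-shot multi-choice evaluation on MIRAGE-style data.
# The file layout and field names are assumptions for illustration only.
import json

def evaluate(answer_fn, benchmark_path="benchmark.json"):
    """answer_fn(question, options) -> a single option letter, e.g., 'A'."""
    with open(benchmark_path) as f:
        # Assumed layout: {dataset: {question_id: {question, options, answer}}}
        benchmark = json.load(f)
    results = {}
    for dataset, questions in benchmark.items():
        correct = 0
        for item in questions.values():
            # Question-only retrieval: only the question text is sent to the
            # retriever; the answer options are shown only to the LLM.
            prediction = answer_fn(item["question"], item["options"])
            correct += int(prediction == item["answer"])
        results[dataset] = correct / len(questions)
    return results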

MedRAG

MedRAG is a systematic toolkit for Retrieval-Augmented Generation (RAG) on medical question answering (QA), covering five corpora, four retrievers, and six LLMs, including both general and domain-specific models. For all LLMs, it concatenates the retrieved snippets from the corpora, prepends them to the question input, and applies chain-of-thought (CoT) prompting to fully leverage the models' reasoning capability.
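To illustrate this pipeline, here is a minimal sketch of how such a RAG prompt can be assembled, with the retrieved snippets concatenated and prepended to the question and a CoT instruction appended. It is an outline under assumed snippet fields ("title", "content"), not the exact MedRAG prompt template.

# Illustrative RAG prompt construction with CoT prompting (not the exact MedRAG template).
def build_rag_prompt(question: str, options: dict, snippets: list) -> str:
    # Concatenate the retrieved snippets into a context block.
    context = "\n".join(
        f"[{i + 1}] {s['title']}: {s['content']}" for i, s in enumerate(snippets)
    )
    choices = "\n".join(f"{key}. {text}" for key, text in options.items())
    # Prepend the context to the question and add a chain-of-thought instruction.
    return (
        "Here are some relevant documents:\n"
        f"{context}\n\n"
        f"Question: {question}\n{choices}\n\n"
        "Think step-by-step, then output the chosen option letter."
    )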

Main Results

Comparison of LLMs

While the best of the other backbone LLMs (GPT-3.5 and Mixtral) only reach an average score of about 61% in the CoT setting, MedRAG improves their performance significantly to around 70%, which is comparable to GPT-4 (CoT). These results suggest the great potential of RAG to enhance the zero-shot capability of LLMs on medical questions, which can be a more efficient choice than larger-scale pre-training. Our results also demonstrate that domain-specific LLMs can exhibit advantages in certain cases.

Table: Benchmark results of different backbone LLMs on MIRAGE. All numbers are accuracy in percentages.

Comparison of Corpora and Retrievers

As shown in the table below, the performance of a RAG system is strongly related to the corpus it uses. The table also shows the variable performance of different retrievers, which can be explained by differences in their training data and strategies. Fusing retrieval results with reciprocal rank fusion (RRF) can effectively improve performance, but does not always do so.

Table: Accuracy (%) of GPT-3.5 (MedRAG) with different corpora and retrievers on MIRAGE. Red and green denote performance decreases and increases compared to CoT (first row). The shade reflects the relative change.
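For reference, RRF scores each document by summing reciprocal ranks across the individual retrievers' result lists; the sketch below shows the standard formulation. The smoothing constant k = 60 is the conventional default and is an assumption here, not necessarily the value used in the paper.

from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs: score(d) = sum_i 1 / (k + rank_i(d))."""
    scores = defaultdict(float)
    for ranking in rankings:                       # one ranked list per retriever
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: RRF-2 fuses two result lists (e.g., from BM25 and MedCPT).
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1", "d4"]])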

Further Discussion

Performance Scaling

The figure below shows the scaling curves of MedRAG on each task in MIRAGE with different numbers of snippets k ∈ {1, 2, 4, ..., 64}. On MMLU-Med, MedQA-US, and MedMCQA, we see roughly log-linear curves in the scaling plots for k ≤ 32. Compared with the three examination tasks, PubMedQA* and BioASQ-Y/N are relatively easier for MedRAG, since their ground-truth supporting information can be found in PubMed.

Figure: MedRAG accuracy with different numbers of retrieved snippets. Red dotted lines denote CoT performance.
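To quantify the log-linear trend for a given task, one can regress accuracy on log2(k) over k ≤ 32, as in the sketch below. The accuracy values shown are placeholders for illustration only, not results from the paper.

import numpy as np

# Placeholder accuracies for illustration only; substitute measured per-task values.
ks = np.array([1, 2, 4, 8, 16, 32])
accs = np.array([0.60, 0.64, 0.67, 0.69, 0.72, 0.74])

# Fit accuracy ≈ a + b * log2(k) and report the goodness of fit.
b, a = np.polyfit(np.log2(ks), accs, deg=1)
pred = a + b * np.log2(ks)
r2 = 1.0 - np.sum((accs - pred) ** 2) / np.sum((accs - accs.mean()) ** 2)
print(f"accuracy ≈ {a:.3f} + {b:.3f}·log2(k), R² = {r2:.3f}")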

Position of Ground-truth Snippet

The following figure shows how model accuracy changes with the location of the ground-truth snippet in the context. From the figure, we can see a clear U-shaped (decreasing-then-increasing) pattern in accuracy with respect to the position of the ground-truth snippet, i.e., a "lost-in-the-middle" effect.

Figure: The relation between QA accuracy and the position of the ground-truth snippet in the LLM context.
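A minimal sketch of such a probing setup: place the ground-truth snippet at a chosen position among the other retrieved snippets before building the prompt, then measure accuracy per position. The helper below and the reference to build_rag_prompt from the sketch above are illustrative assumptions, not the paper's exact code.

def context_with_ground_truth(ground_truth, distractors, position):
    """Insert the ground-truth snippet at a given 0-based position in the context."""
    context = list(distractors)
    context.insert(position, ground_truth)
    return context

# Sweep positions to probe the "lost-in-the-middle" effect:
# for pos in range(len(distractors) + 1):
#     prompt = build_rag_prompt(question, options,
#                               context_with_ground_truth(gt_snippet, distractors, pos))
#     ...query the LLM and record per-position accuracy...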

Proportion in the MedCorp Corpus

The figure below displays the proportions of the four different sources in the MedCorp corpus and of the sources actually retrieved among the top 64 snippets for each task in MIRAGE, which reveals task-specific patterns.

Figure: The overall corpus composition of MedCorp and the proportions of sources actually retrieved for different tasks.
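Such proportions can be computed with a simple count over the sources of the top-64 retrieved snippets, as sketched below; the 'source' field name is an assumption.

from collections import Counter

def source_proportions(snippets):
    """Proportion of each corpus source among the retrieved snippets;
    assumes each snippet dictionary carries a 'source' field."""
    counts = Counter(s["source"] for s in snippets)
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}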

Practical Recommendations

Corpus selection

The results above indicate that PubMed and the MedCorp corpus are the only corpora with which MedRAG can outperform CoT on all tasks in MIRAGE. As a large-scale corpus, PubMed serves as a suitable document collection for various kinds of medical questions. If resources permit, the MedCorp corpus can be a more comprehensive and reliable choice: nearly all MedRAG settings using the MedCorp corpus show improved performance (green-coded cells) compared to the CoT prompting baseline. In general, single corpora other than PubMed are not recommended for medical QA due to their limited volumes of medical knowledge, though they can still be beneficial for specific tasks such as medical examination QA.

Retriever selection

Among the four individual retrievers used in MedRAG, MedCPT is the most reliable one, consistently outperforming the other candidates with a higher average score on MIRAGE. BM25 is a strong retriever as well, which is also supported by other evaluations. The fusion of retrievers can provide robust performance but must be used with caution regarding which retrievers are included. For the PubMed corpus recommended above, an RRF-2 retriever that combines the results of BM25 and MedCPT is a good choice, since these two perform better than the other retrievers on PubMed snippets. For the MedCorp corpus, both RRF-2 and RRF-4 are reliable choices, as all four individual retrievers in MedRAG benefit from this corpus.

LLM selection

Currently, GPT-4 is the best model, with about 80% accuracy on MIRAGE. However, it is much more expensive than the other backbone LLMs. GPT-3.5 can be a more cost-efficient choice than GPT-4 and shows strong capability in following MedRAG instructions. For high-stakes scenarios such as medical diagnosis, where patient privacy is a key concern, the best open-source model, Mixtral, which can be deployed locally and run offline, could be a viable option.

BibTeX

@inproceedings{xiong-etal-2024-benchmarking,
    title = "Benchmarking Retrieval-Augmented Generation for Medicine",
    author = "Xiong, Guangzhi  and
      Jin, Qiao  and
      Lu, Zhiyong  and
      Zhang, Aidong",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.372",
    doi = "10.18653/v1/2024.findings-acl.372",
    pages = "6233--6251",
    abstract = "While large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still face challenges with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG setting for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of different corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work. Overall, MedRAG improves the accuracy of six different LLMs by up to 18{\%} over chain-of-thought prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our results show that the combination of various medical corpora and retrievers achieves the best performance. In addition, we discovered a log-linear scaling property and the {``}lost-in-the-middle{''} effects in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.",
}