Evaluating foundation models, including large language models (LLMs), is at least as crucial as training new ones: it is how we understand a model's strengths and weaknesses in production environments. In the medical domain, however, several key barriers stand in the way. Current benchmarking and evaluation efforts, especially for medical question answering (QA), rarely address long-form evaluation. Most medical QA benchmarks use a multiple-choice format, which does not reflect real-world usage such as generating long-form answers in medical AI assistants and chatbots or producing medical reports and summaries. Additionally, many well-known medical QA benchmarks may already appear in the training data of recent models, and those that have not been leaked sometimes contain label errors or are outdated [1]. This complicates the evaluation of medical AI systems. While current public leaderboards are valuable for evaluating general-purpose models, they fall short in the medical domain, which has its own nuances and requires more specific evaluation approaches conducted by experts. Addressing this challenge is no small feat: building robust medical benchmarks and engaging users, particularly medical experts, at scale is both costly and time-consuming.
We are excited to introduce Lavita’s Medical Evaluation Sphere, an ongoing effort to address these issues. Inspired by initiatives like Chatbot Arena by lmsys.org, we designed the Medical Evaluation Sphere to enable users to participate in the real-time evaluation of medical foundation models on various medical tasks, starting with medical question answering. Our goal is to build a collaborative space for conducting high-quality and comprehensive medical evaluations at scale. This evaluation will help us develop more reliable and accurate medical foundation models, ensuring their responsible deployment in various downstream medical applications.
How it works
In the Medical Evaluation Sphere, the process begins with asking a medical question. When the user presses the send button, two answers are generated by two anonymous LLMs randomly selected from our pool of models. This pool mixes open, instruction-tuned medical LLMs with state-of-the-art general-purpose commercial models. To control inference costs, the pool is initially small; as more users engage with the platform, we will add more models. Once the answers are generated, users review them and select the one they believe is of higher quality along dimensions such as helpfulness and correctness. If they find both answers equally good or equally bad, they can instead select “tie” or “neither” and submit their vote. Users also have the option of a non-anonymous chat with two models of their choice to compare their quality; votes cast in non-anonymous mode do not count toward the final leaderboard. Finally, users can hold multi-turn conversations with the models and submit their vote after the conversation is complete.
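To make the anonymous battle flow concrete, here is a minimal Python sketch of a single round under some stated assumptions: the model pool, the `generate` callable, and the `Battle` structure are illustrative placeholders, not our actual implementation.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical model pool; the real pool mixes open, instruction-tuned
# medical LLMs with commercial general-purpose models.
MODEL_POOL = ["medical-llm-a", "medical-llm-b", "general-llm-x", "general-llm-y"]

@dataclass
class Battle:
    question: str
    model_a: str              # hidden from the user until after voting
    model_b: str
    answer_a: str
    answer_b: str
    vote: Optional[str] = None  # "model_a", "model_b", "tie", or "neither"

def start_battle(question: str, generate) -> Battle:
    """Pick two distinct anonymous models and generate both answers."""
    model_a, model_b = random.sample(MODEL_POOL, 2)
    return Battle(
        question=question,
        model_a=model_a,
        model_b=model_b,
        answer_a=generate(model_a, question),
        answer_b=generate(model_b, question),
    )

def record_vote(battle: Battle, choice: str) -> Battle:
    """Attach the user's preference; anonymous-mode votes feed the leaderboard."""
    assert choice in {"model_a", "model_b", "tie", "neither"}
    battle.vote = choice
    return battle
```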
Although we cannot restrict what users ask, we encourage them to focus on medical and health-related inquiries. When evaluating votes and updating our leaderboard, we filter out non-medical and non-health-related questions using a prompt designed and verified by our human evaluators.
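As a rough illustration of how such a filter can be applied, the sketch below asks a classifier LLM a yes/no question about each vote's underlying question. The prompt text and the `ask_llm` callable are placeholders, not the human-verified prompt we actually use.

```python
# Illustrative filter: ask a classifier LLM whether a question is medical
# or health-related, and keep only the votes whose question qualifies.
FILTER_PROMPT = (
    "You are screening questions for a medical evaluation platform.\n"
    "Question: {question}\n"
    "Is this question about medicine, health, or healthcare? Answer yes or no."
)

def is_medical(question: str, ask_llm) -> bool:
    """Return True if the classifier LLM labels the question as medical."""
    reply = ask_llm(FILTER_PROMPT.format(question=question))
    return reply.strip().lower().startswith("yes")

def filter_votes(votes, ask_llm):
    """Drop votes whose underlying question is not medical or health-related."""
    return [v for v in votes if is_medical(v["question"], ask_llm)]
```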
We use the Elo rating system [2] to rank models on our leaderboard based on user votes. Elo estimates the relative skill of two players from head-to-head results and is widely used in competitive games such as chess. After collecting votes, we perform post-processing before updating the leaderboard: for example, we check whether a model revealed its name in a response, which would compromise anonymity, and disregard such votes. As mentioned earlier, votes on non-medical and non-health-related questions are also ignored. We currently plan to update the leaderboard offline and periodically, roughly every two weeks, depending on the volume of user votes, and we will release the vote data and scripts needed to reproduce the rankings.
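For readers unfamiliar with Elo, the sketch below shows one standard way to turn a sequence of pairwise votes into ratings. The starting rating, K-factor, and tie handling are illustrative choices; the vote data and scripts we release are the authoritative way to reproduce our rankings.

```python
from collections import defaultdict

def compute_elo(battles, k=32, base=1000):
    """Sequentially update Elo ratings from pairwise battle outcomes.

    Each battle is a (model_a, model_b, outcome) tuple, where outcome is
    1.0 if model_a won, 0.0 if model_b won, and 0.5 for a tie. Votes marked
    "neither", name-revealing responses, and non-medical questions are
    assumed to have been filtered out beforehand.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, outcome in battles:
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1 / (1 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        ratings[model_a] += k * (outcome - expected_a)
        ratings[model_b] += k * ((1 - outcome) - (1 - expected_a))
    return dict(ratings)

# Example: three votes between two hypothetical models.
votes = [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5), ("model_y", "model_x", 1.0)]
leaderboard = sorted(compute_elo(votes).items(), key=lambda kv: kv[1], reverse=True)
```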
It is also worth noting that hallucinations and nonsensical responses are to be expected; understanding such behaviors and limitations is precisely why we conduct these evaluations. Furthermore, although our endpoints can handle hundreds of concurrent requests, users may occasionally encounter extended inference times. We kindly ask users to remain patient until the responses are generated and the voting buttons become active. We are committed to continuously improving our platform to make the experience as seamless as possible.
The road ahead
Our vision is to establish Lavita’s Medical Evaluation Sphere as the premier space for evaluating foundation models for medical tasks. As we engage more users, we plan to expand the range of models and diversify the types of tasks and input modalities for model evaluation. We also aim to implement a role-management system to differentiate between lay users and verified medical experts, which will help us better understand the preferences of each group.
Lavita is built on principles of transparency and community engagement, and we are dedicated to advancing research and development in medical AI. Therefore, we will make all votes and full logs of evaluations on the Medical Evaluation Sphere publicly accessible to our community. We invite everyone, including medical experts and non-experts, to join us in this effort. By collaborating, we can better evaluate foundation models for medical tasks, ultimately leading to the development of safer and more reliable models for various applications in the medical domain.
References
[1] Saab, Khaled, et al. “Capabilities of Gemini Models in Medicine.” arXiv preprint arXiv:2404.18416 (2024).
[2] Elo, Arpad E. The Rating of Chessplayers, Past and Present. Arco Publishing, 1978.