Both healthcare providers and patients use the Internet to find medical information quickly. It is therefore not surprising that fertility-focused content has been extensively examined over the years. Unfortunately, a single Google search for the word “infertility” returns millions of results, and this content has not been verified for medical accuracy.
Advances in natural language processing (NLP), a branch of artificial intelligence (AI), have enabled computers to learn human language and use it to communicate. Recently, OpenAI developed an AI chatbot called ChatGPT that allows human users to converse with computer interfaces.
Study: The Possibilities and Dangers of Using Large Language Models to Obtain Clinical Information: ChatGPT Works Strongly as an Infertility Counseling Tool, Despite its Limitations
A recent Fertility and Sterility study used fertility as a domain to test the performance of ChatGPT and evaluate its use as a clinical tool.
Recent evolution of ChatGPT
ChatGPT’s uniqueness can be attributed to its ability to perform linguistic tasks such as writing articles, answering questions, and telling jokes. These features were developed following recent advances in deep learning (DL) algorithms.
For example, Generative Pretrained Transformer 3 (GPT-3) is a DL algorithm known for its scale: a training data set of 57 billion words drawn from various sources and 175 billion parameters.
In November 2022, ChatGPT was first released as an updated version of the GPT-3.5 model. It then became the fastest growing app of all time, and within two months of its release it had over 100 million users.
While ChatGPT has potential as a clinical tool that gives patients access to medical information, using the model for this purpose has some limitations.
As of February 2023, ChatGPT was trained on data through 2021. Therefore, it is not equipped with the latest data. In addition, one of the major concerns about its use is the generation of plagiarized and inaccurate information.
Due to its ease of use and human-like language, patients are tempted to use this application to ask questions about their health and get answers. It is therefore imperative to characterize the performance of this model as a clinical tool and elucidate whether it provides misleading answers.
The current study tested the “February 13th” version of ChatGPT to assess its consistency in answering fertility-related clinical questions that patients might ask the chatbot. ChatGPT’s performance was evaluated across three domains.
The first domain drew on the infertility FAQ page of the Centers for Disease Control and Prevention (CDC) website, which contains 17 frequently asked questions such as “What is infertility?” or “How do doctors treat infertility?”
These questions were entered into ChatGPT during one session. Answers generated by ChatGPT were compared with those provided by the CDC.
The second domain drew on important fertility knowledge surveys. The Cardiff Fertility Knowledge Scale (CFKS) questionnaire, which includes questions on fertility, misconceptions, and risk factors for impaired fertility, was used in this domain. In addition, the Fertility and Infertility Treatment Knowledge Score (FIT-KS) survey questionnaire was used to assess the performance of ChatGPT.
The third domain focused on evaluating the chatbot’s ability to reproduce the clinical standard in providing medical advice. This domain was based on the American Society for Reproductive Medicine (ASRM) Committee Opinion “Optimizing Natural Fertility.”
In the first domain, ChatGPT’s answers about infertility were similar to those provided by the CDC. The average length of the responses from the CDC and from ChatGPT was the same.
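A length comparison of this kind can be sketched in a few lines of Python. The example below is only an illustration: the function name `mean_word_count` and the placeholder strings are assumptions made here, and the study does not describe how it actually counted words.

```python
from statistics import mean


def mean_word_count(answers):
    """Return the average number of whitespace-separated words per answer."""
    return mean(len(answer.split()) for answer in answers)


# Placeholder strings only; these are NOT answers from the CDC or ChatGPT.
reference_answers = [
    "placeholder reference answer number one",
    "placeholder reference answer number two with extra words",
]
chatbot_answers = [
    "placeholder chatbot answer number one",
    "placeholder chatbot answer number two with extra words",
]

reference_mean = mean_word_count(reference_answers)
chatbot_mean = mean_word_count(chatbot_answers)
print(f"reference: {reference_mean:.1f} words, chatbot: {chatbot_mean:.1f} words")
```

Splitting on whitespace is the simplest possible word count; a real analysis would also need a statistical test before concluding that two sets of answers have the same length.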
An analysis of the factual content of ChatGPT’s responses found no significant discrepancies between the CDC data and the answers generated by ChatGPT. No differences in sentiment polarity or subjectivity were observed. Notably, only 6.12% of ChatGPT’s factual statements were identified as inaccurate, and only one statement cited a reference.
In the second domain, ChatGPT achieved high scores, corresponding to the 87th percentile of Bunting’s 2013 international cohort for the CFKS and the 95th percentile of Kudesia’s 2017 cohort for the FIT-KS. For all questions, ChatGPT provided background and a rationale for its answer choices. Additionally, ChatGPT produced only one inconclusive answer, which was considered neither correct nor incorrect.
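To make the percentile figures concrete, the sketch below shows one common way to place a single score within a cohort. It is a minimal illustration: the "at or below" definition, the function name `percentile_rank`, and the cohort values are all assumptions made here, and the study may have derived its percentiles differently.

```python
def percentile_rank(score, cohort_scores):
    """Return the percentage of cohort scores that fall at or below `score`."""
    if not cohort_scores:
        raise ValueError("cohort_scores must not be empty")
    at_or_below = sum(1 for s in cohort_scores if s <= score)
    return 100.0 * at_or_below / len(cohort_scores)


# Hypothetical cohort scores for illustration; not data from either survey.
cohort = [40, 55, 60, 62, 70, 75, 80, 85, 90, 95]
print(percentile_rank(85, cohort))  # 8 of 10 scores are <= 85, so this prints 80.0
```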
In the third domain, ChatGPT reproduced the missing facts for all seven summary statements from “Optimizing Natural Fertility.” For each response, ChatGPT highlighted the facts that had been removed from the statement and did not provide conflicting facts. Results in this domain were consistent across all repeated administrations.
The current study has some limitations, such as evaluating only one version of ChatGPT. Similar models, such as the AI-powered Microsoft Bing and Google Bard, have recently been launched, giving patients access to alternative chatbots. The nature and availability of these models can therefore change rapidly.
ChatGPT provides quick responses, but it may draw on data from unreliable sources. Additionally, the consistency of the model may be affected in subsequent iterations. It is therefore also important to characterize how the model’s responses vary across different updates.