Clinical decision support system for evidence-based medicine

2025-06-28 16:30:49 - Adil Khan

Project Title

Project Area of Specialization: Artificial Intelligence

Project Summary

Evidence-Based Medicine (EBM) is a form of medical practice that aims to improve decision making by emphasizing the use of evidence from well-designed and well-conducted research. It ensures quality healthcare by using the best available information to answer questions in clinical practice. The purpose of this project is to develop an optimized search strategy for retrieving literature evidence to answer clinical questions. The database used is PubMed, which holds over 30 million citations from sources such as MEDLINE, books, and journals related to life sciences, biomedicine, and health information. PubMed offers a vast range of literature on biomedical topics through an interface that is efficient and easy to use. However, its major drawback is that, as the biomedical literature grows, it has become challenging for users to find relevant material quickly. At least one-third of PubMed queries return more than 100 citations [1]. We aim to tackle this issue by developing a search system that returns a more concise and relevant set of articles for the query entered.

We have opted to use the PICO framework to give the user an interface for forming a precise query, so that the results are relevant. Moreover, we plan to train several neural networks and classifiers to determine the best approach for automatically extracting PICO elements from PubMed articles. We plan to implement LSTM-CRF, CNN, and Naïve Bayes models to detect PICO elements in a given text. LSTM-CRF has shown promising results in detecting PICO elements and can be integrated with PubMed's system to optimize the search process.

Project Objectives

To develop a search strategy that helps users access PubMed efficiently and productively.
To provide a system that supports medical research and answers medical queries, helping clinicians and researchers.
To provide users with an interface that allows them to define their questions using PICO.
To provide an optimized version of PubMed that retrieves and presents relevant medical documents accurately.
To provide an optimized version of the PICO framework by building a database for faster retrieval of articles.

Project Implementation Method

First, we extract documents from the PubMed library and parse the XML to retrieve each document's PMID (its unique identifier), title, and abstract. XML parsing lets us pull out exactly the information we need by reading specific tags. Next, we write each abstract to a .txt file named after its PMID, so that every file can be traced back to its article in the later stages of the project. We then read the abstracts from these .txt files and convert them into Word2vec embeddings. Word2vec is a two-layer neural network that processes text by "vectorising" words: its input is a text corpus and its output is a set of feature vectors that represent the words in that corpus. Because neural networks operate on vectors, we can then train a network on these embeddings, and we will compare several models to choose the one that gives the best results. The neural networks we will train are Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), and the classifier is Naive Bayes. The most important part of the project is determining which sentence of an abstract belongs to which PICO class. The PICO method (or framework) is a mnemonic used to frame and answer a clinical or healthcare-related question in evidence-based practice, especially Evidence-Based Medicine; in systematic reviews, for example, it is often used to establish literature search strategies. Each sentence of a document will be assigned to one of the four classes, which feeds into the next phase of the project.
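
As a rough illustration of the sentence-classification step, the following Python sketch implements a small multinomial Naive Bayes classifier with add-one smoothing. It is a simplification and not the project's actual model: it uses plain bag-of-words counts instead of Word2vec vectors, the class name `NaiveBayesSentenceClassifier` is hypothetical, and the training sentences are invented toy examples.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentenceClassifier:
    """Multinomial Naive Bayes over bag-of-words with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)       # sentences per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            tokens = sentence.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, sentence):
        # Words never seen in training are ignored.
        tokens = [t for t in sentence.lower().split() if t in self.vocab]
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                # Add-one (Laplace) smoothed log likelihood.
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy sentences (assumption); I and C are merged into "I".
TRAIN = [
    ("adults with type 2 diabetes were enrolled", "P"),
    ("patients aged over 60 with hypertension", "P"),
    ("participants received metformin or placebo daily", "I"),
    ("the intervention group received insulin therapy", "I"),
    ("the primary outcome was change in blood pressure", "O"),
    ("mortality and hospital readmission were measured", "O"),
]

if __name__ == "__main__":
    clf = NaiveBayesSentenceClassifier().fit(
        [s for s, _ in TRAIN], [label for _, label in TRAIN]
    )
    print(clf.predict("children with asthma were enrolled"))
```

A baseline of this kind is mainly useful as a reference point when comparing against the LSTM and CNN models.
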
In the next phase, we build a database that stores each document along with its PMID and, most importantly, the characteristics obtained from the PICO framework. For example, a document stored under its PMID contains the text together with a record of which PICO elements it includes, such as only P and O. Once the database holds these PICO features, we can search it efficiently, and this is how we achieve an optimized search. Finally, the user interacts with the system through an Application Programming Interface (API) that accepts queries relevant to the medical field. Because the documents are already stored in our local database with their PICO characteristics, we can match queries efficiently and then display the abstracts of all documents relevant to those queries.
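
One possible shape for the PICO database and the search over it is sketched below in Python with an in-memory SQLite database. The table name `pico_sentences`, its columns, the helper functions, and all rows are assumptions made for illustration; the actual schema of the project is not described in this document.

```python
import sqlite3

# Invented example rows: (pmid, PICO label, sentence).
ROWS = [
    ("1001", "P", "Adults with asthma were enrolled."),
    ("1001", "I", "Patients received inhaled budesonide."),
    ("1001", "O", "Exacerbation rate was reduced."),
    ("1002", "P", "Children with asthma were recruited."),
    ("1002", "I", "Participants received montelukast."),
    ("1003", "P", "Adults with type 2 diabetes."),  # article with only P and O
    ("1003", "O", "HbA1c decreased."),
]


def build_database(rows):
    """Create an in-memory table of (pmid, label, sentence) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pico_sentences (pmid TEXT, label TEXT, sentence TEXT)"
    )
    conn.executemany("INSERT INTO pico_sentences VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, query):
    """Return sorted PMIDs matching every PICO element given in `query`.

    `query` maps a label ("P", "I" or "O") to a search term. An article
    matches when, for each label, at least one of its sentences with that
    label contains the term (case-insensitive substring match).
    """
    matches = None
    for label, term in query.items():
        cur = conn.execute(
            "SELECT DISTINCT pmid FROM pico_sentences "
            "WHERE label = ? AND LOWER(sentence) LIKE ?",
            (label, f"%{term.lower()}%"),
        )
        pmids = {row[0] for row in cur}
        matches = pmids if matches is None else matches & pmids
    return sorted(matches or [])


if __name__ == "__main__":
    conn = build_database(ROWS)
    print(search(conn, {"P": "asthma", "I": "budesonide"}))
```

Storing one row per labelled sentence keeps the query simple: each PICO field the user fills in becomes one filtered lookup, and the results are intersected.
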

Benefits of the Project

Practising Evidence-Based Medicine (EBM) through PubMed is a time-consuming and demanding process for clinicians. One of the main concerns in systematic reviews is that clinical queries must be precise. There is therefore a need to optimize the search and retrieval of articles from PubMed, and we plan to do so by introducing the PICO framework. We have found a dataset in which the PICO elements are identified: participants (P), interventions (I) / comparators (C), and outcomes (O). We will use this dataset to train our models: Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and a Naive Bayes classifier. We will provide users with an interface where they can break their queries down into the elements of the PICO framework. This will yield an optimized version of PubMed in which users can retrieve relevant medical documents accurately.

Technical Details of Final Deliverable

We retrieved medical papers and journal articles using the Entrez API. The Entrez Programming Utilities (E-utilities) are a set of server-side programs that give users an interface to the Entrez query and database system. The API provides access to all Entrez databases, including PubMed, and we used it to extract articles from the PubMed library. We passed parameters to search for articles related to one disease and then fetched those articles.
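
The search-then-fetch pattern can be sketched in Python as two request URLs: an ESearch call that returns matching PMIDs and an EFetch call that returns the records as XML. The helper names and the example search term are assumptions; the sketch only builds the URLs and does not send any request.

```python
from typing import Sequence
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term: str, retmax: int = 20) -> str:
    """URL for ESearch: returns the PMIDs that match a query term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(pmids: Sequence[str]) -> str:
    """URL for EFetch: returns the records (as XML) for the given PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical disease term and placeholder PMIDs.
    print(build_esearch_url("diabetes mellitus", retmax=5))
    print(build_efetch_url(["00000001", "00000002"]))
```

The resulting URLs can be requested with any HTTP client; the XML returned by EFetch is what the parsing step consumes.
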

After retrieval, we parse the XML of the retrieved records and write the abstracts to .txt files. Each .txt file is named after the article's PMID (PubMed ID), the unique identifier that PubMed assigns to every article in its database.
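
A minimal Python sketch of this step, using only the standard library, is shown here. The sample record is invented and heavily trimmed, although its element names follow the PubMed XML layout; the function names are hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Invented, heavily trimmed record in the PubMed EFetch XML layout.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <ArticleTitle>Example trial title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_text):
    """Return a list of (pmid, title, abstract) tuples from EFetch XML."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID", default="")
        title = article.findtext(
            "MedlineCitation/Article/ArticleTitle", default=""
        )
        # Structured abstracts have several AbstractText parts; join them.
        parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        abstract = " ".join("".join(p.itertext()).strip() for p in parts)
        if pmid and abstract:  # skip records without an abstract
            records.append((pmid, title, abstract))
    return records


def write_abstracts(records, out_dir):
    """Write each abstract to <out_dir>/<PMID>.txt and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for pmid, _title, abstract in records:
        path = out / f"{pmid}.txt"
        path.write_text(abstract, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_abstracts(parse_articles(SAMPLE_XML), tmp):
            print(p.name, "->", p.read_text(encoding="utf-8"))
```
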

The next step was to vectorize the text files by converting each word in the abstracts to vector form. We use Word2vec for this. Its advantage over the alternatives considered is that Word2vec retains the semantic meaning of the words found in the documents. The vectors are also small, so there is no need for huge vectors. The Python Gensim library will be used to implement Word2vec on the retrieved abstracts.

The first step in the vectorization process is to create a corpus. In our case, the corpus is the set of extracted medical articles, which are stored in a single variable. We then preprocess the data: all uppercase letters are converted to lowercase, and extra spaces, digits, and special characters are removed. Next, we tokenize the data by splitting articles into sentences and sentences into words. Finally, we use the Word2Vec class of the Gensim library, specifying parameters such as the minimum number of times a word must occur to be included in the vocabulary. These steps produced the Word2vec model of the articles retrieved from PubMed.
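
The preprocessing and training steps might look roughly like the Python sketch below. The sentence splitter is a naive regular expression chosen for brevity, the function names are hypothetical, and the Word2Vec parameter values are placeholders rather than the settings used in the project. The parameter names follow Gensim 4.x (older releases use `size` instead of `vector_size`).

```python
import re


def preprocess(article):
    """Lowercase, split into sentences, strip digits and special characters,
    and return a list of token lists (one list per sentence)."""
    text = article.lower()
    # Naive split on ., ! or ? followed by whitespace (an assumption; a
    # dedicated sentence tokenizer would handle abbreviations better).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    tokenised = []
    for sentence in sentences:
        cleaned = re.sub(r"[^a-z\s]", " ", sentence)  # drop digits and symbols
        tokens = cleaned.split()                      # collapses extra spaces
        if tokens:
            tokenised.append(tokens)
    return tokenised


def train_word2vec(sentences, min_count=1):
    """Train a Word2Vec model on tokenised sentences with Gensim."""
    # Imported lazily so that preprocess() works without Gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(
        sentences=sentences,
        vector_size=100,      # placeholder embedding size
        window=5,             # placeholder context window
        min_count=min_count,  # minimum occurrences for a word to be kept
        workers=1,
    )


if __name__ == "__main__":
    corpus = preprocess("Aspirin 100 mg was given.  Pain DECREASED (p<0.05)!")
    print(corpus)
```

Sentences are split before punctuation is removed, because stripping special characters first would erase the sentence boundaries.
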

After searching, we found a corpus with multi-level annotations of Patients, Interventions, and Outcomes, which will be used to train our neural networks. The corpus consists of 5,000 richly labelled abstracts of medical literature. Of these, 200 abstracts were annotated by people with extensive medical knowledge, who categorized the abstract sentences into three labels: P, I, or O. Repetition of the same label within an abstract is handled by assigning binary labels and grouping together sub-spans that are instances of the same information. The I and C elements are merged because both belong to the same semantic group. The labels are assigned word by word. The first column of the dataset is the sentence number, the second column is the word, the third column gives the word's part of speech (for example preposition, verb, or noun), and the last column gives the label assigned to the word. Four labels appear in the dataset: P, I, O, and N, where N stands for "none of these" and is assigned to words that do not belong to any PICO element. Below is the sample of the dataset that we used.
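
To make the four-column layout concrete, here is a minimal Python sketch that reads rows of this shape and groups them by sentence. The rows in it are invented for illustration and are not taken from the dataset; whitespace-separated columns and the function names are also assumptions.

```python
from collections import Counter, defaultdict

# Invented rows in the four-column layout described above:
# sentence number, word, part-of-speech tag, label (P, I, O or N).
SAMPLE_ROWS = """\
1 Adults NNS P
1 with IN P
1 asthma NN P
1 received VBD N
1 budesonide NN I
2 Exacerbations NNS O
2 decreased VBD N
"""


def read_sentences(text):
    """Group token rows by sentence number, preserving token order."""
    sentences = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sent_id, word, pos, label = line.split()
        sentences[int(sent_id)].append((word, pos, label))
    return dict(sentences)


def label_counts(tokens):
    """Count how many tokens in one sentence carry each label."""
    return Counter(label for _word, _pos, label in tokens)


if __name__ == "__main__":
    for sent_id, tokens in read_sentences(SAMPLE_ROWS).items():
        words = " ".join(word for word, _pos, _label in tokens)
        print(sent_id, words, dict(label_counts(tokens)))
```
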

Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Medical, Health
Core Technology: Artificial Intelligence (AI)
Other Technologies:
Sustainable Development Goals: Good Health and Well-Being for People, Decent Work and Economic Growth, Partnerships to achieve the Goal

Required Resources

Item Name	Type	No. of Units	Per Unit Cost (in Rs)	Total (in Rs)
HP S700 Pro 256GB SSD Series 2.5	Equipment	1	7532	7532
			Total (in Rs)	7532
