Erick Lu
March 14, 2020 - GitHub Repo
Here, I will show you how NCBI E-utilities can be used to search for and download PubMed abstracts. I use Python in conjunction with the NCBI E-utilities to download all the abstracts corresponding to a given search term and simultaneously parse the information contained in each abstract into a data science-friendly format.
Depending on what you search for in PubMed, you could be presented with thousands of abstracts that contain the keyword you used for your query (for example, try searching "cancer"). Finding the information you’re looking for can get a bit tedious when you have to manually click through each page of search results.
Using the NCBI E-utilities (Entrez Programming Utilities, https://www.ncbi.nlm.nih.gov/books/NBK25499/), you can retrieve and download abstracts associated with a PubMed search without having to sift through the user interface. Even better, this tool doesn't require any software; it's completely URL-based. You craft "search" and "fetch" commands as URLs and open them in your browser window to access the abstracts.
We can automate the download process by programming a script in Python to construct the URLs, execute the "search" and "fetch" commands, and parse each part of the abstract (Authors, Journal, Date of publication, etc.) into a data file for downstream analysis. Text from each abstract can be analyzed to quickly extract numerical data or quantitative results.
Below, I will give a brief tutorial about how the tools work, and the code needed to automate the process using Python.
The two main E-util functions you will use are esearch and efetch.

First, esearch runs a keyword search command on the PubMed database and retrieves IDs for each of the abstracts corresponding to the search. The actual information associated with the abstracts does not show up, only the IDs. You're also given a query key and web environment ID.

Then, you input the query key and web environment ID into an efetch call, which will "fetch" all the abstracts for that specific esearch query.
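In template form, the two calls are just URLs (the angle-bracketed fields are placeholders you fill in yourself):

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=<query>&usehistory=y
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=<querykey>&WebEnv=<webenv>&retmode=text&rettype=abstract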
Let's say I want to search PubMed for "P2RY8", which is a receptor on B lymphocytes that I worked on during my PhD.
Here is the URL required to execute a PubMed esearch for P2RY8: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=P2RY8&retmax=50&usehistory=y
This was crafted by putting the following parameters together:

- http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi? is the backbone of the esearch function.
- db=pubmed specifies that we will be searching the PubMed database.
- term=P2RY8 specifies what we will be searching PubMed for. Change this field to whatever you want to search for.
- retmax=50 specifies the maximum number of IDs the search will return.
- usehistory=y will provide you with a QueryKey and WebEnv ID that will let you fetch the abstracts from this search.

Copying and pasting the full URL into my web browser results in a webpage that looks like this (XML output):
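Trimmed down, the XML response has this general shape (the Count, QueryKey, and WebEnv values are the ones from this search; the IDs are placeholders):

<eSearchResult>
  <Count>56</Count>
  <RetMax>50</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>NCID_1_5757184_130.14.18.48_9001_1579844644_2015135327_0MetA0_S_MegaStore</WebEnv>
  <IdList>
    <Id>...</Id>
    <Id>...</Id>
  </IdList>
</eSearchResult>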
The next step is to execute an efetch command by constructing a new URL. Using the WebEnv and QueryKey information given in the above esearch result, I will type the following efetch command into my browser:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&webenv=NCID_1_5757184_130.14.18.48_9001_1579844644_2015135327_0MetA0_S_MegaStore&retmode=text&rettype=abstract

Note: If you're trying this right now, your esearch will have given you a different webenv variable. Make sure to input YOUR webenv variable in the efetch URL for it to work!
Here is an explanation for each aspect of the link I constructed above:

- http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi? is the backbone of an efetch command. Notice that the only difference between this and an esearch command is the part after "/eutils/".
- db=pubmed specifies the database, again.
- query_key=1 specifies the number that was given in the "querykey" field of the esearch result.
- webenv=NCID_1_5757184_130.14.18.48_9001_1579844644_2015135327_0MetA0_S_MegaStore specifies the ID that was given in the esearch result.
- retmode=text specifies that I want the abstracts written out as plain text.
- rettype=abstract specifies that I want abstracts shown, as opposed to the other types of info that a PubMed search can return.

After inputting this link, you should observe the following output as a plaintext webpage:
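Each record in the plaintext output follows roughly this layout (a schematic with placeholder text, not a real record; note that the sections are separated by blank lines):

1. Journal Name. 2020 Jan 1;1(1):1-10.

Title of the article.

Author A(1), Author B(2).

Author information:
(1)First affiliation. (2)Second affiliation.

Text of the abstract...

DOI: 10.xxxx/xxxxx

PMID: 00000000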
Now, I can simply ctrl-F to sift through all 56 matching abstracts on a single page. You can apply this simple two-step process whenever you're tasked with searching through absurd amounts of PubMed results.
Below, I show you how to perform this process in Python.
The packages we need include csv to write to a csv file, re to use regular expressions to extract information from esearch results, urllib.request to open and read URLs, and sleep from the time module to wait a couple of seconds between requests so we don't get blocked.
import csv
import re
import urllib.request
from time import sleep
Now, we need to specify our parameters for the esearch and efetch calls. We can store each of the settings in its own variable in order to make it easy to customize the calls in the future.
query = 'P2RY8'
# common settings between esearch and efetch
base_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
db = 'db=pubmed'
# esearch specific settings
search_eutil = 'esearch.fcgi?'
search_term = '&term=' + query
search_usehistory = '&usehistory=y'
search_retmode = '&retmode=xml'  # xml is the default retmode; the regexes below parse this XML
We can construct the search URL by simply combining each of the variables into one long string:

search_url = base_url+search_eutil+db+search_term+search_usehistory+search_retmode
print(search_url)
Now that we have the full esearch URL constructed, we can open it using urllib.request.urlopen():
f = urllib.request.urlopen(search_url)
search_data = f.read().decode('utf-8')
search_data
Above, you can compare the raw text output of the esearch result with the image of what it looks like in the browser. You can see that the same syntax is used for the WebEnv and QueryKey sections. We will also need the total number of abstracts corresponding to the query if we want to be able to retrieve all of them. We can extract these items from the output using the regexes below:
# obtain the total abstract count
total_abstract_count = int(re.findall(r"<Count>(\d+?)</Count>", search_data)[0])
# obtain the webenv and querykey settings for the efetch command
fetch_webenv = "&WebEnv=" + re.findall(r"<WebEnv>(\S+)</WebEnv>", search_data)[0]
fetch_querykey = "&query_key=" + re.findall(r"<QueryKey>(\d+?)</QueryKey>", search_data)[0]
total_abstract_count
We observe that there are 56 total abstracts that match the keyword "P2RY8".
fetch_webenv
fetch_querykey
Now that we have the WebEnv and QueryKey, we can run an efetch command to obtain the abstracts. In order to do so, we must assign values to each of the parameters in a similar manner as we did above for esearch.
# other efetch settings
fetch_eutil = 'efetch.fcgi?'
retmax = 20
retstart = 0
fetch_retstart = "&retstart=" + str(retstart)
fetch_retmax = "&retmax=" + str(retmax)
fetch_retmode = "&retmode=text"
fetch_rettype = "&rettype=abstract"
The fully constructed efetch command using the above parameters, which should fetch 20 of the 56 total abstracts, is below:
fetch_url = base_url+fetch_eutil+db+fetch_querykey+fetch_webenv+fetch_retstart+fetch_retmax+fetch_retmode+fetch_rettype
print(fetch_url)
Now, we can open the url the same way we did for esearch:
f = urllib.request.urlopen(fetch_url)
fetch_data = f.read().decode('utf-8')
fetch_data[1:3000]
Examining the text output, we see that individual abstracts are separated by three newlines in a row (\n\n\n). We can therefore use split() to generate a list in which each item is a separate abstract.
# splits the data into individual abstracts
abstracts = fetch_data.split("\n\n\n")
len(abstracts)
Because we had set retmax = 20, we obtained 20 abstracts. Let's take a closer look at an individual abstract:
# print out the first abstract
abstracts[0]
We observe that the sections of the abstract are separated by two newlines in a row, denoted by \n\n. We can again use split() to further categorize each section of the abstract.
split_abstract = abstracts[1].split("\n\n")
split_abstract
len(split_abstract)
We see that the abstract has been split into 7 different items of information, corresponding to the journal name, date, authors, etc. Knowing this, we can construct a data frame in which each row represents an abstract, and each column represents a section of the abstract (journal, date, authors, etc.).
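As a quick sketch of that idea (assuming the pandas package, which isn't otherwise used in this post), the cleanly split abstracts can be loaded into a data frame like so:

import pandas as pd

# keep only the abstracts that split into the seven expected sections
columns = ['Journal', 'Title', 'Authors', 'Author_Information', 'Abstract', 'DOI', 'Misc']
rows = [a.split("\n\n") for a in abstracts if len(a.split("\n\n")) == 7]
df = pd.DataFrame(rows, columns=columns)
df.head()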
There were 56 total abstracts corresponding to the keyword search P2RY8. Above, we have only processed the first 20. In order to obtain all of the abstracts, we can construct a loop that calls efetch while incrementing retstart by 20 each iteration, until all the abstracts have been downloaded and added to the table.
import csv
import re
import urllib.request
from time import sleep

query = "P2RY8"

# common settings between esearch and efetch
base_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
db = 'db=pubmed'

# esearch settings
search_eutil = 'esearch.fcgi?'
search_term = '&term=' + query
search_usehistory = '&usehistory=y'
search_retmode = '&retmode=xml'

# call the esearch command for the query and read the web result
search_url = base_url+search_eutil+db+search_term+search_usehistory+search_retmode
print("this is the esearch command:\n" + search_url + "\n")
f = urllib.request.urlopen(search_url)
search_data = f.read().decode('utf-8')

# extract the total abstract count
total_abstract_count = int(re.findall(r"<Count>(\d+?)</Count>", search_data)[0])

# efetch settings
fetch_eutil = 'efetch.fcgi?'
retmax = 20
retstart = 0
fetch_retmode = "&retmode=text"
fetch_rettype = "&rettype=abstract"

# obtain webenv and querykey settings from the esearch results
fetch_webenv = "&WebEnv=" + re.findall(r"<WebEnv>(\S+)</WebEnv>", search_data)[0]
fetch_querykey = "&query_key=" + re.findall(r"<QueryKey>(\d+?)</QueryKey>", search_data)[0]

# call efetch commands using a loop until all abstracts are obtained
run = True
all_abstracts = list()
loop_counter = 1

while run:
    print("this is efetch run number " + str(loop_counter))
    loop_counter += 1
    fetch_retstart = "&retstart=" + str(retstart)
    fetch_retmax = "&retmax=" + str(retmax)
    # create the efetch url
    fetch_url = base_url+fetch_eutil+db+fetch_querykey+fetch_webenv+fetch_retstart+fetch_retmax+fetch_retmode+fetch_rettype
    print(fetch_url)
    # open the efetch url
    f = urllib.request.urlopen(fetch_url)
    fetch_data = f.read().decode('utf-8')
    # split the data into individual abstracts
    abstracts = fetch_data.split("\n\n\n")
    # append to the list all_abstracts
    all_abstracts = all_abstracts + abstracts
    print("a total of " + str(len(all_abstracts)) + " abstracts have been downloaded.\n")
    # wait 2 seconds so we don't get blocked
    sleep(2)
    # update retstart to download the next chunk of abstracts
    retstart = retstart + retmax
    if retstart > total_abstract_count:
        run = False
The script above should display the esearch command, as well as each individual efetch command used to download the data. We observe that 3 efetch commands were called, and that a total of 56 abstracts were downloaded and stored in the list all_abstracts.
You might be wondering: why did we need a loop in the first place? Couldn't we have just set retmax to 56 and run efetch only once? For this particular example, yes. However, the maximum value for retmax is 500. For keyword searches with over 500 abstracts, we must set retmax to 500 and loop until all the abstracts are downloaded.
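As a sanity check, the number of efetch calls required is just the total abstract count divided by retmax, rounded up:

import math

# 56 abstracts fetched 20 at a time -> 3 efetch calls
print(math.ceil(total_abstract_count / retmax))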
len(all_abstracts)
Next, we will split each abstract from all_abstracts into the categories 'Journal', 'Title', 'Authors', 'Author_Information', 'Abstract', 'DOI', and 'Misc'. After splitting each abstract, we will write the information to a csv file for downstream analysis.
with open("abstracts.csv", "wt") as abstracts_file:
abstract_writer = csv.writer(abstracts_file)
abstract_writer.writerow(['Journal', 'Title', 'Authors', 'Author_Information', 'Abstract', 'DOI', 'Misc'])
#For each abstract, split into categories and write it to the csv file
for abstract in all_abstracts:
#To obtain categories, split every double newline.
split_abstract = abstract.split("\n\n")
abstract_writer.writerow(split_abstract)
Examining the resulting csv file, we notice that some abstracts are missing pieces of information. Some lack the author information, and some lack the abstract text entirely. Since we are only interested in abstracts with complete information, we should segregate the incomplete abstracts into a separate file. To do this, we can just include a simple if/else clause and write to two files. The code below should do so; the incomplete abstracts will be written to a file called partial_abstracts.csv.
with open("abstracts.csv", "wt") as abstracts_file, open ("partial_abstracts.csv", "wt") as partial_abstracts:
# csv writer for full abstracts
abstract_writer = csv.writer(abstracts_file)
abstract_writer.writerow(['Journal', 'Title', 'Authors', 'Author_Information', 'Abstract', 'DOI', 'Misc'])
# csv writer for partial abstracts
partial_abstract_writer = csv.writer(partial_abstracts)
#For each abstract, split into categories and write it to the csv file
for abstract in all_abstracts:
#To obtain categories, split every double newline.
split_abstract = abstract.split("\n\n")
if len(split_abstract) > 5:
abstract_writer.writerow(split_abstract)
else:
partial_abstract_writer.writerow(split_abstract)
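To verify the result (a minimal sketch using only the csv module we already imported), we can read abstracts.csv back in and count the complete records:

with open("abstracts.csv", newline="") as f:
    reader = csv.DictReader(f)
    complete_abstracts = list(reader)
print(str(len(complete_abstracts)) + " complete abstracts written to abstracts.csv")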
Here, I have shown you how to use the NCBI E-utilities esearch and efetch to download abstracts from PubMed, as well as how to write a Python script to batch download all abstracts corresponding to a keyword search.
The file pubmed_extractor.py in this repository contains the Python script that we wrote above, but it will also allow the user to input their desired keyword search. Running the script by typing python pubmed_extractor.py into your terminal should prompt you for a keyword to download PubMed abstracts for. The script will then dump your abstracts into abstracts.csv.
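Adapting the code from this post to do the same is a one-line change (a hypothetical sketch, not necessarily the repo's exact code): replace the hard-coded query with user input.

# prompt the user for a search term instead of hard-coding it
query = input("Enter a PubMed search term: ")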
Sayers E. The E-utilities In-Depth: Parameters, Syntax and More. 2009 May 29. In: Entrez Programming Utilities Help. Bethesda (MD): National Center for Biotechnology Information (US); 2010-. Available from: http://www.ncbi.nlm.nih.gov/books/NBK25499/