index: [HOME] [Collecting data] [Deduplication] [Screening]
We retrieve data from EMBASE via OVID, MEDLINE via PubMed, BioRxiv and MedRxiv.
When searches are updated, references that are identified that were not in the database before, are inserted by date (date_entrez) they were indexed in remote database, the date they are inserted in OUR database is formatted as the ‘strategydate’ (raw data is available here).
EMBASE:
(exp SARS-related coronavirus/ or severe acute respiratory syndrome/ or coronavirus disease 2019/ or (coronavir* or corona virus* or HCoV* or ncov* or 2019 cov or covid or covid19 or sars-cov* or sarscov* or sars-coronavirus* or Severe Acute Respiratory Syndrome Coronavirus* or nCoV).mp.) and 20191101:20301231.(dc).
MEDLINE:
("severe acute respiratory syndrome coronavirus 2"[Supplementary Concept] OR "COVID-19" [Supplementary Concept] OR "coronavirus" OR "corona virus" OR "HCoV" OR "nCoV" OR "2019 CoV" OR "covid" OR "covid19" OR "Severe Acute Respiratory Syndrome Coronavirus 2" OR "SARS-CoV2" OR "SARS-CoV 2" OR "SARS Coronavirus 2") AND (2019/11/01:3000/12/31[PDAT])
EMBASE:
(SARS coronavirus/ or middle east respiratory syndrome/ or severe acute respiratory syndrome/ or (coronavirus* or corona virus* or HCoV* or ncov* or covid or covid19 or sars-cov* or sarscov* or Sars-coronavirus* or Severe Acute Respiratory Syndrome Coronavirus*).mp.) and 20191201:20301231.(dc).
MEDLINE:
("coronavirus"[MH] OR "coronavirus infections"[MH] OR "coronavirus"[TW] OR "corona virus"[TW] OR "HCoV"[TW] OR "nCov"[TW] OR "covid"[TW] OR "covid19"[TW] OR "Severe Acute Respiratory Syndrome Coronavirus 2"[TW] OR "SARS-CoV2"[TW] OR "SARS-CoV 2"[TW] OR "SARS Coronavirus 2"[TW] OR "MERS-CoV"[TW]) AND (2019/1/1:3000[PDAT])
From 01.04.2020, we retrieve the currate BioRxiv/MedRxiv dataset Link
MEDLINE:
("Wuhan coronavirus" [Supplementary Concept] OR "COVID-19" OR SARS-CoV-2 OR "2019 ncov"[tiab] OR (("novel coronavirus"[tiab] OR "new coronavirus"[tiab]) AND (wuhan[tiab] OR 2019[tiab])) OR 2019-nCoV[All Fields] OR (wuhan[tiab] AND coronavirus[tiab]))
EMBASE:
(nCoV or 2019-nCoV or ((new or novel or wuhan) adj3 coronavirus) or covid19 or covid-19 or SARS-CoV-2).mp.
BioRxiv/MedRxiv:
ncov or corona or wuhan or COVID or SARS-CoV-2
With the kind support of the Public Health & Primary Care Library PHC, and following guidance of the Medical Library Association
MEDLINE:
("Wuhan coronavirus" [Supplementary Concept] OR "COVID-19" OR "2019 ncov"[tiab] OR (("novel coronavirus"[tiab] OR "new coronavirus"[tiab]) AND (wuhan[tiab] OR 2019[tiab])) OR 2019-nCoV[All Fields] OR (wuhan[tiab] AND coronavirus[tiab])))))
EMBASE:
ncov OR (wuhan AND corona) OR COVID
BioRxiv/MedRxiv:
ncov or corona or wuhan or COVID
Here we describe how the data is formatted.
We sort the information based on when the data was entered into the external database (MEDLINE, EMBASE, Rxiv), since publication date information is not always complete nor reliable.
PubMed is accessible through the package RISMed. We still need to process the data to a prespecified format.
library("RISmed")
PMquery="(\"Wuhan coronavirus\" [Supplementary Concept] OR \"2019 ncov\"[tiab] OR ((\"novel coronavirus\"[tiab] OR \"new coronavirus\"[tiab]) AND (wuhan[tiab] OR 2019[tiab])) OR 2019-nCoV[All Fields] OR (wuhan[tiab] AND coronavirus[tiab])))))"
Searchpubmed<-function(query){
  search_query <- EUtilsSummary(query, retmax=1000) # we can restrict time with: , mindate=2012, maxdate=2019)
  records<- EUtilsGet(search_query)
  
  #concatenate authors:
  authors<-vector()
  author1<-vector()
  for(n in 1:length(records@Author)){
    tmp<-paste0(records@Author[[n]]$LastName,", ",records@Author[[n]]$Initials)
    tmp2<-paste(tmp, collapse="; ")
    authors<-append(authors,tmp2)
    author1<-append(author1,records@Author[[n]]$LastName[1])
  }
  
  # we need to add additional fields
  pubmed_data <- data.frame('authors'=authors, 'title'=ArticleTitle(records),
                            'abstract'=AbstractText(records),'journal'=MedlineTA(records), 'pages'=MedlinePgn(records),
                            'volume'=Volume(records),'issue'=Issue(records),'year'=YearPubmed(records),
                            'pmid'=PMID(records), 'doi'=ELocationID(records), 
                            'url1'=paste0("https://www.ncbi.nlm.nih.gov/pubmed/",PMID(records)), 'author1'=author1) 
  
 return(pubmed_data) 
}  
Here, we opted for a very simple solution. We grab all DOIs of the last 75 record, and requery these to get the complete citation data:
prep.doi.biorxiv<-function(population){
  pops<-html_text(population)
  doi<-pops[seq(3,length(pops),by=4)]
  doi<-paste0("10.1101/",doi)
  doi <- str_remove(doi, "; ")
}
biorxivmed <- function(url){
  dois <- url %>%
    xml2::read_html() %>%
    html_nodes('.highwire-cite-metadata')
  doi <- prep.doi.biorxiv(dois)
  list.all.doi <- c()
  for (i in 1:length(doi)){
    cat("nr:",i,"doi:",doi[i],"status:")
    doi.get <- get_doi(doi[i])
    cat("retrieved \n")
    list.all.doi <- append(list.all.doi, doi.get)
  }
  return(list.all.doi)
}
get_doi<-function(doi) { #https://rdrr.io/github/wkmor1/doi2bib/src/R/doi2bib.r
  content(
    GET(
      url    = paste0("http://dx.doi.org/", doi),
      config = accept("application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0")
    ),
    as = "text",
    encoding = "UTF-8"
  )
}
We want to return a dataframe that is uniform for all different sources:
doijson <- function(jsondoi){ # returns dataframe from json 
  authors <- vector()
  author1 <- vector()
  for(n in 1:nrow(jsondoi$author)){
    tmp1 <- paste0(jsondoi$author$family[n],", ", gsub("[^A-Z]*([A-Z])[^A-Z]*", "\\1", jsondoi$author$given[n]))
    authors <- append(authors,tmp1)
    authors <- paste(authors, collapse="; ")
    author1 <- jsondoi$author$family[1]
  }
  
  # we need to add additional fields
  
  # some checks for empty fields:
  abstract=ifelse(is.null(jsondoi$abstract),NA,jsondoi$abstract) # check for empty
  journal=ifelse(is.list(jsondoi$`container-title`) & length(jsondoi$`container-title`) == 0,NA,jsondoi$`container-title`) # check for empty
  
  doi_data <- data.frame('authors'=authors, 'title'= jsondoi$title, 'author1'=author1, 'abstract'= str_remove(abstract, "<jats:p>"),
                         'doi'=jsondoi$DOI, 'url1'= jsondoi$URL, 'journal'= journal, "year" = jsondoi$posted$`date-parts`[1], 
                         "month" = jsondoi$posted$`date-parts`[2], "date_entrez" = jsondoi$posted$`date-parts`[3],
                         "institution" = jsondoi$institution$name, "publisher" = jsondoi$publisher)  # add other fields.
  return(doi_data)
}