Clinical Trials scraping


#1

For our Open Trials project, we are aiming to index and link different data sources on clinical trials, drugs, and health conditions. Toward this end, we’re looking to incorporate structured data from ClinicalTrials.gov. We know lots of work has been done on scraping ClinicalTrials.gov in the past (including by Open Knowledge :smile:). We’ve come up with the following list of past work. Does anyone have experience here? Any pitfalls to avoid?



http://blog.ouseful.info/tag/clinicaltrials/
https://ep2013.europython.eu/conference/talks/heavybase-a-python-peer-to-peer-database-for-clinical-trials-and-biobanks
https://wwwcf2.nlm.nih.gov/nlm_eresources/eresources/search_database.cfm
https://cran.r-project.org/web/packages/rclinicaltrials/vignettes/basics.html
https://pypi.python.org/pypi/clinical_trials/1.1




Also this:

ContentMining and Clinical Trials from petermurrayrust


#2

There is a project called LinkedCT, which crawls ClinicalTrials.gov and turns its data into linked data, with links to other datasets including DrugBank, DailyMed, PubMed, Wikipedia, etc. However, I suspect the data on LinkedCT is no longer up to date.

You can find more details about the project in the paper at: ftp://ftp.cs.toronto.edu/csrg-technical-reports/596/LinkedCT.pdf


#3

Thanks @jgkim, excellent suggestion!


Previous scraping work for trial sources other than ClinicalTrials.gov
#4

Why would you scrape it instead of just downloading and parsing the XML?


#5

For ClinicalTrials.gov specifically, yes, that is what we will be doing: downloading the XML and parsing it. We say “scraping”, but we generally mean “data acquisition” :).
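
For anyone following along, here is a rough sketch of what the "parse the XML" step might look like in Python, using only the standard library. The sample record is made up, and the element names (`clinical_study`, `id_info/nct_id`, `brief_title`, `overall_status`, `condition`) are assumptions about the record layout, so check them against a real download before relying on this. `parse_trial` is just an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Made-up record for illustration; the element names are assumptions
# about the ClinicalTrials.gov XML layout, not a verified schema.
SAMPLE_RECORD = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Example trial title</brief_title>
  <overall_status>Completed</overall_status>
  <condition>Hypertension</condition>
  <condition>Diabetes</condition>
</clinical_study>"""


def parse_trial(xml_text):
    """Pull a few fields out of one trial record (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        # findtext returns None when an element is missing
        "nct_id": root.findtext("id_info/nct_id"),
        "title": root.findtext("brief_title"),
        "status": root.findtext("overall_status"),
        # a trial can list several conditions, so collect all of them
        "conditions": [c.text for c in root.findall("condition")],
    }


trial = parse_trial(SAMPLE_RECORD)
print(trial["nct_id"], trial["conditions"])
```

Returning plain dicts keeps the parsing step independent of whatever storage or linking layer comes afterwards.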