Clinical Trials scraping

For our Open Trials project, we are aiming to index and link different data sources on clinical trials, drugs, and health conditions. Toward this end, we’re looking to incorporate structured data from ClinicalTrials.gov. We know lots of work has been done on scraping ClinicalTrials.gov in the past (including by Open Knowledge :smile:). We’ve come up with the following list of past work. Does anyone have experience here? Any pitfalls to avoid?

https://wwwcf2.nlm.nih.gov/nlm_eresources/eresources/search_database.cfm
https://cran.r-project.org/web/packages/rclinicaltrials/vignettes/basics.html

https://github.com/tinfante/ClinicalTrialsScraper
https://classic.scraperwiki.com/views/clinicaltrialsgov_test/

Also this:

There is a project called LinkedCT, which crawls ClinicalTrials.gov and turns its data into linked data, with links to other datasets including DrugBank, DailyMed, PubMed, Wikipedia, etc. However, I suspect the data on LinkedCT is not up to date.

You can find more details about the project in the paper at: ftp://ftp.cs.toronto.edu/csrg-technical-reports/596/LinkedCT.pdf

Thanks @jgkim, excellent suggestion!

Why would you scrape it instead of just downloading and parsing the XML?

For ClinicalTrials.gov specifically, yes, that is what we will be doing. We say “scraping”, but we generally mean “data acquisition” :).
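
For anyone following along, the download-and-parse approach might look roughly like the sketch below. It is a minimal illustration, not the project's actual pipeline: the sample record is made up, and the element names (`clinical_study`, `id_info/nct_id`, `brief_title`, `condition`, `intervention`) are assumptions modelled on the ClinicalTrials.gov XML export, so check them against a real downloaded file. The `parse_trial` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample record. The element names are assumptions modelled on the
# ClinicalTrials.gov XML export; verify them against a real downloaded file.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Example Trial</brief_title>
  <condition>Hypertension</condition>
  <condition>Diabetes</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>Example Drug</intervention_name>
  </intervention>
</clinical_study>"""


def parse_trial(xml_text):
    """Extract a few fields from one trial record into a plain dict."""
    root = ET.fromstring(xml_text)
    return {
        # findtext returns None when the element is missing
        "nct_id": root.findtext("id_info/nct_id"),
        "title": root.findtext("brief_title"),
        # a trial can list several conditions and interventions
        "conditions": [c.text for c in root.findall("condition")],
        "interventions": [
            (i.findtext("intervention_type"), i.findtext("intervention_name"))
            for i in root.findall("intervention")
        ],
    }


record = parse_trial(SAMPLE_XML)
print(record["nct_id"], record["conditions"], record["interventions"])
```

In practice you would read each downloaded XML file from disk and pass its contents to the same function; the flat dict keeps conditions and drug names easy to link against other datasets.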