Good Crawling Scripts

The Canadian Government is required to post its spending data in a format like this.

This isn’t exactly a nice, machine-parseable format. Surely someone has published a script to take pages like this and aggregate them into a CSV or JSON file.


It’s a list of contracts rather than a list of spending items (though contracts are spending, of course) :wink:


It is quite easy to build a scraper to convert the web page into a CSV or another structured format; we do a similar operation for the official notice boards of Italian public administrations (see albopop.it). The correct output format for the information you are mentioning is the W3C-approved Public Contracts vocabulary. However, it would be a cheaper and more correct approach to ask the administration itself to publish this data in a machine-readable form.


Currently we are looking at BeautifulSoup to pull in the data, although we’ve written a script in PHP too.


Can it be written in R? I have one here that works… if you have R installed. You might have to change the last line, where it sets the save location on your desktop ("~/Desktop/canadian_contracts.csv"). If you want it somewhere else, or you get an error, change that line to the path where you want to save the file.
install.packages("rvest")
library(rvest)

# URL of the quarterly contract disclosure page
url <- "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"

# Read the page, pull out the contracts table, and convert it to a data frame
canadian_contracts <- read_html(url) %>%
  html_nodes(".wet-boew-zebra") %>%
  html_table()

write.csv(canadian_contracts, "~/Desktop/canadian_contracts.csv")


I just installed this from scratch. Needed:
install.packages("xml2")


There’s something missing in here, but it’s quite interesting to see how simple it almost is…

It started to mess up around the URL. Not sure how to write that in R properly.


Sorry for the slow reply - I wish this sent notifications when someone replies.

You should be able to copy/paste the URL exactly as it is in the browser. Make sure there are no line breaks or spaces in the URL after you copy it. You can also just paste the URL inside the parentheses.

Feel free to send the error message.

So partly it’s about wrapping my head around how R works. But it’s also just seeing failures like this:

> install.packages("rvest")
Installing package into ‘/usr/local/lib/R/3.3/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://muug.ca/mirror/cran/src/contrib/rvest_0.3.2.tar.gz'
Content type 'application/x-gzip' length 1597137 bytes (1.5 MB)
==================================================
downloaded 1.5 MB

* installing *source* package ‘rvest’ ...
** package ‘rvest’ successfully unpacked and MD5 sums checked
** R
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (rvest)

The downloaded source packages are in
	‘/private/var/folders/6b/v4yjrkf97tx832lc7byhwr7r0000gn/T/Rtmp22MyJs/downloaded_packages’
> library(rvest)
> url = "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"
> canadian_contracts <- read_html(url)
> html_nodes(".wet-boew-zebra")
Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "character"
>  html_table()
Error in UseMethod("html_table") : 
  no applicable method for 'html_table' applied to an object of class "NULL"
> write.csv(canadian_contracts, "~/desktop")
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  cannot coerce class "c("xml_document", "xml_node")" to a data.frame
>

That’s not the best chunk of code for seeing how R works, because it uses a pipe operation (%>%).
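In case it helps, here is roughly the same thing written out step by step, without the pipe. Each %>% just passes the result on its left as the first argument to the function on its right, which is why calling html_nodes() and html_table() on their own lines in the log above fails:

library(rvest)

url <- "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"

page <- read_html(url)                         # download and parse the page
tables <- html_nodes(page, ".wet-boew-zebra")  # select the contracts table
canadian_contracts <- html_table(tables)       # convert it to a data frame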

From the error and your previous comment, it looks like you then need to load the xml2 library at the top of the script:

library(xml2)

I’m away from my computer today, but I can post the csv tomorrow if this doesn’t resolve the error.

Here is an updated script that installs and loads the xml2 package. Also, the data itself has commas, so I changed it to a tab-separated text file. It should still import into Excel if you want that.

In case this script still doesn’t work, it’s here on Google Drive: canadian_contracts.tsv - Google Sheets

install.packages("rvest")
install.packages("xml2")
library(rvest)
library(xml2)

url <- "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"

# Read the page, pull out the contracts table, and convert it to a data frame
canadian_contracts <- read_html(url) %>%
  html_nodes(".wet-boew-zebra") %>%
  html_table()

# The data contains commas, so write a tab-separated file instead of a CSV
write.table(canadian_contracts, "~/Desktop/canadian_contracts.tsv", sep = "\t")


OK, so I was finally able to replicate this. Thanks! I realize now there were a few things I was getting stuck on; the %>% functionality was definitely one of them.

Is there a way to crawl multiple pages, like the parent page Reports - Disclosure of Contracts - Proactive Disclosure - PWGSC?

There are thousands of these pages on various government sites.

Thanks!

Yes, you could crawl all these pages. I’m not sure there’s a simple way to do it generically for all the pages you mentioned. You may have to code a crawler individually for each page.

Basically, you would write a script that goes to the page with the multiple links and saves those links to an array. Then loop through the array with the script above (assuming the tables on each page have the same XPath or CSS selector); there’s a rough sketch of that below.

In most browsers you can find the xpath by right-clicking and selecting the appropriate option.
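To make that concrete, here is a minimal sketch of such a loop in R. It is untested against the real index page: the index URL is a placeholder, the "PF=CL" link filter is a guess based on the one quarterly URL above, and it assumes every report page uses the same .wet-boew-zebra table, so all of those would need checking against the actual page source.

library(rvest)
library(xml2)

# Placeholder: paste the URL of the "Reports - Disclosure of Contracts" index page here
index_url <- "PASTE-THE-INDEX-PAGE-URL-HERE"

# Collect every link on the index page, then keep only the quarterly report pages.
links <- read_html(index_url) %>%
  html_nodes("a") %>%
  html_attr("href")
report_links <- links[!is.na(links) & grepl("PF=CL", links)]  # "PF=CL" pattern is a guess

# Loop through the report pages and scrape each contracts table.
all_contracts <- list()
for (link in report_links) {
  page_url <- url_absolute(link, index_url)  # resolve relative links against the index URL
  tables <- read_html(page_url) %>%
    html_nodes(".wet-boew-zebra") %>%
    html_table()
  if (length(tables) > 0) {
    all_contracts[[page_url]] <- tables[[1]]
  }
  Sys.sleep(1)  # be polite to the server
}

# Stack everything into one table (assumes the columns match across quarters) and save it.
combined <- do.call(rbind, all_contracts)
write.table(combined, "~/Desktop/canadian_contracts_all.tsv", sep = "\t", row.names = FALSE)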

I have not dug in here in detail. @mgifford could you post an excerpt of the actual format in this thread …

My 2c is that the best scraping/parsing language for simple stuff is still Python. But obviously it is usually best to go with whatever language you know well.

We are making some progress on parsing this info. You can see our GitHub page here.

We’re parsing a list of all corporations as well as going through a list of 100 or so datasets containing procurement information. Much of the scraping of the contracting data was done in PHP.

We may well use Node.js to do the parsing, based on people’s experience on the team.

@ryan thanks again for your support with R. I kept hoping there’d be a simpler process.

@rufuspollock The trouble is that there is more than one format. We’re building a local cache of HTML files that we will work from. Our expectation is that it will be faster to do any work locally once the content is all there.
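For anyone who wants to do that caching step in R rather than PHP or Node.js, here is a minimal sketch; the URL list and the file-naming scheme are purely illustrative.

library(rvest)

# Illustrative list of report URLs (in practice, gathered from the index page as above)
report_urls <- c(
  "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"
)

cache_dir <- path.expand("~/Desktop/contracts_cache")
dir.create(cache_dir, showWarnings = FALSE)

for (u in report_urls) {
  # Build a file name from the URL and skip anything already downloaded
  destination <- file.path(cache_dir, paste0(make.names(u), ".html"))
  if (!file.exists(destination)) {
    download.file(u, destination, quiet = TRUE)
    Sys.sleep(1)  # be polite to the server
  }
}

# The parsing can then run entirely against the local copies, for example:
# read_html(destination) %>% html_nodes(".wet-boew-zebra") %>% html_table()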