Good Crawling Scripts

The Canadian Government is required to post its spending data in a format like this.

This isn’t exactly a nice, machine-parseable format. Surely someone has published a script to take pages like this and aggregate them into a CSV or JSON file.


It’s a list of contracts rather than a list of spending items (though contracts are spending, of course) :wink:


It is quite easy to build a scraper to convert the web page into a CSV or another structured format; we do a similar operation for the official notice boards of Italian public administrations (see albopop.it). The correct output format for the information you are mentioning is the W3C-approved Public Contracts vocabulary. However, it would be a cheaper and more correct approach to ask the administration itself to publish this data in a machine-readable form.


Currently we are looking at BeautifulSoup to pull in the data, although we’ve written a script in PHP too.


Can it be written in R? I have one here that works… if you have R installed. You might have to change the last line, where it sets the save location on your desktop ("~/Desktop/canadian_contracts.csv"). If you want it somewhere else, or you get an error, change that line to the path where you want to save the file.
install.packages("rvest")
library(rvest)

# URL of the quarterly contract disclosure page
url <- "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"

# Read the page, pull out the contracts table, and convert it to a data frame
canadian_contracts <- read_html(url) %>%
  html_nodes(".wet-boew-zebra") %>%
  html_table()

write.csv(canadian_contracts, "~/Desktop/canadian_contracts.csv")


I just installed this from scratch. Needed:
install.packages("xml2")


There’s something missing in here, but it’s quite interesting to see how simple it almost is…

It started to mess up around the URL. Not sure how to write that in R properly.


Sorry for the slow reply - I wish this sent notifications when someone replies.

You should be able to copy/paste the URL exactly as it is in the browser. Make sure there are no line breaks or spaces in the URL after you copy it. You can also just paste the URL inside the parentheses.

Feel free to send the error message.

So partly it’s about wrapping my head around how R works. But it’s also just seeing failures like this:

> install.packages("rvest")
Installing package into ‘/usr/local/lib/R/3.3/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://muug.ca/mirror/cran/src/contrib/rvest_0.3.2.tar.gz'
Content type 'application/x-gzip' length 1597137 bytes (1.5 MB)
==================================================
downloaded 1.5 MB

* installing *source* package ‘rvest’ ...
** package ‘rvest’ successfully unpacked and MD5 sums checked
** R
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (rvest)

The downloaded source packages are in
	‘/private/var/folders/6b/v4yjrkf97tx832lc7byhwr7r0000gn/T/Rtmp22MyJs/downloaded_packages’
> library(rvest)
> url = "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"
> canadian_contracts <- read_html(url)
> html_nodes(".wet-boew-zebra")
Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "character"
>  html_table()
Error in UseMethod("html_table") : 
  no applicable method for 'html_table' applied to an object of class "NULL"
> write.csv(canadian_contracts, "~/desktop")
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  cannot coerce class "c("xml_document", "xml_node")" to a data.frame
>

That’s not the best chunk of code for seeing how R works, because it uses a pipe operation (%>%).
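In case it helps, here is roughly the same thing written out step by step, without the pipe. Each %>% just passes the result on its left as the first argument to the function on its right, which is why calling html_nodes() and html_table() on their own lines in the log above fails:

library(rvest)

url <- "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"

page <- read_html(url)                         # download and parse the page
tables <- html_nodes(page, ".wet-boew-zebra")  # select the contracts table
canadian_contracts <- html_table(tables)       # convert it to a data frame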

From the error and your previous comment, it looks like you then need to load the xml2 library at the top of the script:

library(xml2)

I’m away from my computer today, but I can post the csv tomorrow if this doesn’t resolve the error.

Here is an updated script that installs and loads the xml2 package. Also, the data itself has commas, so I changed it to a tab-separated text file. It should still import into Excel if you want that.

In case this script still doesn’t work, it’s here on Google Drive: canadian_contracts.tsv - Google Sheets

install.packages("rvest")
install.packages("xml2")
library(rvest)
library(xml2)

url <- "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"

# Read the page, pull out the contracts table, and convert it to a data frame
canadian_contracts <- read_html(url) %>%
  html_nodes(".wet-boew-zebra") %>%
  html_table()

# The data contains commas, so write a tab-separated file instead of a CSV
write.table(canadian_contracts, "~/Desktop/canadian_contracts.tsv", sep = "\t")


OK, so I was finally able to replicate this. Thanks! I realize now there were a few things I was getting stuck on; the %>% functionality was definitely one of them.

Is there a way to crawl multiple pages, like the parent page Reports - Disclosure of Contracts - Proactive Disclosure - PWGSC?

There are thousands of these pages on various government sites.

Thanks!

Yes, you could crawl all these pages. I’m not sure there’s a simple way to do it generically for all the pages you mentioned. You may have to code a crawler individually for each page.

Basically, you would write a script that goes to the page with the multiple links and saves those links to an array. Then loop through the array with the script above (assuming the tables on each page have the same XPath or CSS selector); there’s a rough sketch of that below.

In most browsers you can find the xpath by right-clicking and selecting the appropriate option.
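To make that concrete, here is a minimal sketch of such a loop in R. It is untested against the real index page: the index URL is a placeholder, the "PF=CL" link filter is a guess based on the one quarterly URL above, and it assumes every report page uses the same .wet-boew-zebra table, so all of those would need checking against the actual page source.

library(rvest)
library(xml2)

# Placeholder: paste the URL of the "Reports - Disclosure of Contracts" index page here
index_url <- "PASTE-THE-INDEX-PAGE-URL-HERE"

# Collect every link on the index page, then keep only the quarterly report pages.
links <- read_html(index_url) %>%
  html_nodes("a") %>%
  html_attr("href")
report_links <- links[!is.na(links) & grepl("PF=CL", links)]  # "PF=CL" pattern is a guess

# Loop through the report pages and scrape each contracts table.
all_contracts <- list()
for (link in report_links) {
  page_url <- url_absolute(link, index_url)  # resolve relative links against the index URL
  tables <- read_html(page_url) %>%
    html_nodes(".wet-boew-zebra") %>%
    html_table()
  if (length(tables) > 0) {
    all_contracts[[page_url]] <- tables[[1]]
  }
  Sys.sleep(1)  # be polite to the server
}

# Stack everything into one table (assumes the columns match across quarters) and save it.
combined <- do.call(rbind, all_contracts)
write.table(combined, "~/Desktop/canadian_contracts_all.tsv", sep = "\t", row.names = FALSE)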

I have not dug in here in detail. @mgifford could you post an excerpt of the actual format in this thread …

My 2c is that the best scraping/parsing language for simple stuff is still Python. But obviously it is usually best to go with whatever language you know well.

We are making some progress on parsing this info. You can see our GitHub page here.

We’re parsing a list of all corporations as well as going through a list of 100 or so datasets containing procurement information. Much of the scraping of the contracting data was done in PHP.

We may well use Node.js to do the parsing, based on people’s experience on the team.

@ryan thanks again for your support with R. I kept hoping there’d be a simpler process.

@rufuspollock The trouble is that there is more than one format. We’re building a local cache of HTML files that we will work from. Our expectation is that it will be faster to do any work locally once the content is all there.
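For anyone who wants to do that caching step in R rather than PHP or Node.js, here is a minimal sketch; the URL list and the file-naming scheme are purely illustrative.

library(rvest)

# Illustrative list of report URLs (in practice, gathered from the index page as above)
report_urls <- c(
  "https://www.tpsgc-pwgsc.gc.ca/cgi-bin/proactive/cl.pl?lang=eng;SCR=L;Sort=0;PF=CL201516Q4.txt"
)

cache_dir <- path.expand("~/Desktop/contracts_cache")
dir.create(cache_dir, showWarnings = FALSE)

for (u in report_urls) {
  # Build a file name from the URL and skip anything already downloaded
  destination <- file.path(cache_dir, paste0(make.names(u), ".html"))
  if (!file.exists(destination)) {
    download.file(u, destination, quiet = TRUE)
    Sys.sleep(1)  # be polite to the server
  }
}

# The parsing can then run entirely against the local copies, for example:
# read_html(destination) %>% html_nodes(".wet-boew-zebra") %>% html_table()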