Scrapekit: request for feedback

pudo · August 30, 2014, 2:17pm

Hey all, I wanted to get people’s feedback on scrapekit - a Python scraping utility that I’ve started working on.

The goal is to provide more automation than the traditional mix of requests and lxml/BeautifulSoup would give people - but at the same time to remain less prescriptive than scrapy.

A feature I’m especially excited about is the report generator: whenever a scraper terminates, it generates a flat-file HTML report on its most recent runs, detailing which sites were downloaded and what errors occurred.

This is important because, when writing scrapers, it can be hard to find the right trade-off in how lax to be about small issues like missing fields or temporary downtime.
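
To make the report idea concrete, here is a rough sketch of what a flat-file HTML summary of a run could look like. It is not scrapekit’s actual implementation: `FetchResult` and `render_report` are hypothetical names made up for this illustration, and it uses only the standard library.

```python
import html
from collections import namedtuple

# Hypothetical record of one fetch attempt; not scrapekit's real data model.
FetchResult = namedtuple("FetchResult", ["url", "status", "error"])


def render_report(results):
    """Return a standalone HTML page listing each URL, its status and any error."""
    rows = []
    for r in results:
        status = "-" if r.status is None else str(r.status)
        rows.append(
            "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                html.escape(r.url), status, html.escape(r.error or "")
            )
        )
    # Count the fetches that recorded an error message.
    failed = sum(1 for r in results if r.error)
    return (
        "<!DOCTYPE html>\n<html><body>\n"
        "<h1>Scraper report</h1>\n"
        "<p>{} requests, {} with errors</p>\n"
        "<table>\n<tr><th>URL</th><th>Status</th><th>Error</th></tr>\n"
        "{}\n</table>\n</body></html>\n"
    ).format(len(results), failed, "\n".join(rows))
```

Writing the returned string to a file after each run gives a report that can be opened in a browser without any server, which is the appeal of the flat-file approach.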

stwe · August 30, 2014, 7:01pm

I tried it briefly a couple of days ago and what I really liked was that it gets out of your way while still providing nice features in the background. Maybe I’m biased because I use the stack in scrapekit anyway, but it’s a good and common stack (requests, lxml).

Possibly see Marian’s blog post on his stack for some more inspiration.

And put scrapekit into the morph.io stack: GitHub - openaustralia/morph-docker-python: Docker image for running Python scrapers in Morph

pudo · September 1, 2014, 8:56am

Thanks for the hint with morph.io - I would really like to get it included there. One feature they should have is an “S3 sync after run”, where you can have a certain subdir of your scraper uploaded automatically to a bucket somewhere after each run.

rufuspollock · September 2, 2014, 8:47am

+1 on the push (or sync) of a subdir after each run

steko · November 29, 2014, 10:40am

A bit late to the party here, but first of all thanks for creating this tool. A few months ago I tried building a set of scrapers with Node.js and the scraperjs library but I got nowhere due to my lack of familiarity with Node.js and modern JavaScript. Now I’m starting over with scrapekit.

Is Python 3 on the radar? I pip installed scrapekit without any problem in a Python 3.4 virtualenv, only to find that there are several places in the code where minor fixes (e.g. 2to3 stuff) could make the library run on Python 3. Dependencies are compatible, I think?
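
For illustration, the kind of minor fix meant here usually looks something like the sketch below. These are generic Python 2/3 compatibility patterns, not lines taken from scrapekit’s source; `text_type` and `to_text` are hypothetical helper names.

```python
import sys

PY3 = sys.version_info[0] >= 3

# Modules that moved between Python 2 and 3 can be imported with a fallback.
try:
    from urllib.parse import urljoin  # Python 3 location
except ImportError:
    from urlparse import urljoin  # Python 2 location

# The text type is `str` on Python 3 and `unicode` on Python 2.
if PY3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only defined on Python 2)


def to_text(value, encoding="utf-8"):
    """Decode byte strings; convert anything else to the native text type."""
    if isinstance(value, bytes) and not isinstance(value, text_type):
        return value.decode(encoding)
    return text_type(value)
```

With shims like these in one place, the rest of the code can call `urljoin` and `to_text` without caring which interpreter it runs on.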

pudo · December 1, 2014, 5:52am

I personally just don’t have enough of a use case for Python 3, but I’d be more than keen to accept any pull requests to that effect and to help with a bit of testing. As you say, it shouldn’t be too much work!
