Hey all, I wanted to get people’s feedback on scrapekit - a Python scraping utility that I’ve started working on.
The goal is to provide more automation than the traditional mix of requests and lxml/BeautifulSoup, while remaining less prescriptive than scrapy.
A feature I’m especially excited about is the reports generator: whenever a scraper terminates, it generates a flat-file HTML report on its last runs, detailing which sites were downloaded and what errors may have occurred.
This is important because, when writing scrapers, it can be hard to judge how lax you should be in handling small issues like missing fields or temporary downtime.
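To make that trade-off concrete, here is a rough, stdlib-only sketch of the idea. It is not scrapekit's actual API: `fetch_with_retry`, `extract_record` and `render_report` are hypothetical names made up for illustration, and the `fetch` callable stands in for whatever HTTP client you use.

```python
import html
import time


def fetch_with_retry(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url), retrying on IOError to ride out temporary downtime."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except IOError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)  # wait before the next attempt
    raise last_error  # every attempt failed, so surface the last error


def extract_record(raw, required=("title",), optional=("author",)):
    """Be strict about required fields and lax about optional ones."""
    missing = [key for key in required if key not in raw]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    record = {key: raw[key] for key in required}
    for key in optional:
        record[key] = raw.get(key)  # None when the field is absent
    return record


def render_report(results):
    """Render (url, error-or-None) pairs as a flat HTML table."""
    rows = []
    for url, error in results:
        status = "ok" if error is None else html.escape(str(error))
        rows.append("<tr><td>%s</td><td>%s</td></tr>" % (html.escape(url), status))
    return "<table>\n%s\n</table>" % "\n".join(rows)
```

The point is only that transient failures get retried, optional fields degrade to `None`, and whatever still fails ends up in a report you can look at after the run instead of killing the scraper.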
I tried it briefly a couple of days ago and what I really liked was that it gets out of your way while still providing nice features in the background. Maybe I’m biased because I use the stack in scrapekit anyway, but it’s a good and common stack (requests, lxml).
Thanks for the hint about morph.io - I would really like to get it included there. One feature they should have is an “S3 sync after run”, where you can have a certain subdir of your scraper uploaded automatically to a bucket somewhere after each run.
A bit late to the party here, but first of all thanks for creating this tool. A few months ago I tried building a set of scrapers with Node.js and the scraperjs library but I got nowhere due to my lack of familiarity with Node.js and modern JavaScript. Now I’m starting over with scrapekit.
Is Python 3 on the radar? I pip installed scrapekit without any problem in a Python 3.4 virtualenv, only to find that there are several places in the code where minor (e.g. 2to3 stuff) fixes could make the library run on Python 3. Dependencies are compatible, I think?
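For reference, the kind of minor fix I have in mind looks roughly like the sketch below. This is a generic Python 2/3 compatibility pattern, not code taken from scrapekit, and `to_text` is a hypothetical helper name.

```python
from __future__ import print_function, unicode_literals

try:
    from urllib.parse import urljoin  # Python 3 location
except ImportError:
    from urlparse import urljoin  # Python 2 location

try:
    text_type = unicode  # Python 2 text type
except NameError:
    text_type = str  # Python 3: str is already text


def to_text(value, encoding="utf-8"):
    """Return value as a text string on both Python 2 and 3."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)
```

Most of what 2to3 flags tends to fall into these buckets: moved stdlib modules, `print` as a function, and the bytes/text split.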
I personally just don’t have enough of a use case for Python 3, but I’d be more than keen to accept any pull requests to that effect and to help with a bit of testing. As you say, it shouldn’t be too much work!