I was browsing around for info about scraping the SEC’s EDGAR database and was delighted to see that some of the first results were your work on it [1], [2]. I’m thinking about exploring that data casually, and I was wondering whether you could help me with a few questions:
Do you have any sense of how large a full scrape of the data (the XML portion at least) might be?
Did you ever play with any of the available parsers for the actual SGML filings? Judging by [3], this could be quite traumatic for the untrained explorer (there’s a minimal splitting sketch after these questions for a taste of it).
Similarly, did you ever try out any of the Python tooling for XBRL?
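For a sense of what that SGML wrangling involves: the complete-submission `.txt` files on EDGAR aren’t well-formed XML, but each document sits between `<DOCUMENT>` markers with a handful of `<TYPE>`/`<FILENAME>`-style header lines, so a tolerant regex pass gets you surprisingly far. A minimal sketch, not any of the libraries from [3] (the file path is just a placeholder):

```python
import re

# Minimal, tolerant splitter for EDGAR complete-submission .txt files.
# EDGAR's SGML is not well-formed XML, so instead of a real SGML parser
# this just slices on the <DOCUMENT> markers, a common pragmatic approach.
DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TAG_RE = re.compile(r"^<(TYPE|SEQUENCE|FILENAME|DESCRIPTION)>(.*)$", re.MULTILINE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)

def split_submission(raw: str):
    """Yield a (metadata dict, document body) pair for each <DOCUMENT> block."""
    for match in DOC_RE.finditer(raw):
        chunk = match.group(1)
        # Header lines like "<TYPE>10-K" have no closing tag; grab the rest of the line.
        meta = {tag.lower(): value.strip() for tag, value in TAG_RE.findall(chunk)}
        body = TEXT_RE.search(chunk)
        yield meta, body.group(1) if body else ""

with open("some-submission.txt") as fh:  # placeholder: any downloaded submission
    for meta, body in split_submission(fh.read()):
        print(meta.get("type"), meta.get("filename"), len(body))
```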
> Did you ever play with any of the available parsers for the actual SGML filings? Judging by [3], this could be quite traumatic for the untrained explorer.
Not totally clear on the SGML vs. XBRL stuff, I’m afraid - I was focused on getting at more of the actual “data”, so I concentrated on XBRL (though the mention of pysec in the SO comments suggests that most libraries may do both). I had a very short library review here:
I also note that the lady behind RankAndFiled.com must have done some pretty impressive work, though as far as I can tell none of it is open source.
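For what it’s worth, you don’t strictly need a dedicated library just to peek at the numbers: an XBRL instance document is plain XML, and the facts are the children of the root element that carry a `contextRef` attribute. A minimal sketch with lxml, not pysec or any particular library (the file name is a placeholder, and this deliberately ignores units, dimensions, and context details that a real library would resolve):

```python
from lxml import etree

def extract_facts(path):
    """Return (localname, contextRef, value) triples from an XBRL instance."""
    root = etree.parse(path).getroot()
    facts = []
    for el in root:
        # Facts carry a contextRef attribute; contexts, units, and
        # schema references (the bookkeeping elements) do not.
        ctx = el.get("contextRef")
        if ctx is None or el.text is None:
            continue
        qname = etree.QName(el)  # splits "{namespace}localname"
        facts.append((qname.localname, ctx, el.text.strip()))
    return facts

# Placeholder path: any instance document downloaded from EDGAR.
for name, ctx, value in extract_facts("instance.xml")[:10]:
    print(name, ctx, value)
```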
> Similarly, did you ever try out any of the Python tooling for XBRL?
A little update on this: I’ve been experimenting over the last week or so. The full XBRL-age download (i.e. post-2005) seems to be around 160 GB. I’m currently also downloading the SGML filing documents going back to 1995, which looks like it will land somewhere between 250 and 750 GB (still downloading).
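For anyone who wants to reproduce a rough estimate like this: the quarterly `master.idx` files under https://www.sec.gov/Archives/edgar/full-index/ list every filing with its path, so you can sample a few filings per quarter and sum their `Content-Length` headers. A back-of-the-envelope sketch; the sample size, sleep interval, and User-Agent string are my own assumptions (the SEC rejects anonymous clients and asks for gentle request rates):

```python
import time
import urllib.request

BASE = "https://www.sec.gov/Archives"
HEADERS = {"User-Agent": "research-scraper yourname@example.com"}  # assumption: use your own contact

def fetch(url, method="GET"):
    req = urllib.request.Request(url, headers=HEADERS, method=method)
    return urllib.request.urlopen(req)

def estimate_quarter_bytes(year, qtr, sample=50):
    """Naively extrapolate a quarter's total size from a small sample of filings."""
    idx_url = f"{BASE}/edgar/full-index/{year}/QTR{qtr}/master.idx"
    lines = fetch(idx_url).read().decode("latin-1").splitlines()
    # Data rows look like: CIK|Company Name|Form Type|Date Filed|edgar/data/.../file.txt
    paths = [l.split("|")[4] for l in lines
             if l.count("|") == 4 and l.endswith(".txt")]
    total = 0
    for path in paths[:sample]:
        # HEAD request: read the size without downloading the filing itself.
        resp = fetch(f"{BASE}/{path}", method="HEAD")
        total += int(resp.headers.get("Content-Length", 0))
        time.sleep(0.2)  # stay well under the published rate limits
    counted = max(min(sample, len(paths)), 1)
    return total / counted * len(paths)

print(f"1995 QTR1 ~ {estimate_quarter_bytes(1995, 1) / 1e9:.2f} GB")
```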