- Do you have any sense of how large a full scrape of the data (the XML portion at least) might be?
I think it is pretty large, but I'm not absolutely sure. I think public.resource.org might already have quite a bit done here for the old stuff (pre-2001, IIRC): https://bulk.resource.org/edgar/
- Did you ever play with any of the available parsers for the actual SGML filings? [3] looks like it might be quite traumatic for the untrained explorer.
I'm not totally clear on the SGML vs. XBRL stuff. I was focused on getting more of the “data”, so I concentrated on XBRL (however, the mention of pysec in the SO comments suggests that most libraries may do both). I had a very short library review here:
I also note that the lady behind RankAndFiled.com must have done some pretty good stuff (though, AFAICT, none of it is open source).
- Similarly, did you ever try out any of the Python tooling for XBRL?
Yes, and I actually managed to get one working. See https://github.com/datasets/edgar/tree/master/scripts
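
For a rough sense of what getting the “data” out of XBRL involves, here is a minimal sketch using only the Python standard library. It is not the script from the repo linked above. The sample document, the placeholder CIK, and the `extract_facts` helper are all made up for illustration, and it assumes that facts are simply the elements carrying a `contextRef` attribute, which ignores most of what a real XBRL library handles (units, dimensions, linked taxonomies).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed XBRL instance for illustration only.
# Real filings are much larger and reference external taxonomies.
SAMPLE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
      xmlns:us-gaap="http://fasb.org/us-gaap/2012-01-31">
  <context id="FY2012">
    <entity>
      <identifier scheme="http://www.sec.gov/CIK">0000000000</identifier>
    </entity>
    <period>
      <startDate>2012-01-01</startDate>
      <endDate>2012-12-31</endDate>
    </period>
  </context>
  <unit id="USD"><measure>iso4217:USD</measure></unit>
  <us-gaap:Revenues contextRef="FY2012" unitRef="USD" decimals="0">1000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2012" unitRef="USD" decimals="0">150</us-gaap:NetIncomeLoss>
</xbrl>"""


def extract_facts(xml_text):
    """Return (local name, contextRef, text) for each element with a contextRef."""
    root = ET.fromstring(xml_text)
    facts = []
    for el in root.iter():
        ctx = el.get("contextRef")
        if ctx is None:
            continue  # contexts, units, and the root itself are not facts
        # ElementTree tags look like "{namespace}LocalName"; keep the local part.
        local_name = el.tag.split("}", 1)[-1]
        facts.append((local_name, ctx, (el.text or "").strip()))
    return facts


facts = extract_facts(SAMPLE)
for fact in facts:
    print(fact)
```

Values come back as strings in document order; turning them into numbers and resolving each `contextRef` to its period is where the dedicated libraries start to earn their keep.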