Getting structured SEC EDGAR data

  1. Do you have any sense how large a full scrape of the data (the XML portion at least) might be?

I think it is fairly large, but I'm not sure of the exact size. I believe public.resource.org may already have done quite a bit of this for the older filings (pre-2001, IIRC): https://bulk.resource.org/edgar/

  1. Did you ever play with any of the available parsers for the actual SGML filings? [3] looks like it might be quite traumatic to the untrained explorer.

I'm not totally clear on the SGML vs. XBRL distinction. I was focused on getting at the actual “data”, so I concentrated on XBRL (however, the mention of pysec in the SO comments suggests that most libraries may handle both). I had a very short library review here:

I also note that the woman behind RankAndFiled.com must have done some pretty good work here (however, AFAICT none of it is open source).

  1. Similarly, did you ever try out any of the Python tooling for XBRL?

Yes, and I actually managed to get one working. See https://github.com/datasets/edgar/tree/master/scripts
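
To give a rough feel for what the XBRL side involves, here is a minimal sketch using only the Python standard library. It is not the code in the scripts linked above and not any particular XBRL library's API: the `extract_facts` function is a hypothetical helper, the sample document is hand-written (not real filing data), and the us-gaap namespace URI is only illustrative. It assumes a plain (non-inline) XBRL instance where facts are top-level elements carrying a `contextRef` attribute.

```python
import xml.etree.ElementTree as ET

# Hand-written toy XBRL instance for illustration only (assumption: not real filing data).
SAMPLE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
      xmlns:us-gaap="http://fasb.org/us-gaap/2013-01-31">
  <context id="FY2013">
    <entity><identifier scheme="http://www.sec.gov/CIK">0000000000</identifier></entity>
    <period><startDate>2013-01-01</startDate><endDate>2013-12-31</endDate></period>
  </context>
  <unit id="USD"><measure>iso4217:USD</measure></unit>
  <us-gaap:Revenues contextRef="FY2013" unitRef="USD" decimals="0">1000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2013" unitRef="USD" decimals="0">250</us-gaap:NetIncomeLoss>
</xbrl>
"""


def extract_facts(xml_text):
    """Return a list of dicts, one per top-level element that has a contextRef.

    Hypothetical helper: real filings also need context/period resolution,
    footnotes, tuples, inline XBRL handling, and so on.
    """
    root = ET.fromstring(xml_text)
    facts = []
    for el in root:
        context = el.get("contextRef")
        if context is None:
            continue  # skip <context>, <unit>, and other non-fact elements
        # ElementTree encodes namespaced tags as "{namespace}localname".
        if el.tag.startswith("{"):
            namespace, _, concept = el.tag[1:].partition("}")
        else:
            namespace, concept = "", el.tag
        facts.append({
            "concept": concept,
            "namespace": namespace,
            "context": context,
            "unit": el.get("unitRef"),
            "value": (el.text or "").strip(),
        })
    return facts


facts = extract_facts(SAMPLE)
for fact in facts:
    print(fact["concept"], fact["context"], fact["unit"], fact["value"])
```

A dedicated XBRL library would be the better choice for anything beyond a quick look, since this ignores taxonomies and dimensions entirely.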