Getting structured SEC EDGAR data

@pudo wrote:

I was browsing around for info about scraping the SEC’s EDGAR database and was delighted to see that some of the first results were your work on it [1], [2]. I’m thinking about looking into that data casually, and I was wondering whether you could help me with a few questions:

  1. Do you have any sense how large a full scrape of the data (the XML portion at least) might be?

  2. Did you ever play with any of the available parsers for the actual SGML filings? Judging by [3], this might be quite traumatic to the untrained explorer.

  3. Similarly, did you ever try out any of the Python tooling for XBRL?

[1] The SEC EDGAR Database - Open Knowledge Labs
[2] datasets/edgar - GitHub (https://github.com/datasets/edgar): SEC EDGAR regulatory filings from publicly-traded US corporations
[3] “Parsing EDGAR filings” - Stack Overflow

  1. Do you have any sense how large a full scrape of the data (the XML portion at least) might be?

I think it is pretty large, but I’m not absolutely sure. public.resource.org may already have done quite a bit of this for the older material (pre-2001, IIRC): https://bulk.resource.org/edgar/
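For a rough number without downloading everything, one approach is to walk EDGAR’s quarterly master indexes and sample `Content-Length` headers. A minimal sketch, assuming the published full-index layout and the pipe-delimited `master.idx` format; the User-Agent string, sampling rate, and throttle are my own choices:

```python
# Sketch: extrapolate a quarter's total size from HEAD requests
# against a sample of the filings listed in its master index.
import time
import requests

BASE = "https://www.sec.gov/Archives"
HEADERS = {"User-Agent": "research size-estimate you@example.com"}  # SEC asks for a contact UA

def filings_in_quarter(year, qtr):
    """Yield archive paths for the .txt submissions in one quarter's master index."""
    idx = requests.get(f"{BASE}/edgar/full-index/{year}/QTR{qtr}/master.idx",
                       headers=HEADERS)
    idx.raise_for_status()
    for line in idx.text.splitlines():
        parts = line.split("|")  # CIK|Company Name|Form Type|Date Filed|Filename
        if len(parts) == 5 and parts[4].endswith(".txt"):
            yield parts[4]

def estimate_quarter_bytes(year, qtr, sample_every=200):
    """HEAD-request every Nth filing and extrapolate the quarter's total size."""
    paths = list(filings_in_quarter(year, qtr))
    sampled = paths[::sample_every]
    total = 0
    for path in sampled:
        resp = requests.head(f"{BASE}/{path}", headers=HEADERS)
        total += int(resp.headers.get("Content-Length", 0))
        time.sleep(0.15)  # stay under SEC's request-rate guidance
    return total * len(paths) // max(len(sampled), 1)

print(estimate_quarter_bytes(2013, 1) / 1e9, "GB (rough)")
```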

  2. Did you ever play with any of the available parsers for the actual SGML filings? Judging by [3], this might be quite traumatic to the untrained explorer.

I’m not totally clear on the SGML vs XBRL distinction - I was focused on getting at the “data”, so I concentrated on XBRL (though the mention of pysec in the SO comments suggests that most libraries may do both). I had a very short library review here:

I also note that the lady behind RankAndFiled.com must have done some pretty good stuff (though I don’t think any of it is open source, AFAICT).
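On the SGML side: the full-submission .txt files are wrapped in a simple dissemination format (`<DOCUMENT>`, `<TYPE>`, `<FILENAME>`, `<TEXT>` markers), so for pulling out individual documents you can often get away without a real SGML parser. A rough sketch; the local file name is hypothetical:

```python
# Sketch: split a full-submission EDGAR .txt file into its component
# documents using the dissemination-format markers, no SGML parser.
import re

DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.S)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)")
FILE_RE = re.compile(r"<FILENAME>([^\s<]+)")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.S)

def split_submission(raw):
    """Yield (form_type, filename, body) for each embedded document."""
    for block in DOC_RE.findall(raw):
        form = TYPE_RE.search(block)
        name = FILE_RE.search(block)
        body = TEXT_RE.search(block)
        yield (form.group(1) if form else None,
               name.group(1) if name else None,
               body.group(1) if body else "")

# hypothetical local copy of one full-submission file
with open("0000320193-13-000099.txt") as fh:
    for form, name, body in split_submission(fh.read()):
        print(form, name, len(body))
```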

  3. Similarly, did you ever try out any of the Python tooling for XBRL?

Yes, and I actually managed to get one working - see https://github.com/datasets/edgar/tree/master/scripts
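For anyone following along: the core of reading an XBRL instance document can also be done with plain lxml and some namespace handling, no dedicated XBRL library. A sketch, with the `us-gaap` prefix and the example file name as assumptions:

```python
# Sketch: iterate the facts in one namespace of an XBRL instance.
from lxml import etree

def extract_facts(path, prefix="us-gaap"):
    """Yield (concept, contextRef, value) for facts in one namespace."""
    root = etree.parse(path).getroot()
    ns = root.nsmap.get(prefix)
    if ns is None:
        return
    for el in root.iter("{%s}*" % ns):
        # facts carry a contextRef attribute; structural elements don't
        if el.text and el.get("contextRef"):
            yield etree.QName(el).localname, el.get("contextRef"), el.text.strip()

# hypothetical instance document pulled out of a filing
for tag, ctx, value in extract_facts("aapl-20130928.xml"):
    print(tag, ctx, value)
```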

A little update on this: I’ve been experimenting with it over the last week or so. The full XBRL-age download (i.e. post-2005) seems to be around 160 GB, and I’m now also downloading the SGML filing documents going back to 1995, which look to be somewhere between 250 and 750 GB (still downloading).
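In case anyone wants to reproduce the download: a minimal sketch of a polite bulk fetcher, driven by the same quarterly master indexes as the size estimate above. The output layout, throttle, and User-Agent are my own choices, not part of any official API:

```python
# Sketch: resumable bulk download of full-submission .txt files.
import os
import time
import requests

BASE = "https://www.sec.gov/Archives"
HEADERS = {"User-Agent": "research bulk-fetch you@example.com"}

def download_quarter(year, qtr, dest="filings"):
    os.makedirs(dest, exist_ok=True)
    idx = requests.get(f"{BASE}/edgar/full-index/{year}/QTR{qtr}/master.idx",
                       headers=HEADERS)
    idx.raise_for_status()
    for line in idx.text.splitlines():
        parts = line.split("|")
        if len(parts) != 5 or not parts[4].endswith(".txt"):
            continue  # skip the index header and non-filing rows
        out = os.path.join(dest, parts[4].replace("/", "_"))
        if os.path.exists(out):
            continue  # resumable: keep what we already fetched
        resp = requests.get(f"{BASE}/{parts[4]}", headers=HEADERS)
        resp.raise_for_status()
        with open(out, "wb") as fh:
            fh.write(resp.content)
        time.sleep(0.2)  # stay well under SEC's request-rate guidance

download_quarter(1995, 1)
```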

Just for fun: here’s how you can apparently do a scraper in Hadoop :slight_smile: https://github.com/pudo/edgar-oil-contracts/blob/master/import_filings.py (used to be: https://github.com/pudo/edgar/blob/master/mredgar/import_filings.py) - it generates handy 20 GB chunks of SEC filings.
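The same chunking trick works without Hadoop: since each full submission is bracketed by `<SEC-DOCUMENT>`…`</SEC-DOCUMENT>` markers, you can concatenate filings into numbered gzip chunks and still re-split them later. A sketch, with a 1 GB cap instead of 20 GB to keep the example tame, and the file layout assumed:

```python
# Sketch: pack downloaded filings into numbered gzip chunks.
import glob
import gzip

CAP = 1 * 1024**3  # bytes per chunk; the linked scripts use ~20 GB

def chunk_filings(paths, prefix="filings-chunk"):
    n, written = 0, 0
    out = gzip.open(f"{prefix}-{n:04d}.gz", "wb")
    for path in paths:
        data = open(path, "rb").read()
        if written and written + len(data) > CAP:
            out.close()
            n, written = n + 1, 0
            out = gzip.open(f"{prefix}-{n:04d}.gz", "wb")
        # plain concatenation stays re-splittable downstream thanks
        # to the <SEC-DOCUMENT> markers in each submission
        out.write(data)
        written += len(data)
    out.close()

chunk_filings(sorted(glob.glob("filings/*.txt")))
```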

Does the XBRL include the full text of the filing or just the structured data? (I assume the latter.)

If it’s just the structured data, how big do we estimate the full-text dump to be? (Huge, I imagine!)

Two things I’m also keen to know:

  1. The extent of the information you can get out of the XBRL.
  2. The quality of that info (how many fields are missing, when does it get good, etc.) - a rough coverage count like the sketch below would be a first pass at this.
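For question 2, one crude first pass is counting how often a few headline us-gaap concepts actually appear across a directory of downloaded XBRL instances. A sketch; the concept list, directory layout, and file names are assumptions, and a real coverage analysis would also check contexts and units:

```python
# Sketch: per-concept coverage across a directory of XBRL instances.
import glob
from collections import Counter
from lxml import etree

WATCHED = {"Revenues", "NetIncomeLoss", "Assets", "Liabilities"}

def gaap_tags(path):
    """Set of us-gaap concept names reported as facts in one instance."""
    root = etree.parse(path).getroot()
    ns = root.nsmap.get("us-gaap")
    if ns is None:
        return set()
    return {etree.QName(el).localname
            for el in root.iter("{%s}*" % ns)
            if el.get("contextRef") is not None}

coverage = Counter()
files = glob.glob("xbrl/*.xml")
for path in files:
    for concept in WATCHED & gaap_tags(path):
        coverage[concept] += 1

for concept, hits in coverage.most_common():
    print(f"{concept}: {hits}/{len(files)} filings")
```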