Getting structured SEC EDGAR data


#1

@pudo wrote:

I was browsing around for info about scraping the SEC’s EDGAR database and was delighted to see that some of the first results were your work on it [1], [2]. I’m thinking about looking into that data casually, and I was wondering whether you might be able to help me with a few questions:

  1. Do you have any sense how large a full scrape of the data (the XML portion at least) might be?

  2. Did you ever play with any of the available parsers for the actual SGML filings? [3] looks like this might be quite traumatic to the untrained explorer.

  3. Similarly, did you ever try out any of the Python tooling for XBRL?

[1] http://okfnlabs.org/blog/2014/03/04/sec-edgar-database.html
[2] https://github.com/datasets/edgar
[3] https://stackoverflow.com/questions/13504278/parsing-edgar-filings


#2
  1. Do you have any sense how large a full scrape of the data (the XML portion at least) might be?

I think it is pretty large, but I’m not absolutely sure. I think that public.resource.org might have quite a bit already done here for old stuff (pre-2001 IIRC) - https://bulk.resource.org/edgar/

  2. Did you ever play with any of the available parsers for the actual SGML filings? [3] looks like this might be quite traumatic to the untrained explorer.

Not totally clear on the SGML vs XBRL stuff - I was focused on getting more of the “data”, so I focused on XBRL (however, the mention of pysec in the SO comments suggests that most libraries may do both). I had a very short library review here:

I also note that the person behind RankAndFiled.com must have done some pretty good stuff (though I don’t think any of it is open source, AFAICT).

  3. Similarly, did you ever try out any of the Python tooling for XBRL?

Yes, and I actually managed to get one working; see https://github.com/datasets/edgar/tree/master/scripts
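For anyone who just wants to peek at the structured data before committing to a full XBRL library, here is a minimal sketch using only the standard library. The sample document is an invented fragment (real filings carry many more namespaces, contexts, and units), but the idea - every element with a `contextRef` attribute is a reported fact - holds for real instance documents:

```python
# Pull reported facts out of an XBRL instance document with the
# standard library alone. SAMPLE is a made-up fragment for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://fasb.org/us-gaap/2013-01-31">
  <us-gaap:Assets contextRef="FY2013" unitRef="usd" decimals="-6">1500000000</us-gaap:Assets>
  <us-gaap:Liabilities contextRef="FY2013" unitRef="usd" decimals="-6">900000000</us-gaap:Liabilities>
</xbrl>"""

def extract_facts(xml_text):
    """Return (concept, contextRef, value) for every element carrying
    a contextRef attribute, i.e. every reported fact."""
    root = ET.fromstring(xml_text)
    facts = []
    for el in root.iter():
        ctx = el.get("contextRef")
        if ctx is not None:
            # Strip the namespace URI: '{uri}Assets' -> 'Assets'
            concept = el.tag.rsplit("}", 1)[-1]
            facts.append((concept, ctx, el.text.strip()))
    return facts

print(extract_facts(SAMPLE))
# [('Assets', 'FY2013', '1500000000'), ('Liabilities', 'FY2013', '900000000')]
```

A proper XBRL library adds what this sketch ignores: resolving contexts to periods/entities, units, and the taxonomy behind each concept.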


#3

A little update on this: I’ve been experimenting with this over the last week or so. The full XBRL-age download (i.e. post-2005) seems to be around 160 GB, but I’m currently also trying to download the SGML filing documents since 1995, which look to be 250-750 GB (still downloading).

Just for fun: here’s how you can do the scrape as a Hadoop job :slight_smile: https://github.com/pudo/edgar-oil-contracts/blob/master/import_filings.py (used to be: https://github.com/pudo/edgar/blob/master/mredgar/import_filings.py) - it generates handy 20 GB chunks of SEC filings.
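For anyone wanting to reproduce a bulk crawl without Hadoop: EDGAR publishes quarterly index files listing every filing, which you can enumerate and parse before fetching any documents. The URL pattern and the pipe-delimited `master.idx` field order below match sec.gov as I understand it, but verify them before relying on this sketch:

```python
# Sketch: enumerate EDGAR's quarterly master index files and parse one
# line of the pipe-delimited format (CIK|Company|Form|Date|Path).
# URL pattern assumed from sec.gov's full-index layout - verify first.
def index_urls(start_year, end_year):
    base = "https://www.sec.gov/Archives/edgar/full-index"
    return [f"{base}/{year}/QTR{q}/master.idx"
            for year in range(start_year, end_year + 1)
            for q in (1, 2, 3, 4)]

def parse_index_line(line):
    cik, company, form, date, path = line.strip().split("|")
    return {"cik": cik, "company": company, "form": form,
            "date": date, "path": path}

print(index_urls(1995, 1996)[0])
# https://www.sec.gov/Archives/edgar/full-index/1995/QTR1/master.idx
row = parse_index_line(
    "320193|APPLE INC|10-K|1995-12-19|"
    "edgar/data/320193/0000320193-95-000016.txt")
print(row["form"])
# 10-K
```

Note that the SEC asks bulk downloaders to set a descriptive User-Agent header and throttle requests, so any real crawler should do both.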


#4

Does the XBRL include the full text of the filing or just the structured data? (I assume the latter.)

If so, how big do we estimate the full-text dump to be? (Huge, I imagine!)

One thing I’m also keen to know is:

  1. The extent of the information you can get out of the XBRL
  2. The quality of that info (how many fields are missing, when does it get good etc)
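One cheap way to get a first read on point 2 is to measure, per concept, the fraction of filings that actually report it. The records below are invented placeholders standing in for parsed XBRL facts, but the same coverage check works on real extractions:

```python
# Measure field coverage across filings: for each concept, what
# fraction of filings report a value? Sample records are invented.
from collections import Counter

def field_coverage(records):
    """records: one dict per filing, mapping concept -> value (or None).
    Returns {concept: fraction of filings reporting a value}."""
    counts = Counter()
    for rec in records:
        counts.update(k for k, v in rec.items() if v is not None)
    n = len(records)
    return {k: c / n for k, c in counts.items()}

filings = [
    {"Assets": 100, "Revenues": 50, "Liabilities": None},
    {"Assets": 200, "Revenues": None, "Liabilities": 120},
    {"Assets": 300, "Revenues": 80, "Liabilities": 150},
]
print(field_coverage(filings))
# Assets reported in 3/3 filings, Revenues and Liabilities in 2/3
```

Running this per filing year would also show when coverage "gets good" - e.g. whether a concept only starts appearing reliably after the XBRL mandate phased in.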
