Dealing with large CSV files

Hello Everyone,

I added a CSV file with ~2 million rows, but I am experiencing some issues.
Sorting the rows, for example, is extremely slow, and the API returns an out-of-memory error if I try to query all values at once.
I would like to know about best practices when dealing with very big files, and what hardware configurations (e.g. RAM, disk, processor) are recommended for these kinds of environments.

Thank you for your attention.

Just to be clear: are you referring to the CKAN DataStore API?

I don’t have any experience with the CKAN DataStore, but maybe CKAN is not the best environment for dealing with very large files. You might need something like Dask or Hadoop to handle large volumes of data.
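For what it’s worth, Dask can work through a CSV of that size out of core on a single machine, completely outside CKAN. A minimal sketch (the file name and column name below are placeholders):

```python
# Out-of-core sketch with Dask: the CSV is read in partitions and
# nothing is materialised in memory until compute() is called.
# "big_dataset.csv" and "some_numeric_column" are placeholders.
import dask.dataframe as dd

df = dd.read_csv("big_dataset.csv", blocksize="64MB")

row_count = len(df)                                        # streams through the file
top_rows = df.nlargest(100, "some_numeric_column").compute()

print(row_count)
print(top_rows.head())
```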

Yes, we are using the CKAN DataStore API.

I see. Currently our API can’t deal with big files, for example the one with 2 million rows. When we try to query the whole dataset, it returns a memory error, besides being very slow.
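For consumers of the API, paging through datastore_search with limit and offset instead of requesting everything at once should at least keep memory under control. A rough sketch of a client doing that (the base URL and resource_id are placeholders):

```python
# Rough sketch of a client paging through a DataStore resource with
# limit/offset instead of pulling all rows in a single request.
# The base URL and resource_id are placeholders.
import requests

BASE_URL = "https://example-ckan.org/api/3/action/datastore_search"
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"
PAGE_SIZE = 10_000

offset = 0
while True:
    resp = requests.get(
        BASE_URL,
        params={"resource_id": RESOURCE_ID, "limit": PAGE_SIZE, "offset": offset},
        timeout=60,
    )
    resp.raise_for_status()
    records = resp.json()["result"]["records"]
    if not records:
        break
    # process this page (write to disk, aggregate, etc.) rather than keeping it all
    offset += len(records)
```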

Some workarounds we have thought of:

  • If possible, disable the API for the big datasets;
  • Maybe offer the ZIP dataset for download, and a smaller CSV for the preview?
  • Try to increase the memory/processing power of our server, even though we don’t know the exact specifications needed;
  • As a last resort, divide the dataset into pieces.

We will consider the Dask and Hadoop ideas. Is it possible for CKAN to consume the data from a database configured to deal with these scenarios?

Offering the bulk file for download and a small sample for preview sounds pretty sensible to me.
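On the question about a dedicated database: the DataStore is backed by PostgreSQL, and CKAN reads its connection URLs from the ini file, so that database can live on a separate, better-provisioned server. A sketch of the relevant settings (hostname, user names and password are placeholders):

```ini
# production.ini sketch: DataStore on a dedicated PostgreSQL server
# (hostname and credentials below are placeholders)
ckan.datastore.write_url = postgresql://ckan_default:pass@db-datastore.internal/datastore_default
ckan.datastore.read_url = postgresql://datastore_default:pass@db-datastore.internal/datastore_default
```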

I don’t think CKAN was ever conceived to do big data (larger than usually fits in memory) analyses and queries. So if you’re planning to do that using Dask, Hadoop, or other tools, that would be completely separate from CKAN.

(Un)relatedly, for the user (after download), https://visidata.org/ can be a lifesaver with such large CSV files.

Ok, I think we will have to keep the small preview CSV idea. The only problem is that the API won’t be complete then, since the small CSV will also be the data the API serves.
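Generating the preview itself is simple enough; a rough sketch that takes the first chunk of the full file without loading everything into memory (file names and the row count are arbitrary):

```python
# Rough sketch: write a small preview CSV from the full file without
# reading the whole thing into memory. File names and PREVIEW_ROWS are arbitrary.
import pandas as pd

PREVIEW_ROWS = 10_000

reader = pd.read_csv("full_dataset.csv", chunksize=PREVIEW_ROWS)
first_chunk = next(reader)                     # only the first chunk is read
first_chunk.to_csv("preview_dataset.csv", index=False)
```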

I was checking some CKAN instances around the world, and noticed this one:

https://data.wprdc.org/dataset/allegheny-county-tax-liens-filed-and-satisfied/resource/8cd32648-757c-4637-9076-85e144997ca8

Their API is able to handle the large dataset, even though it takes a while for large queries, which makes me wonder how they are dealing with this… Probably a better infrastructure.