Dealing with large CSV files

Hello Everyone,

I added a CSV file with ~2 million rows, but I am experiencing some issues.
Sorting the rows, for example, is extremely slow, and the API returns an out-of-memory error if I try to query all values at once.
I would like to know about best practices when dealing with very big files, and what hardware configurations (e.g. RAM, disk, processor) are recommended for these kinds of environments.

Thank you for your attention.

Just to be clear: are you referring to the CKAN DataStore API?

I don’t have any experience with the CKAN DataStore, but maybe CKAN is not the best environment for dealing with very large files. You might need something like Dask or Hadoop to handle large volumes of data.
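For what it’s worth, Dask can work through a CSV of that size out of core on a single machine, completely outside CKAN. A minimal sketch (the file name and column name below are placeholders):

```python
# Out-of-core sketch with Dask: the CSV is read in partitions and
# nothing is materialised in memory until compute() is called.
# "big_dataset.csv" and "some_numeric_column" are placeholders.
import dask.dataframe as dd

df = dd.read_csv("big_dataset.csv", blocksize="64MB")

row_count = len(df)                                        # streams through the file
top_rows = df.nlargest(100, "some_numeric_column").compute()

print(row_count)
print(top_rows.head())
```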

Yes, we are using the CKAN DataStore API.

I see. Currently our API can’t deal with big files, for example the one with 2 million rows. When we try to query the whole dataset, it returns a memory error, besides being very slow.
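For consumers of the API, paging through datastore_search with limit and offset instead of requesting everything at once should at least keep memory under control. A rough sketch of a client doing that (the base URL and resource_id are placeholders):

```python
# Rough sketch of a client paging through a DataStore resource with
# limit/offset instead of pulling all rows in a single request.
# The base URL and resource_id are placeholders.
import requests

BASE_URL = "https://example-ckan.org/api/3/action/datastore_search"
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"
PAGE_SIZE = 10_000

offset = 0
while True:
    resp = requests.get(
        BASE_URL,
        params={"resource_id": RESOURCE_ID, "limit": PAGE_SIZE, "offset": offset},
        timeout=60,
    )
    resp.raise_for_status()
    records = resp.json()["result"]["records"]
    if not records:
        break
    # process this page (write to disk, aggregate, etc.) rather than keeping it all
    offset += len(records)
```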

Some workarounds we have thought of:

  • If possible, disable the API for the big datasets;
  • Maybe offer the ZIP dataset for download, and a smaller CSV for the preview?
  • Try to increase the memory/processing power of our server, even though we don’t know the exact specifications needed;
  • As a last resort, divide the dataset into pieces.

We will consider the Dask and Hadoop ideas. Is it possible for CKAN to consume the data from a database configured to deal with these scenarios?

Offering the bulk file for download and a small sample for preview sounds pretty sensible to me.
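On the question about a dedicated database: the DataStore is backed by PostgreSQL, and CKAN reads its connection URLs from the ini file, so that database can live on a separate, better-provisioned server. A sketch of the relevant settings (hostname, user names and password are placeholders):

```ini
# production.ini sketch: DataStore on a dedicated PostgreSQL server
# (hostname and credentials below are placeholders)
ckan.datastore.write_url = postgresql://ckan_default:pass@db-datastore.internal/datastore_default
ckan.datastore.read_url = postgresql://datastore_default:pass@db-datastore.internal/datastore_default
```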

I don’t think CKAN was ever conceived to do big data (larger than usually fits in memory) analyses and queries. So if you’re planning to do that using Dask, Hadoop, or other tools, that would be completely separate from CKAN.

(Un)relatedly, for the user (after download), https://visidata.org/ can be a lifesaver with such large CSV files.

Ok, I think we will have to keep the small preview CSV idea. The only problem is that the API won’t be complete then, since the small CSV will also be the data the API serves.
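Generating the preview itself is simple enough; a rough sketch that takes the first chunk of the full file without loading everything into memory (file names and the row count are arbitrary):

```python
# Rough sketch: write a small preview CSV from the full file without
# reading the whole thing into memory. File names and PREVIEW_ROWS are arbitrary.
import pandas as pd

PREVIEW_ROWS = 10_000

reader = pd.read_csv("full_dataset.csv", chunksize=PREVIEW_ROWS)
first_chunk = next(reader)                     # only the first chunk is read
first_chunk.to_csv("preview_dataset.csv", index=False)
```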

I was checking some CKAN instances around the world, and noticed this one:

https://data.wprdc.org/dataset/allegheny-county-tax-liens-filed-and-satisfied/resource/8cd32648-757c-4637-9076-85e144997ca8

Their API is able to handle the large dataset, even though it takes a while for large queries, which makes me wonder how they are dealing with this… Probably a better infrastructure.