Datapackages management: get tables and series (fields) easily


#1

Background

I work a lot with pandas, munging and merging data and saving to new xlsx, csv, or hdf files as needed. A common workflow for me is to load some tables (sometimes resources, then tables) into dataframes, work with the data, and store the end product in other table(s) and resource(s).

I was planning to design a simple python module with a map of all the datapackages I use (locally and remotely) that would allow me to query for tables inside resources and for individual series inside tables (and get them as dataframes or series, in one line of code). Has something similar already been developed? Would it be useful to anyone else?

Requirements

This is a rough draft, so if anyone has better ideas, please add to it. Briefly, I would like to:

  1. Be able to define Collections of Datapackages (a list of paths or urls, basically)
  2. Query for tables (description/metadata) and fields (description/metadata)
  3. Get the actual table (as pandas.DataFrame) or series (as pandas.Series).

Initial proposal

Put simply, I thought of something like the sketch below.

class DpsManager():
    '''Manager of datapackages'''

    collections = {'brazil_datapackages': ['./brazils_geography.json',
                                           './brazils_economy.json']}

    def query_collection_for_table(self, collection, query_parameter):
        return metadata_of_tables_matching_query_parameters

    def query_collection_for_fields(self, collection, query_parameter):
        return metadata_of_fields_matching_query_parameters

    def get_table(self, resource_name, table_name):
        return dataframe

    def get_series(self, resource_name, table_name, field_name):
        return series
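To make the scanning idea concrete, here is a minimal runnable sketch of the query method, assuming each datapackage is already loaded as a plain descriptor dict. All names here (`DpsManager`, `query_collection_for_tables`, the metadata keys scanned) are illustrative, not an existing API.

```python
class DpsManager:
    '''Manage named collections of datapackage descriptors (sketch).'''

    def __init__(self, collections=None):
        # collection name -> list of datapackage descriptor dicts
        self.collections = collections or {}

    def query_collection_for_tables(self, collection, query_parameter):
        '''Return resource metadata whose name/title/description
        contains query_parameter (case-insensitive).'''
        matches = []
        for descriptor in self.collections.get(collection, []):
            for resource in descriptor.get('resources', []):
                text = ' '.join(str(resource.get(key, ''))
                                for key in ('name', 'title', 'description'))
                if query_parameter.lower() in text.lower():
                    matches.append(resource)
        return matches


# Two toy descriptors standing in for loaded datapackage.json files
geography = {'name': 'brazils_geography',
             'resources': [{'name': 'brazil_states',
                            'title': 'States of Brazil'}]}
economy = {'name': 'brazils_economy',
           'resources': [{'name': 'brazil_gdp', 'title': 'GDP by year'}]}

manager = DpsManager({'brazil_datapackages': [geography, economy]})
print(manager.query_collection_for_tables('brazil_datapackages', 'states'))
```

A real implementation would load each descriptor with datapackage-py instead of hand-built dicts, but the scan-and-match loop would look much the same.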

Furthermore, I would like to store ‘collections’ in permanent storage, and to be able to add datapackages to (and remove them from) collections.


#2

Great to hear from you and to hear about the workflow - it is always really useful to hear about what people are doing and want to do “in the wild”.

I assume you are familiar with the Data Package python module:

What would you need added to that? Is it the ability to scan across lots of data packages at once?


#3

Exactly that ability.

I delved deeper into datapackage-py after I wrote the post, and slightly changed my approach. Right now, I am trying the following:

  1. Add all my tabular data (in Excel files as well as in HDFStores) to one datapackage.json.
  2. Each resource is uniquely identified by a path (HDFStore file or Excel file) AND a key (table key in the HDFStore or worksheet name in Excel files)
  3. Query the datapackage.descriptor[‘resources’] entries for a given query_param (string)

So, when I search for “brazil”, the query returns all resources that contain the word “brazil” in any value of a given set of their keys.
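That query over a single descriptor can be sketched in a few lines. Here the ‘key’ field and the `query_resources` name are my assumptions, and the set of keys scanned is configurable:

```python
def query_resources(descriptor, query_param,
                    keys=('name', 'path', 'key', 'title')):
    '''Return resources whose values under `keys` contain query_param
    (case-insensitive substring match).'''
    query = query_param.lower()
    return [resource for resource in descriptor.get('resources', [])
            if any(query in str(resource.get(k, '')).lower() for k in keys)]


# Toy descriptor: each resource identified by a path AND a key
descriptor = {'resources': [
    {'name': 'brazil_gdp', 'path': 'economy.h5', 'key': 'gdp'},
    {'name': 'argentina_gdp', 'path': 'economy.h5', 'key': 'arg_gdp'},
]}
print(query_resources(descriptor, 'brazil'))
```

Searching for “brazil” here matches only the first resource, since the second contains the string in none of the scanned keys.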

My intention with the post is to discuss which would be the best approach. By best, I mean one that would:

  1. Be in greater accordance with the Frictionless Data specifications.
  2. Integrate better with datapackage-py

Despite the approach I am trying, I still think that creating a collection of datapackages and scanning through all of them would be the “best”.

I imagine a use case where, when starting a specific study, I collect datapackages relevant to my study from across different sources (based solely on more superficial metadata), add them all to this study’s Collection, then start a deeper examination of the data in it.


#4

@pedrovgp having thought more about this, I also think you may be looking for a structure to list data packages. Something like a small “Data Package Catalog spec”.

We have thought about this before in this spec issue: https://github.com/frictionlessdata/specs/issues/37

We don’t yet have a decided spec but you may want to chime in there.