Background
I work a lot with pandas, munging and merging data and saving the results to new xlsx, csv, or hdf files as needed. A common workflow for me is to load some tables (sometimes resources, then tables) into dataframes, work with the data, and store the end product in other table(s) and resource(s).
I was planning to design a simple Python module with a map of all the datapackages I use (locally and remotely) that would allow me to query for tables inside resources and for individual series inside tables (and get them as dataframes or series, in one line of code). Has something similar already been developed? Would it be useful to anyone else?
Requirements
This is a rough sketch, so if anyone has better ideas, please add to it. Briefly, I would like to:
- Be able to define Collections of Datapackages (a list of paths or urls, basically)
- Query for tables (description/metadata) and fields (description/metadata)
- Get the actual table (as a pandas.DataFrame) or series (as a pandas.Series).
Initial proposal
Put simply, I had in mind something like the class sketched below.
```python
class DpsManager:
    '''Manager of datapackages'''

    # A collection is just a named list of datapackage descriptor paths or URLs.
    collections = {
        'brazil_datapackages': ['./brazils_geografy.json', './brazils_economy.json'],
    }

    def query_collection_for_table(self, collection, query_parameter):
        ...  # return metadata of the tables matching the query parameters

    def query_collection_for_fields(self, collection, query_parameter):
        ...  # return metadata of the fields matching the query parameters

    def get_table(self, resource_name, table_name):
        ...  # return the table as a pandas.DataFrame

    def get_series(self, resource_name, table_name, field_name):
        ...  # return the field as a pandas.Series
```
Further, I would like to store 'collections' in permanent storage.
I would also like to be able to add datapackages to a collection and remove them from it.
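For the permanent storage part, I imagine something as simple as a JSON file holding the collections mapping. A minimal sketch of what I mean is below; the file name collections.json and the helper names are just placeholders.

```python
import json
from pathlib import Path

COLLECTIONS_FILE = Path('collections.json')  # placeholder storage location

def load_collections():
    """Load the collections mapping from disk, or start with an empty one."""
    if COLLECTIONS_FILE.exists():
        return json.loads(COLLECTIONS_FILE.read_text())
    return {}

def save_collections(collections):
    """Persist the collections mapping back to disk."""
    COLLECTIONS_FILE.write_text(json.dumps(collections, indent=2))

def add_datapackage(collections, collection, path_or_url):
    """Register a datapackage descriptor (local path or URL) under a collection."""
    collections.setdefault(collection, [])
    if path_or_url not in collections[collection]:
        collections[collection].append(path_or_url)

def remove_datapackage(collections, collection, path_or_url):
    """Drop a datapackage descriptor from a collection, if present."""
    if path_or_url in collections.get(collection, []):
        collections[collection].remove(path_or_url)
```

A plain JSON file would keep the 'map of datapackages' human-editable and easy to share alongside the data.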
Great to hear from you and to hear about the workflow - it is always really useful to hear about what people are doing and want to do "in the wild".
I assume you are familiar with the Data Package Python module (datapackage-py).
What would you need added to that? Is it the ability to scan across lots of data packages at once?
Exactly that ability.
I delved deeper into datapackage-py after I wrote the post, and I slightly changed my approach. Right now, I am trying the following:
- Add all my tabular data (in Excel files as well as in HDFStores) to one datapackage.json.
- Each resource is uniquely identified by a path (HDFStore file or Excel file) AND a key (table key in the HDFStore or worksheet name in the Excel file).
- Query the datapackage.descriptor['resources'] dict for a given query_param (string).
So, when I search for 'brazil', the query returns all resources that contain the word 'brazil' in any value of a given set of their keys.
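Concretely, the query step looks roughly like the sketch below. I am reading the descriptor with plain json here rather than through datapackage-py, and the set of keys I scan is just an illustrative choice.

```python
import json

SEARCH_KEYS = ('name', 'title', 'description', 'path')  # illustrative choice of keys to scan

def query_resources(descriptor_path, query_param):
    """Return the resource descriptors whose searched keys contain query_param."""
    with open(descriptor_path) as f:
        descriptor = json.load(f)
    query = query_param.lower()
    return [
        resource
        for resource in descriptor.get('resources', [])
        if any(query in str(resource.get(key, '')).lower() for key in SEARCH_KEYS)
    ]

# e.g. query_resources('datapackage.json', 'brazil')
```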
My intention with this post is to discuss which approach would be best. By best, I mean one that would:
- Be in greater accordance with the Frictionless Data specifications.
- Integrate better with datapackage-py.
Despite the approach I am currently trying, I still think that creating a collection of datapackages and scanning through all of them would be the 'best' one.
I imagine a use case where I am starting a specific study: I collect datapackages relevant to my study from across different sources (based solely on more superficial metadata), add them all to this study's Collection, and then start a deeper examination of the data in it.
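In code, scanning such a Collection could look roughly like the sketch below. It assumes datapackage-py's Package class and simply walks each resource descriptor; the collection dict reuses the illustrative paths from my first post, and the matching logic is deliberately naive.

```python
from datapackage import Package  # datapackage-py

collection = {
    'my_study': ['./brazils_geografy.json', './brazils_economy.json'],
}

def scan_collection(collection, name, query):
    """Yield (package, resource descriptor) pairs whose metadata mentions `query`."""
    query = query.lower()
    for descriptor_path in collection[name]:
        package = Package(descriptor_path)
        for resource in package.resources:
            metadata = resource.descriptor
            haystack = ' '.join(str(value) for value in metadata.values()).lower()
            if query in haystack:
                yield package, metadata
```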
@pedrovgp having thought more about this, I also think you may be looking for a structure to list data packages. Something like a small 'Data Package Catalog' spec.
We have thought about this before in this spec issue: Discussion of Catalogs re Data Packages · Issue #37 · frictionlessdata/specs · GitHub
We don't yet have a decided spec, but you may want to chime in there.