I work a lot with pandas, munging and merging data and saving the results to new xlsx, csv, or hdf files as needed. A common workflow for me is to load some tables (sometimes resources, then tables) into dataframes, work with the data, and store the end product in other table(s) and resource(s).
I was planning to design a simple Python module with a map of all the datapackages I use (locally and remotely) that would let me query for tables inside resources and for individual series inside tables (and get them as dataframes or series, in one line of code). Has something similar already been developed? Would it be useful to anyone else?
This is a rough sketch, so if anyone has better ideas, please add to them. Briefly, I would like to:
- Be able to define Collections of Datapackages (a list of paths or urls, basically)
- Query for tables (description/metadata) and fields (description/metadata)
- Get the actual table (as a pandas.DataFrame) or series (as a pandas.Series).
Put simply, I thought of something like what's written below.
'''Manager of datapackages'''
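To make the idea concrete, here is a minimal sketch of what such a manager could look like. All the names (`Collection`, `get_table`, `get_series`, etc.) are hypothetical; a real implementation would parse each datapackage's `datapackage.json` descriptor rather than the in-memory stand-in used here.

```python
import pandas as pd


class Collection:
    '''Manager of datapackages: holds a list of paths/urls and lets you
    query resources and pull tables as DataFrames in one line.'''

    def __init__(self, sources):
        # sources: list of local paths or urls of datapackages
        self.sources = list(sources)
        # Stand-in metadata map; a real version would be populated by
        # reading each datapackage.json descriptor.
        self._tables = {}

    def register_table(self, name, records, metadata=None):
        # Stand-in for loading a resource from a datapackage.
        self._tables[name] = (records, metadata or {})

    def tables(self):
        '''Query available tables and their metadata.'''
        return {name: meta for name, (_, meta) in self._tables.items()}

    def get_table(self, name):
        '''Get the actual table as a pandas.DataFrame.'''
        records, _ = self._tables[name]
        return pd.DataFrame(records)

    def get_series(self, table, field):
        '''Get an individual field of a table as a pandas.Series.'''
        return self.get_table(table)[field]


# Hypothetical usage:
col = Collection(["./data/gdp", "https://example.org/pkg"])
col.register_table("gdp", [{"year": 2020, "value": 1.0}], {"title": "GDP"})
df = col.get_table("gdp")        # pandas.DataFrame
s = col.get_series("gdp", "value")  # pandas.Series
```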
Further, I would like to store 'collections' in permanent storage.
I would also like to add/remove datapackages to/from collections.
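The persistence part could be as simple as serializing the list of datapackage paths/urls to JSON. This is only a sketch under that assumption; the file name, structure, and helper names are all made up, not an existing format.

```python
import json
from pathlib import Path


def save_collection(sources, path):
    # Persist the collection as a small JSON document.
    Path(path).write_text(json.dumps({"datapackages": sorted(set(sources))}))


def load_collection(path):
    # Restore the list of datapackage paths/urls from disk.
    return json.loads(Path(path).read_text())["datapackages"]


def add_datapackage(sources, src):
    # Add a datapackage path/url, ignoring duplicates.
    return sources + [src] if src not in sources else sources


def remove_datapackage(sources, src):
    # Drop a datapackage path/url from the collection.
    return [s for s in sources if s != src]
```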