Next MVP: CSV to Data Package to Aggregate to Visualization

[
  {
    "columns": ["func1"],
    "year": "2006",
    "aggregate": "approved",
    "where": "transfer = 'Excluding transfers' AND Econ_func = 'Function'",
    "file": "by-func1-2006",
    "format": "csv"
  }
]

In the recent tech hangout, there was much concern over building a simple aggregation spec that used raw SQL to define aggregates. I was also concerned by some of the redundancy inherent in that approach (e.g. “SELECT a,b,c … GROUP BY a,b,c”). While trying to build a script that imports fully normalized data packages and runs aggregates on the data, I of course ran into the problem of needing to join lots of tables.

It turns out that when using SQLAlchemy, it is easier to use a more structured query specification to define these aggregates while also eliminating the redundant GROUP BYs. The above is what I came up with while working on a particular dataset that happened to have a year column and needed special conditions (e.g. “transfer = ‘Excluding transfers’ AND Econ_func = ‘Function’”) to generate a coherent aggregate.
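As a rough illustration of the idea, here is a minimal sketch of how a spec like the one above could compile down to a single aggregate query. It uses plain string-building and stdlib sqlite3 for brevity rather than SQLAlchemy, and the `build_query` function and the sample table are hypothetical, not the actual implementation. The point is that the grouping columns are listed once in the spec, and both the SELECT list and the GROUP BY clause are derived from them:

```python
import sqlite3

def build_query(spec, table="data"):
    """Compile a structured aggregate spec into SQL.

    The spec's "columns" list is used for both SELECT and GROUP BY,
    so the redundancy of raw SQL ("SELECT a,b,c ... GROUP BY a,b,c")
    disappears from the spec itself.
    """
    cols = ", ".join(spec["columns"])
    agg = spec["aggregate"]
    sql = f"SELECT {cols}, SUM({agg}) AS {agg} FROM {table}"
    if spec.get("where"):
        sql += f" WHERE {spec['where']}"
    sql += f" GROUP BY {cols}"
    return sql

spec = {
    "columns": ["func1"],
    "aggregate": "approved",
    "where": "transfer = 'Excluding transfers' AND Econ_func = 'Function'",
}
sql = build_query(spec)
print(sql)

# Sanity-check against an in-memory table with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (func1 TEXT, approved REAL, transfer TEXT, Econ_func TEXT)"
)
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", [
    ("Defence", 10.0, "Excluding transfers", "Function"),
    ("Defence", 5.0, "Excluding transfers", "Function"),
    ("Health", 7.0, "Excluding transfers", "Function"),
    ("Health", 99.0, "Including transfers", "Function"),  # filtered out
])
print(conn.execute(sql).fetchall())
```

In SQLAlchemy the same derivation would happen with `select(...).group_by(...)` built from the column list, but the structural point is identical: the spec names each column once.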

What do you think? Should the “aggregate” key be renamed? How should I handle specific, uncommon conditions without creating a brand-new query language?