Based on our discussions and the general scope of the work, I think we should proceed by first defining the scope of a spike solution, and iterating out from that.
A good spike solution should enable us to get our data structure solid (OSEP-04) around a few ideal use cases, and to demonstrate some very basic (but essential) value to the end user.
It might also be possible to have a small team working on the spike solution while others concentrate on more general architectural and developer/user experience issues.
A proposed spike solution
Here is a really basic (yet essential) flow that a spike solution could aim to support:
- User has a single CSV file of spend data
- User interacts with Web UI to model this CSV (create an Open Spending Data Package)
- User uploads the (valid) Open Spending Data Package
- User can navigate to the data package directory (each data package would have an index.html added to it, providing a formatted view of the metadata/data and links to the raw data sources, as a minimal API)
- When the User's Data Package is uploaded, an aggregation task runs on the package (we need to define the most basic aggregation task on spend data)
- Once aggregation has completed on a Data Package, links are provided to the aggregated sources (also as CSV) and to a simple visualisation over them (these links could be provided via the index.html of the data package)
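As a sketch of what "the most basic aggregation task on spend data" might be, here is a minimal sum-per-year over a spend CSV. The column names `date` and `amount` are assumptions for illustration; the real mapping would come from the package's OSEP-04 model.

```python
import csv
import io
from collections import defaultdict

def aggregate_spend(csv_text):
    """Sum the `amount` column per year of the `date` column.

    Column names are assumed here; the actual OSEP-04 mapping
    would tell us which fields carry dates and amounts.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = row["date"][:4]  # assumes ISO dates, e.g. 2014-03-01
        totals[year] += float(row["amount"])
    return dict(totals)

sample = "date,amount\n2013-01-10,100.0\n2013-06-02,50.5\n2014-02-20,75.0\n"
print(aggregate_spend(sample))  # {'2013': 150.5, '2014': 75.0}
```

The output of a task like this is itself just another CSV resource, which keeps the "aggregates as raw files plus links" idea above very cheap to implement.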
This solution could be completed without any user auth/z service, but it wouldn't be publicly usable (even for real user testing). So we'd need to consider an auth/z microservice, which could be developed in parallel, or simply use OAuth via Google or similar for now, just for the spike solution.
Also, some type of task queue would be needed (at least, a way to trigger the aggregation service when a new data package is uploaded). Even if this were mocked for the spike solution, it is another area that could be developed in parallel (and indeed, it is critical for the microservice approach generally).
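Even the mocked version of that trigger can be sketched with a simple in-process queue; the event shape and the `aggregate` hook below are invented for illustration, standing in for S3 notifications or a real pub/sub service.

```python
import queue
import threading

# Mocked "upload events": in the real system these would come
# from S3 notifications or the pub/sub service.
uploads = queue.Queue()
processed = []

def aggregate(package_name):
    # Stand-in for calling the aggregation microservice.
    processed.append(package_name)

def worker():
    while True:
        package = uploads.get()
        if package is None:  # sentinel to stop the worker
            break
        aggregate(package)

t = threading.Thread(target=worker)
t.start()
uploads.put("budget-2014.zip")
uploads.put("spend-q1.zip")
uploads.put(None)
t.join()
print(processed)  # ['budget-2014.zip', 'spend-q1.zip']
```

Swapping this queue for a hosted pub/sub service later should not change the shape of the consumer at all, which is the point of mocking it for the spike.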
So, the components of this solution would be:
- UI to model and load data (note that a CLI POC to load data has been developed, and would form the basis of a UI)
- S3 (or similar) backend to store data packages
- Microservice to aggregate data packages when they hit S3
- Port part of openspendingjs (treemap?) to work with the new data package aggregates
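For the modelling component, it may help to pin down what a minimal Open Spending Data Package descriptor could look like. The field names below are a guess for illustration only; the actual required fields are exactly what OSEP-04 needs to define.

```python
# Guessed minimal shape of an Open Spending Data Package descriptor;
# OSEP-04 would define the real required fields and types.
descriptor = {
    "name": "example-spend",
    "resources": [
        {
            "path": "data/spend.csv",
            "schema": {
                "fields": [
                    {"name": "date", "type": "date"},
                    {"name": "amount", "type": "number"},
                ]
            },
        }
    ],
}

def is_valid(descriptor):
    """Very loose structural check: a name plus at least one resource."""
    return bool(descriptor.get("name")) and len(descriptor.get("resources", [])) > 0

print(is_valid(descriptor))  # True
```

A check like this is what the UI (or the existing CLI POC) would run before allowing the upload step in the flow above.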
And in parallel, to be either integrated directly into the spike solution or added after it:
- Auth/z service
- Pub/sub / task queue service that would eventually bind all Open Spending microservices together
- Work out fine details of OpenSpending Data Package
- Get an idea of how/what type of APIs Open Spending will be able to offer over raw CSV files (and therefore start to spec out use cases for OLAP, arbitrary queries, etc.)
- Have a basis on which to plan migration of existing data to the new system
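To get a feel for the kind of API we could offer over raw CSV files, here is a sketch of an arbitrary-filter query with an optional aggregate. The query shape (`where`, `sum_field`) and the column names are assumptions, not a spec, but a few examples like this would help scope the OLAP/arbitrary-query use cases.

```python
import csv
import io

def query(csv_text, where=None, sum_field=None):
    """Filter rows by exact-match criteria and optionally sum a field.

    A stand-in for the kind of endpoint we might expose over raw
    CSV resources; the query shape here is invented for the sketch.
    """
    rows = [
        r for r in csv.DictReader(io.StringIO(csv_text))
        if all(r.get(k) == v for k, v in (where or {}).items())
    ]
    if sum_field:
        return sum(float(r[sum_field]) for r in rows)
    return rows

sample = "dept,amount\nhealth,10\nhealth,20\neducation,5\n"
print(query(sample, where={"dept": "health"}, sum_field="amount"))  # 30.0
```

Anything this simple can be served statically or near-statically, which gives us a baseline to compare against a real OLAP backend when speccing those use cases.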