How would you recommend handlng these data modelling/structure/packaging issues in OpenSpending? I know it’s a few different issues that may be better treated as distinct topics, but they all relate to the shape of the data we’d import to OpenSpending so let’s try them together.
Budget, Adjusted Budget, Actual
Would you present Original Budget (ORGB), Adjusted Budget (AJDB), Pre-Audit Actual (PAUD), Audited Actual (AUDA) and FY+1 and +2 forecast values (IBY1, IBY2) values in the same dataset/datapackage? Would you show them as separate measures (amounts) for a given slice, or select them as different slices?
> select if.item_code,
ii.label,
if.amount_type_code,
sum(if.amount)
from incexp_facts if, incexp_items ii, government_functions f
where if.function_code = f.code
and if.item_code = ii.code
and if.demarcation_code = 'CPT'
and ii.code = '0200'
and if.period_length = 'year'
and financial_period = '2016'
group by if.item_code, ii.label, if.amount_type_code;
item_code | label | amount_type_code | sum
-----------+----------------+------------------+------------
0200 | Property Rates | ADJB | 7870033187
0200 | Property Rates | AUDA | 7838728035
0200 | Property Rates | IBY1 | 8374827917
0200 | Property Rates | IBY2 | 9011478508
0200 | Property Rates | ORGB | 7794277116
0200 | Property Rates | PAUD | 7838728035
Annual and monthly data
Would you include amounts of different period-lengths in the same dataset? They’re inherently different types of amounts - monthly values sometimes have budget values, usually have actual values, but never have audited or long term forecast values.
> select if.item_code,
ii.label,
if.amount_type_code,
if.period_length,
if.financial_period,
if.financial_year,
sum(if.amount)
from incexp_facts if, incexp_items ii, government_functions f
where if.function_code = f.code
and if.item_code = ii.code
and if.demarcation_code = 'CPT'
and ii.code = '0200'
and if.period_length in ('year', 'month')
and if.financial_period in (2016, 1, 2)
and financial_year = '2016'
group by if.item_code, ii.label, if.amount_type_code, if.period_length, if.financial_period, if.financial_year;
item_code | label | amount_type_code | period_length | financial_period | financial_year | sum
-----------+----------------+------------------+---------------+------------------+----------------+------------
0200 | Property Rates | ACT | month | 1 | 2016 | 588206492
0200 | Property Rates | ACT | month | 2 | 2016 | 671143669
0200 | Property Rates | ADJB | year | 2016 | 2016 | 7870033187
0200 | Property Rates | AUDA | year | 2016 | 2016 | 7838728035
0200 | Property Rates | IBY1 | year | 2016 | 2016 | 8374827917
0200 | Property Rates | IBY2 | year | 2016 | 2016 | 9011478508
0200 | Property Rates | ORGB | year | 2016 | 2016 | 7794277116
0200 | Property Rates | PAUD | year | 2016 | 2016 | 7838728035
Subtotals, Totals, Revenue and Expenditure
The data we get is structured as financial statements. It includes coded subtotals and totals, and sometimes subtracts some items from others.
Would you include such subtotals and totals? This makes it dangerous for users to blindly aggregate measures/amounts unless they explicitly exclude subtotals. On the other hand it adds important meaning to the data - but perhaps this is something that should be done as a presentation aspect when it’s relevant?
Some rows in an Income and Expenditure statement (dataset) are obviously revenue, and some are spending. Would you ensure spending values are negative when they’re included alongside revenue data? If one doesn’t, it’s again easy to calculate meaningless and misleading aggregates.
> select if.item_code,
ii.label,
ii.return_form_structure,
substring(composition from 0 for 30),
sum(if.amount)
from incexp_facts if, incexp_items ii, government_functions f
where if.function_code = f.code
and if.item_code = ii.code
and if.demarcation_code = 'CPT'
and if.period_length = 'year'
and financial_period = '2016'
group by if.item_code, ii.label, ii.return_form_structure, composition, position_in_return_form
order by ii.position_in_return_form;
item_code | label | return_form_structure | substring | sum
-----------+--------------------------------------------------------------+-----------------------+-------------------------------+--------------
0100 | OPERATING REVENUE | heading | | 0
0200 | Property Rates | line_item | | 48728072798
0300 | Property Rates - Penalties And Collection Charges | line_item | | 0
0400 | Service Charges | line_item | | 109110328174
0700 | Rent Of Facilities And Equipment | line_item | | 2252509443
0800 | Interest Earned - External Investments | line_item | | 2828602739
1000 | Interest Earned - Outstanding Debtors | line_item | | 1420935076
1100 | Dividends Received | line_item | | 0
1300 | Fines | line_item | | 6258309694
1400 | Licenses and Permits | line_item | | 248701070
1500 | Agency Services | line_item | | 1008203709
1600 | Transfers Recognised - Operating | line_item | | 22555543024
1610 | Transfers Recognised - Capital | line_item | | 13808907499
1700 | Other Revenue | line_item | | 15343400414
1800 | Gain On Disposal Of Property, Plant & Equipment | line_item | | 582458170
1900 | Total Operating Revenue Generated | subtotal | sum(0200, 0300, 0400, 0700, 0 | 224145971810
2000 | Less Revenue Foregone | line_item | | 7509865008
2100 | Total Direct Operating Revenue | subtotal | 1900 - 2000 or 1900 + 2000 de | 216636106802
Headings
We also have coded headings, as part of the structure of the originally submitted, and archived data, as shown in the subtotals example above. Would you include that or is that again purely a presentation aspect that you’d clean from the data before including in OpenSpending?
Including it in our Municipal Money/Data portal as a dimension makes it easy to programmatically produce structured views for various datasets (income and expenditure versus balance sheet) Municipal Money Data - Cash Flow while we’d alternatively have to exclude the data from the API and document it so that each API user implements it themselves as needed.
Denormalisation/flattening
It seems that we need to denormalise our data to import into OpenSpending. The latest Income and Expenditure snapshot including only codes, no labels, is around 300MB for a few years of annual budget and actual data, as well as monthly actuals. Denormalising that with government function and line item labels would make it quite a lot larger. Is that something OpenSpending would cope with? Should I just not care about it and leave it to OpenSpending Packager to figure it out? It seems I can use datapackage-pipelines to do the join with label CSVs for flattening - do I understand correctly?
National and local government data
We don’t have an example of this yet for fiscal data but it’s quite relevant. Would you include different levels (or spheres in South Africa) of government in the same dataset/datapackage? Eg. add another dimension - geo_level - so that demarcation_code can contain the national code, the province codes, and municipal codes, separated by different geo_level codes.