Data modelling/structure decisions


#1

How would you recommend handlng these data modelling/structure/packaging issues in OpenSpending? I know it’s a few different issues that may be better treated as distinct topics, but they all relate to the shape of the data we’d import to OpenSpending so let’s try them together.

Budget, Adjusted Budget, Actual

Would you present Original Budget (ORGB), Adjusted Budget (AJDB), Pre-Audit Actual (PAUD), Audited Actual (AUDA) and FY+1 and +2 forecast values (IBY1, IBY2) values in the same dataset/datapackage? Would you show them as separate measures (amounts) for a given slice, or select them as different slices?

> select if.item_code,
         ii.label,
         if.amount_type_code,
         sum(if.amount)
from incexp_facts if, incexp_items ii, government_functions f
where if.function_code = f.code
  and if.item_code = ii.code
  and if.demarcation_code = 'CPT'
  and ii.code = '0200'
  and if.period_length = 'year'
  and financial_period = '2016'
 group by if.item_code, ii.label, if.amount_type_code;

 item_code |     label      | amount_type_code |    sum     
-----------+----------------+------------------+------------
 0200      | Property Rates | ADJB             | 7870033187
 0200      | Property Rates | AUDA             | 7838728035
 0200      | Property Rates | IBY1             | 8374827917
 0200      | Property Rates | IBY2             | 9011478508
 0200      | Property Rates | ORGB             | 7794277116
 0200      | Property Rates | PAUD             | 7838728035

Annual and monthly data

Would you include amounts of different period-lengths in the same dataset? They’re inherently different types of amounts - monthly values sometimes have budget values, usually have actual values, but never have audited or long term forecast values.

> select if.item_code,
   ii.label,
   if.amount_type_code,
   if.period_length,
   if.financial_period,
   if.financial_year,
   sum(if.amount)
from incexp_facts if, incexp_items ii, government_functions f
where if.function_code = f.code
  and if.item_code = ii.code
  and if.demarcation_code = 'CPT'
  and ii.code = '0200'
  and if.period_length in ('year', 'month')
  and if.financial_period in (2016, 1, 2)
  and financial_year = '2016'
 group by if.item_code, ii.label, if.amount_type_code, if.period_length, if.financial_period, if.financial_year;

 item_code |     label      | amount_type_code | period_length | financial_period | financial_year |    sum     
-----------+----------------+------------------+---------------+------------------+----------------+------------
 0200      | Property Rates | ACT              | month         |                1 |           2016 |  588206492
 0200      | Property Rates | ACT              | month         |                2 |           2016 |  671143669
 0200      | Property Rates | ADJB             | year          |             2016 |           2016 | 7870033187
 0200      | Property Rates | AUDA             | year          |             2016 |           2016 | 7838728035
 0200      | Property Rates | IBY1             | year          |             2016 |           2016 | 8374827917
 0200      | Property Rates | IBY2             | year          |             2016 |           2016 | 9011478508
 0200      | Property Rates | ORGB             | year          |             2016 |           2016 | 7794277116
 0200      | Property Rates | PAUD             | year          |             2016 |           2016 | 7838728035

Subtotals, Totals, Revenue and Expenditure

The data we get is structured as financial statements. It includes coded subtotals and totals, and sometimes subtracts some items from others.

Would you include such subtotals and totals? This makes it dangerous for users to blindly aggregate measures/amounts unless they explicitly exclude subtotals. On the other hand it adds important meaning to the data - but perhaps this is something that should be done as a presentation aspect when it’s relevant?

Some rows in an Income and Expenditure statement (dataset) are obviously revenue, and some are spending. Would you ensure spending values are negative when they’re included alongside revenue data? If one doesn’t, it’s again easy to calculate meaningless and misleading aggregates.

> select if.item_code,
         ii.label,
         ii.return_form_structure,
         substring(composition from 0 for 30),
         sum(if.amount)
from incexp_facts if, incexp_items ii, government_functions f
where if.function_code = f.code
  and if.item_code = ii.code
  and if.demarcation_code = 'CPT'
  and if.period_length = 'year'
  and financial_period = '2016'
group by if.item_code, ii.label, ii.return_form_structure, composition, position_in_return_form
order by ii.position_in_return_form;

item_code |                            label                             | return_form_structure |           substring           |     sum      
-----------+--------------------------------------------------------------+-----------------------+-------------------------------+--------------
 0100      | OPERATING REVENUE                                            | heading               |                               |            0
 0200      | Property Rates                                               | line_item             |                               |  48728072798
 0300      | Property Rates - Penalties And Collection Charges            | line_item             |                               |            0
 0400      | Service Charges                                              | line_item             |                               | 109110328174
 0700      | Rent Of Facilities And Equipment                             | line_item             |                               |   2252509443
 0800      | Interest Earned - External Investments                       | line_item             |                               |   2828602739
 1000      | Interest Earned - Outstanding Debtors                        | line_item             |                               |   1420935076
 1100      | Dividends Received                                           | line_item             |                               |            0
 1300      | Fines                                                        | line_item             |                               |   6258309694
 1400      | Licenses and Permits                                         | line_item             |                               |    248701070
 1500      | Agency Services                                              | line_item             |                               |   1008203709
 1600      | Transfers Recognised - Operating                             | line_item             |                               |  22555543024
 1610      | Transfers Recognised - Capital                               | line_item             |                               |  13808907499
 1700      | Other Revenue                                                | line_item             |                               |  15343400414
 1800      | Gain On Disposal Of Property, Plant & Equipment              | line_item             |                               |    582458170
 1900      | Total Operating Revenue Generated                            | subtotal              | sum(0200, 0300, 0400, 0700, 0 | 224145971810
 2000      | Less Revenue Foregone                                        | line_item             |                               |   7509865008
 2100      | Total Direct Operating Revenue                               | subtotal              | 1900 - 2000 or 1900 + 2000 de | 216636106802

Headings

We also have coded headings, as part of the structure of the originally submitted, and archived data, as shown in the subtotals example above. Would you include that or is that again purely a presentation aspect that you’d clean from the data before including in OpenSpending?

Including it in our Municipal Money/Data portal as a dimension makes it easy to programmatically produce structured views for various datasets (income and expenditure versus balance sheet) https://municipaldata.treasury.gov.za/table/cflow/?year=2015&municipalities=TSH&amountType=AUDA while we’d alternatively have to exclude the data from the API and document it so that each API user implements it themselves as needed.

Denormalisation/flattening

It seems that we need to denormalise our data to import into OpenSpending. The latest Income and Expenditure snapshot including only codes, no labels, is around 300MB for a few years of annual budget and actual data, as well as monthly actuals. Denormalising that with government function and line item labels would make it quite a lot larger. Is that something OpenSpending would cope with? Should I just not care about it and leave it to OpenSpending Packager to figure it out? It seems I can use datapackage-pipelines to do the join with label CSVs for flattening - do I understand correctly?

National and local government data

We don’t have an example of this yet for fiscal data but it’s quite relevant. Would you include different levels (or spheres in South Africa) of government in the same dataset/datapackage? Eg. add another dimension - geo_level - so that demarcation_code can contain the national code, the province codes, and municipal codes, separated by different geo_level codes.


#2

Hey there @jbothma!

As you know there are many different ways to model and structure a fiscal data file on OpenSpending, so the following answers are only my suggestions - and other solutions are possible.

Different phases: I would recommend creating a ‘Phase’ column with the name of the phase - and even better, two columns, one with a numeric phase id and another with the label. This is better (I believe) than having one measure per phase, as it allows more flexibility in visualisations and the API.

Annual and monthly data: From my understanding it feels that separate datasets are the way to go here - unless you see a use case to display both yearly and monthly data in the same chart / API results.

Subtotals etc.: My gut feeling is that you should remove them altogether - unless these contain information that cannot be extracted from the data. As for setting the correct sign for revenues when shown together with spending - yes, I would definitely do that.

Headings: I would take the heading and add it as a new column for each of its items, and not include it as a line of its own.

Denormalisation: I’m currently working on normalising the data while writing to the DB so that internally it is completely normalised. This is still WIP but will be done in the next few weeks. However, you’re right that atm the data needs to be denormalised when uploading to OpenSpending.
I know I might be biased, but using datapackage-pipelines for doing the upload to OpenSpending (and the denormialisation) would be the best way to do it :slight_smile:

National and local data: The answer to that really depends on the nature of the connection between the local and national governments. If there’s a strong connection (as in - you can connect budget items from the national and local spheres together), and local governments budget structures are similar enough, then sure thing. If the connections are weaker, then it’s something you need to think about and make your decision.

I hope that helped - either way I’m always at https://gitter.im/openspending/chat (as @adam.kariv) so please ping me there as well.


#3

Thanks for the thorough answer. It’s precisely opinions of more experienced people that I’m keen for so I really appreciate any feedback.

And well done getting OpenSpending Next up - it’s looking great!!

JD