There has been recent discussion around OpenSpending and OpenTrials (two projects at Open Knowledge International) on the need for a solid and well-featured entity reconciliation service.
The service would help applications that depend on reference data, from country and company lists to budget classifications. Typical inputs would be messy source data about party donations, procurement awards, or medicine names.
The service would support de-duplication and re-classification of source data dimensions against canonical reference data, and it would allow canonical lists to be constructed from messy source data.
Such a service would be generally useful to the wider open data community. In an initial discussion between Friedrich Lindenberg, Mark Brough and Paul Walsh, we came to a shared understanding of what such a service might look like at a high level.
To learn more about how others have approached this problem, we’re putting out a call: we are looking for existing work to build on, in particular open-source tools for reference data. Is there open-source code out there that meets many or all of our criteria? If no existing solution can be found, we will hack on Nomenklatura (https://github.com/pudo/nomenklatura) to push it in this direction.
Features:
• Reconciliation endpoints for particular "collections"
  • Geographical
  • Budget taxonomies
  • Companies
• Namespacing of data
  • "collections" are a type of namespacing
  • but collections may need (?) additional context, such as geographical context for company names
• Distinct reconciliation strategies (possibly exposed as distinct methods of the API)
  • Fuzzy, cross-field matching
  • Primary identifier matching
  • Other?
• Read and write against "collections"
  • Create the code list based on the data being reconciled ("get or create")
• Confidence level for matches
  • Some control over confidence level ("give me the first match over 80% confidence")
• Hook into an array of data stores to match against, possibly mapped to "collections"
  • Web services (example: OpenCorporates)
  • CSV (hosted somewhere)
  • Other databases (connection with credentials)?
• Make higher-level abstractions out of multiple data sources
  • Example: automate the creation of a geo lookup service by mapping OCD division IDs (https://github.com/opencivicdata/ocd-division-ids) onto data from GeoNames (??)
• Simple, modern web client for user-driven reconciliation of data
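
To make a few of these features more concrete, here is a minimal Python sketch of how a "collection" with two reconciliation strategies, a confidence threshold, and "get or create" might fit together. Everything in it is an assumption made for illustration: the `Collection` and `Match` names, the method names, the naive ID scheme, and the use of the standard library's `difflib.SequenceMatcher` as a stand-in for a real fuzzy matcher. It is not Nomenklatura's API, and it matches on a single name field only rather than across fields.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


@dataclass
class Match:
    entity_id: str
    name: str
    confidence: float  # between 0.0 and 1.0


@dataclass
class Collection:
    """Hypothetical namespaced list of canonical entities (id -> name)."""
    name: str
    entities: Dict[str, str] = field(default_factory=dict)

    def match_identifier(self, entity_id: str) -> Optional[Match]:
        """Primary identifier matching: exact lookup, full confidence."""
        if entity_id in self.entities:
            return Match(entity_id, self.entities[entity_id], 1.0)
        return None

    def match_fuzzy(self, label: str) -> List[Match]:
        """Fuzzy name matching; candidates are returned best-first."""
        query = normalize(label)
        candidates = [
            Match(eid, name,
                  SequenceMatcher(None, query, normalize(name)).ratio())
            for eid, name in self.entities.items()
        ]
        return sorted(candidates, key=lambda m: m.confidence, reverse=True)

    def first_match_over(self, label: str,
                         threshold: float = 0.8) -> Optional[Match]:
        """'Give me the first match over 80% confidence.'"""
        candidates = self.match_fuzzy(label)
        if candidates and candidates[0].confidence >= threshold:
            return candidates[0]
        return None

    def get_or_create(self, label: str, threshold: float = 0.8) -> Match:
        """Return a confident match, or add the label as a new entity."""
        match = self.first_match_over(label, threshold)
        if match is not None:
            return match
        # Naive ID scheme, for illustration only.
        entity_id = f"{self.name}:{len(self.entities) + 1}"
        canonical = " ".join(label.split())
        self.entities[entity_id] = canonical
        return Match(entity_id, canonical, 1.0)


# Usage: reconcile a misspelled country name against a tiny collection.
countries = Collection("countries", {"DE": "Germany", "FR": "France"})
print(countries.match_identifier("FR"))
print(countries.first_match_over("Germny", threshold=0.8))
```

In a real service each strategy would more likely be a separate API method, as suggested in the list above, and the in-memory dictionary would be replaced by one of the backing data stores (a web service, a hosted CSV, or a database).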