Entity reconciliation services


#1

There has been recent discussion around OpenSpending and OpenTrials (two projects at Open Knowledge International) about the need for a solid, well-featured entity reconciliation service.

The service would help applications that depend on reference data, from country lists and company lists to budget classifications. Examples of inputs would be messy source data about party donations, procurement awards, or medicine names.

The service would support de-duplication and re-classification of source data dimensions against the canonical reference data, and it would allow the construction of canonical lists from messy source data.

Such a service would be generally useful to the wider open data community, and in initial discussion between Friedrich Lindenberg, Mark Brough and Paul Walsh, we came to some shared understanding of what a service might look like at a high level.

To learn more about how others have approached this problem, we’re putting out a call: we are looking for existing work to build on, in particular open source tools for reference data. Is there open source code out there that meets many or all of our criteria? If no existing solution can be found, we will hack on Nomenklatura (https://github.com/pudo/nomenklatura) to push it in this direction.

Features:

• Reconciliation endpoints for particular "collections"
  • Geographical
  • Budget taxonomies
  • Companies
• Namespacing of data
  • "collections" is a type of namespacing
  • but collections need (?) additional context, such as geographical context for company names
• Distinct reconciliation strategies (possibly exposed as distinct methods of the API)
  • Fuzzy, cross field matching
  • Primary identifier matching
  • Other?
• Read and write against "collections"
• Create the code list based on the data being reconciled ("get or create")
• Confidence level for matches
• Some control over confidence level ("give me the first match over 80% confidence")
• Hook into an array of data stores to match against, possibly mapped to "collections"
  • Web services (example: OpenCorporates)
  • CSV (hosted somewhere)
  • Other databases (connection with credentials)?
• Make higher level abstractions out of multiple data sources
  • Example: automate the creation of a geo lookup service by mapping ocd division ids (https://github.com/opencivicdata/ocd-division-ids) onto data from geonames (??)
• Simple, modern web client for user-driven reconciliation of data
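
To make a few of these features concrete, here is a minimal sketch of how primary identifier matching, fuzzy matching with a confidence threshold, and "get or create" could fit together. It is not taken from Nomenklatura or any other existing tool: the `Collection` and `Match` classes, their method names, and the ID scheme are hypothetical, and the similarity score comes from Python's standard `difflib`, which is only a stand-in for a real matching strategy. It also returns the best match over the threshold rather than the first one.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


@dataclass
class Match:
    entity_id: str
    name: str
    confidence: float  # similarity score between 0.0 and 1.0


@dataclass
class Collection:
    """Hypothetical in-memory namespace of canonical entities (id -> name)."""
    name: str
    entities: Dict[str, str] = field(default_factory=dict)

    def match_by_id(self, entity_id: str) -> Optional[Match]:
        """Primary identifier matching: an exact hit gets full confidence."""
        if entity_id in self.entities:
            return Match(entity_id, self.entities[entity_id], 1.0)
        return None

    def match_fuzzy(self, query: str, min_confidence: float = 0.8) -> Optional[Match]:
        """Return the best-scoring entity at or above min_confidence, else None."""
        best = None
        wanted = normalize(query)
        for entity_id, name in self.entities.items():
            score = SequenceMatcher(None, wanted, normalize(name)).ratio()
            if score >= min_confidence and (best is None or score > best.confidence):
                best = Match(entity_id, name, score)
        return best

    def get_or_create(self, query: str, min_confidence: float = 0.8) -> Match:
        """Reuse a confident match, otherwise add the query as a new entity."""
        match = self.match_fuzzy(query, min_confidence)
        if match is not None:
            return match
        # Naive ID scheme for illustration only; a real service would need
        # collision-free identifiers.
        entity_id = f"{self.name}:{len(self.entities) + 1}"
        self.entities[entity_id] = query
        return Match(entity_id, query, 1.0)


# Example usage with made-up data.
companies = Collection("companies", {"gb:001": "Acme Holdings Ltd"})
print(companies.match_fuzzy("ACME Holdings Ltd."))
print(companies.get_or_create("Globex Corporation"))
```

Exposing each strategy as its own method mirrors the idea of distinct API methods per reconciliation strategy, and the `min_confidence` argument is one way to give callers control over the confidence level.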

#2

Just to say I think this is great.

I know @pudo is something of an expert here, so I will defer to him a bit in terms of existing work, but my 2c is that I haven’t seen anything great that is open source beyond nomenklatura and reconcile-csv.


#3

Very interested in any developments on this. Have been using OpenRefine until now, which is a good start, but can still take a while to cleanse relatively small datasets.


#4

Paul … as per my email (figured out I did have a login after all!)

Not sure if it’s relevant or not, but I stumbled upon this repo recently.

It seems Christina Harlow (https://github.com/cmh2166) has done a lot of work in this space … so it might be worth piggy-backing on her efforts or at least making contact for her input.

Looking under her repo, there appear to be several recently maintained Python-based tools …

Rgds.
Colum


#5

I sent a bunch of feedback on this to the OKFN Labs email list. The thread can be found here in the archive: https://lists.okfn.org/pipermail/okfn-labs/2016-January/thread.html

In the context of OpenTrials (https://github.com/opentrials/opentrials/issues/8) I would have thought that something purpose-built for matching on multiple fields, like https://github.com/datamade/dedupe, would be a better fit.

Tom