Is it worth mentioning here the approaches taken to validate the score?
I.e., in both the Index and the Barometer, each score is reviewed by at least one independent reviewer.
I believe that in the 2015 Index you had both country and domain reviewers, right? So the original crowdsourced score was cross-checked once by someone looking at the country as a whole, and again by someone comparing it to other datasets of the same kind.
This is one approach to improving the validity of a score and addressing the potential for coder bias.
In the two editions of the Barometer I worked on, we did check for outliers by running some summary statistics, and we then double-checked the flagged scores and their justifications.
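For illustration, a minimal sketch of that kind of outlier check (the country data, the score scale, and the 2-standard-deviation threshold are assumptions of mine, not the Barometer's actual procedure):

    from statistics import mean, stdev

    # Hypothetical crowdsourced scores for a single question, one per country.
    scores = {"Kenya": 6, "Brazil": 7, "Ghana": 5, "France": 9,
              "Chile": 7, "Nepal": 6, "Peru": 0}

    mu = mean(scores.values())
    sigma = stdev(scores.values())

    # Flag any score more than 2 standard deviations from the mean, so the
    # score and its written justification can be double-checked by hand.
    flagged = {c: s for c, s in scores.items() if abs(s - mu) > 2 * sigma}
    print(flagged)  # {'Peru': 0}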
An alternative, more statistical approach would be to collect scores from multiple independent scorers and average them out. This is the approach taken, for example, in World Economic Forum surveys. However, it is (a) very costly; and (b) does not guarantee that all the coders interpreted the question in the same way, and it remains open to a range of other possible biases.
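To make that contrast concrete, here is a sketch of the multi-scorer approach under assumptions of my own (three scorers per question, and a spread threshold for flagging disagreement): scores are averaged, but the spread between scorers is also reported, since a large spread is exactly the unequal-interpretation problem that averaging alone cannot fix.

    from statistics import mean, stdev

    # Hypothetical: three independent scorers rate the same question per country.
    ratings = {
        "Kenya":  [6, 7, 6],
        "Brazil": [7, 7, 8],
        "Ghana":  [2, 8, 5],  # wide spread: coders likely read the question differently
    }

    for country, rs in ratings.items():
        avg, spread = mean(rs), stdev(rs)
        # A large spread suggests the coders did not interpret the question
        # the same way; averaging hides this unless the spread is surfaced.
        note = "  <- check interpretation" if spread > 2 else ""
        print(f"{country}: mean={avg:.1f}, spread={spread:.2f}{note}")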
Hope these observations are useful.