Copyright on data sources


The data I am looking at (historical house prices) is part of a personal website accompanying a book published earlier.

What about IP / copyright? I guess I need to check with the author and his publisher whether I can copy his dataset (regularly) and publish it slightly modified on

Do you have templates for this? I guess if we include a link to his original data, it is in the author's interest to allow this.
What’s the best approach?


This is a great question. You should definitely try to work out how the data is licensed (if at all), and especially whether it is open data. You should then record this info in a License section of the README and use it, where relevant, to inform the license choice for the data package.

As a nice example you can see


The open data definition unfortunately does not even touch copyright questions, and seems to mainly refer to the packages after publishing them on okfn, not the source data.

As the data that would be the source for historic housing prices is not marked as ‘GPL’ or free to use, it is copyright protected by default - like most sources will be.

The source does not mention GPL or free to use, so we should close this, correct?


@andreas there is an important subtlety here in that there are 2 “layers” of potential rights:

  1. there are (possibly) rights in the source data (but there may well not be due to the size of the dataset etc).
  2. there are (possibly) separate (database) rights in the collection you create

The question of whether there are rights is a question about “rights” in databases and is reasonably complex - see for much more on this.

My point here is that you can apply a license (e.g. PDDL) to the database, licensing any rights arising from (2), and note in the License section of the README that this is the license you have applied on the assumption that there are no rights arising from (1). If there are, then obviously you are in no position to license them, and in that case the selected license will not apply to the dataset as a whole but only to the rights you have as data packager.


Yes, that is what I was referring to.

With regard to the information source: when you refer to the size of datasets, for example, can you elaborate on that? My understanding so far was that if creating something requires significant effort, it is automatically copyright protected. So the size of the dataset does not matter as much as the complexity of generating it.

Am I mistaken?
Is there a list, reference, rule of thumb, something to read when copyrights might NOT exist?


@andreas you are right that it is not simple size but often some other measure of effort or originality that matters. Given that we are not lawyers (and even if we were we probably wouldn’t know for sure) I’d suggest we choose for the license field the license we can apply (and would choose) and note very clearly in the README if we are unsure about the underlying rights in the data (e.g. there is no clear license and there may be rights etc).

There are some obvious data sources (e.g. US Federal data) where we can be pretty confident data is in the public domain and can note that in the license section of the README (general point: explain our view in the README so users are clear on where they stand).
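To make the suggestion concrete, here is a minimal sketch of how the license choice and the caveat about the underlying rights could be recorded. It assumes the descriptor uses a `licenses` list with `name`, `path` and `title` keys and a `sources` list; the package name, the source title, the example.org URL and the wording of the note are hypothetical placeholders, not taken from the actual dataset.

```python
import json

# Hypothetical wording; adapt it to what you actually know about the source.
RIGHTS_NOTE = (
    "The source publishes no explicit license. The license above covers only "
    "the rights we hold as packagers; rights may exist in the underlying data."
)


def build_descriptor(name, source_title, source_path):
    """Return a minimal data package descriptor as a plain dict."""
    return {
        "name": name,
        # License applied to the rights arising from packaging (layer 2).
        "licenses": [
            {
                "name": "ODC-PDDL-1.0",
                "path": "http://opendatacommons.org/licenses/pddl/",
                "title": "Open Data Commons Public Domain Dedication and License v1.0",
            }
        ],
        # Link back to the original data (layer 1).
        "sources": [{"title": source_title, "path": source_path}],
    }


def readme_license_section(descriptor, rights_note):
    """Render a License section for the README, including the rights caveat."""
    lic = descriptor["licenses"][0]
    lines = [
        "## License",
        "",
        f"Packaging is licensed under the {lic['title']} ({lic['path']}).",
        "",
        rights_note,
    ]
    return "\n".join(lines)


# Placeholder values for illustration only.
descriptor = build_descriptor(
    "house-prices", "Original author's website", "https://example.org/original-data"
)
section = readme_license_section(descriptor, RIGHTS_NOTE)

print(json.dumps(descriptor, indent=2))
print(section)
```

The point of keeping the note next to the license is that users can see at a glance which rights the license covers and which remain uncertain.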


When it’s mentioned, we can be sure, agreed. In the other case I don’t follow you, because just mentioning that the copyright is not clear won’t release you from anything when you publish the data.

I’ll see if I can find a dataset that states that it is GPL / open.