A (non-viral) copyleft/sharealike license for open research data

by Felix Schönbrodt & Roland Ramthun
The open availability of scientific material (such as research data, code, or other material) has often been identified as one cornerstone of a trustworthy, reproducible, and verifiable science. At the same time, such material is still scarce in practice (though on the rise).
To increase the availability of open scientific material, we propose a license for scientific research data whose reuse in turn increases the availability of other open scientific material. It borrows a mechanism from open source software development: the application of copyleft (or, in CC terminology, “sharealike”) licenses. These are so-called “sticky licenses” because they require that every reuse of the licensed material carries the same license. This means that if you reuse material under this license, your own product/derivative must also (a) be freely reusable and (b) use that license, so that any derivative of your product is free as well, ad infinitum.
The promise of such a “viral” license is that it can introduce more and more freedom into a system. It is supposed to be a strategy to reform the environment: The more artifacts have a copyleft license, the more likely it is that future products have the same license, until, in the end, everything is free.

Picture of a viral license by Phoebus87 (https://de.wikipedia.org/wiki/Datei:Symian_virus.png)

One criticism of such licenses stems from the definition of “freedom”: According to this point of view, the highest degree of freedom is reached if you can do anything with a material. This also includes commercial usage, which usually stays closed for competitive reasons, and integrating the material into a larger dataset which itself cannot be open because other parts of the data have restrictive licenses. We are not lawyers, but in our understanding this could, for example, also include restrictions due to privacy rights.
For example, imagine the compilation of an integrative database that includes both material from a copyleft source and material from another source that relates to individual persons and therefore cannot be openly shared due to privacy rights (but could be shared as a restricted scientific use file). At least in our understanding, a strict copyleft license would preclude reuse in such a restricted way. Hence, the copyleft license, although claiming to ensure freedom, precludes a lot of potential reuse scenarios. From this point of view, a so-called permissive license (such as CC0, MIT, or BSD) provides more freedom than a copyleft license (see, e.g., The Whys and Hows of Licensing Scientific Code).
We propose a system that addresses both points of view, with the goal of providing some stickiness of open scientific sharing while still allowing work with scientific material that requires restrictions, for example due to privacy rights.

The proposed copyleft license for open data: Open data requires open analysis code.

We suggest the following clause for the reuse of open research data:

Upon publication of any scientific work under a broad definition (including, but not limited to journal papers, books or book chapters, conference proceedings, blog posts) that is based in full or in part on this data set, all data analysis scripts involved in the creation of this work must be made openly available under a license that allows reuse (e.g., BSD or MIT).

(Of course, more topics must be addressed in the license, such as the obligation to properly cite the authors of the data set, not to try to reidentify research participants, etc. But we focus only on the copyleft aspect here.)
This system has some differences from traditional copyleft licenses.

First, the reuser usually has to share any derivative, which is often of the same category as the open material (typically: you reuse a piece of software, and have to share your own software product under an open license). In this proposal, you reuse open data, and have to share open analysis code. Hence, you support the openness of a community in another currency. Because derived data sets do not have to be published, it becomes possible to integrate open and closed data that are usually incompatible.
Second, it restricts the copyleft property to a certain type of reuse, namely the creation of scientific work. This ensures, on the one hand, that open knowledge grows and scientific claims are verifiable to a larger extent than before. On the other hand, commercial reuse is enabled; furthermore there might be non-scientific reuse scenarios that do not involve analysis code, where the clause is not applicable anyway. Finally, even the most restrictive data set (where you have to go to a repository operator and analyze the data on dedicated computers in a secure room) can generate open derivatives.
Third, the license is not sticky: The published open analysis code itself does not require a copyleft when it is reused. Instead it has a permissive license.
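To make the three differences concrete, here is a minimal sketch of how the clause could be read as a decision rule. It is only an illustration of our reading of the text above, not legal advice and not part of the license; the names `Reuse` and `obligations` and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reuse:
    """Hypothetical description of one reuse of a data set under the proposed clause."""
    is_scientific_work: bool     # e.g., journal paper, book chapter, proceedings, blog post
    is_published: bool           # the clause applies "upon publication"
    uses_analysis_scripts: bool  # some reuse scenarios involve no analysis code at all


def obligations(reuse: Reuse) -> List[str]:
    """Return what the reuser has to share under this simplified reading of the clause."""
    triggered = (
        reuse.is_scientific_work
        and reuse.is_published
        and reuse.uses_analysis_scripts
    )
    if triggered:
        # Only the analysis code must be shared, and under a permissive license,
        # so the obligation does not propagate to people who reuse that code.
        return ["analysis scripts under a license that allows reuse (e.g., BSD or MIT)"]
    # Derived data sets are never part of the obligation in this sketch.
    return []


published_paper = Reuse(is_scientific_work=True, is_published=True, uses_analysis_scripts=True)
commercial_product = Reuse(is_scientific_work=False, is_published=True, uses_analysis_scripts=True)

print(obligations(published_paper))
print(obligations(commercial_product))
```

In this reading, a published paper triggers the obligation to share analysis scripts, while a commercial product or an unpublished analysis triggers nothing, and in no case does the rule ask for derived data.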

Against the “research parasite” argument

The proposed system offers some protection against the “research parasites” argument. The parasite discussion refers to the free-rider problem in social dilemmas: While some people invest resources to provide a public good, others (the parasites/free-riders) profit from the public good, without giving back to the community (see also Linek et al., 2017). This often creates a feeling of injustice, and impulses to punish the free-riders. (An entire scientific field is devoted to the structural, sociological, political, and psychological properties and consequences of such social dilemma structures.)
In the proposed licensing system, those who profit from openness by reusing open data must give something back to the community. This increases overall openness, reusability, and reproducibility of scientific outputs, and probably decreases feelings of exploitation and unfairness for the data providers.
Do you think such a license would work? Do you see any drawbacks we didn’t think of?
You can leave feedback here as a comment, on Twitter (@nicebread303) or via email to felix@nicebread.de.
