Research Software in Academic Hiring and Promotion: A proposal for how to assess it

In 2021, the German Psychological Society (DGPs) signed the DORA declaration. As a consequence, it recently set up a task force with the goal of creating a recommendation for how responsible research assessment can be practically implemented in hiring and promotion within the field of psychology.

In our current draft (not yet public), we want to decenter (A) scientific publications as the primary research output that counts, and recommend also taking (B) published data sets and (C) the development and maintenance of research software into consideration. (Along with Recognition and Rewards and other initiatives, we also call for taking teaching, leadership skills, service to the institution/field, and societal impact into account. In the white paper, however, we only address the operationalization of the research dimension.)

Concerning research software, we worked on an operationalization. It is inspired by:

Please note that …

  • The system should be as simple as possible (otherwise it will not be used by hiring committees).
  • Psychologists are not computer scientists, so existing criteria aimed at computer scientists might be too advanced.
  • As R is the #1 open source software for statistical computing in psychology, all examples relate to R.

Here is our current draft of the research software section. As we are not aware of any concrete implementation of assessing research software for hiring or promotion purposes (at least not in psychology or neighboring fields), we would like to ask the community for feedback. At the end of the post, we list three ways you can comment.

DRAFT SECTION FOR OPERATIONALIZING RESEARCH SOFTWARE CONTRIBUTIONS IN HIRING AND PROMOTION

(C) Research Software Contributions

Research software is a vital part of modern data-driven science that fuels both data collection (e.g., PsychoPy, Peirce et al., 2019, or lab.js, Henninger et al., 2021) and analysis (see, for example, R and the many contributed packages). In some cases, the functioning of entire scientific disciplines depends on the work of a few (often unpaid) maintainers of critical software (Muna et al., 2016). Furthermore, non-commercial open source software is a necessary building block for computational transparency, reproducibility, and a thriving and inclusive scientific community. It is high time that research software development is properly acknowledged in hiring and promotion, instead of being regarded as “career suicide”.

Some research software is accompanied by a citable paper describing the software (e.g., for the lavaan structural equation modeling package in R: Rosseel, 2012). However, these “one-shot” descriptions often do not appropriately reflect the continuous work and changing teams that are necessary to develop and maintain research software. Therefore, we include “Contributions to Research Software” as a separate category with its own quality criteria. Note that this category (C) only refers to dedicated, reusable research software, not to specific analysis scripts for a particular project. The latter should be listed under “Open reproducible scripts” for the respective paper in section (A).

For the evaluation of contributed research software, applicants can list up to five software artifacts, together with a self-assessment on the criteria presented in Table 3 (a more comprehensive evaluation scheme with additional quality criteria is proposed in Appendix A). Contributor roles are taken from the INRIA Evaluation Committee Criteria for Software Self-Assessment.

Table 3. Simple evaluation scheme for research software, with one specific example (“Research Software 1”: the RSA package). Each field lists the entry for the example software, a URL where applicable, and an explanatory comment for applicants.

Title: R package RSA
URL: https://CRAN.R-project.org/package=RSA

Citation: Schönbrodt, F. D., & Humberg, S. (2021). RSA: An R package for response surface analysis (version 0.10.4). Retrieved from https://cran.r-project.org/package=RSA

Short description: An R package for Response Surface Analysis.

Date of first full release: 2013
Comment: Necessary to compute citations relative to the age of the software.

Date of most recent major release: 2020
Comment: Indicates whether the software is actively maintained.

Contributor roles and involvement: DA-3, CD-3, MS-3
Comment: What has the applicant contributed? For each of the three roles
– design and architecture (DA)
– coding and debugging (CD)
– maintenance and support (MS)
specify whether you are:
0 – not involved
1 – an occasional contributor
2 – a regular contributor
3 – a main contributor
Example: DA-2, CD-3, MS-1

License: GPLv3
Comment: Is the software open source?

Scientific impact indicators:

Downloads or users per month: 710 downloads / month (https://cranlogs.r-pkg.org/badges/RSA)

Citations: 110 (https://scholar.google.de/citations?view_op=view_citation&hl=de&user=KMy_6VIAAAAJ&citation_for_view=KMy_6VIAAAAJ:mB3voiENLucC)
Comment: Evaluate relative to the age of the software.

Other impact indicators (optional):
Comment: E.g., GitHub stars or the number of dependencies. Be careful and responsible when using metrics, in particular when they are black-box algorithms.

Reusability indicator: R3
Comment: Levels of the reusability indicator:

R1 (0.25 points): Single scripts, loose documentation, no long-term maintenance.
Prototype: A collection of reusable R scripts on OSF.

R2 (1 point): Well-developed and tested software, fairly extensive documentation. Some attention to usability and user feedback. Not necessarily regularly updated.
Prototype: A small CRAN package with no further active development (maintenance only).

R3 (2 points): Major software project, strong attention to functionality and usability, extensive documentation, systematic bug chasing and unit testing, external quality control (e.g., by uploading to CRAN). Regularly updated.
Prototype: A well-received and actively maintained CRAN package.

R4 (6 points): Critical infrastructure software. Hundreds of research projects use or depend on the software (in addition to all criteria of R3).
Prototype: The lavaan package.

Merit / impact statement (narrative, max. 100 words): The RSA package has become a standard package for computing and visualizing response surface analyses in psychology. A PsycInfo search for “response surface analysis” (from 2022-05-18) revealed that of the 20 most recent publications, 35% used our package (although 2 of 7 did not cite it). Several features, such as the computation of multiple standard models and model comparisons, are unique to this package.

Reward points: (3 + 3 + 3) / 3 × 3 = 9
Comment: Take the average value of the three contributor roles and multiply it by the points of the reusability level.
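
As an illustration of how the quantitative entries of Table 3 could be obtained, here is a minimal R sketch. The download query assumes the cranlogs package is installed; the reward_points() helper and the points-per-level vector are not part of the draft and simply encode the rule stated in the table.

# Minimal sketch (not part of the draft): quantitative entries of Table 3 in R.

# Downloads per month, as reported by the cranlogs service
# (assumes the 'cranlogs' package is installed).
library(cranlogs)
dl <- cran_downloads(packages = "RSA", when = "last-month")
sum(dl$count)   # total downloads over the last month

# Reward points, following the rule in Table 3: average of the three
# contributor-role levels (0-3), multiplied by the points of the reusability level.
reusability_points <- c(R1 = 0.25, R2 = 1, R3 = 2, R4 = 6)

reward_points <- function(roles, reusability) {
  # roles: numeric vector with the levels for DA, CD, and MS (each 0-3)
  # reusability: one of "R1", "R2", "R3", "R4"
  mean(roles) * reusability_points[[reusability]]
}

# Example from the comment column (DA-2, CD-3, MS-1) for an R2 package:
reward_points(c(DA = 2, CD = 3, MS = 1), "R2")
#> 2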

Is there essential information missing in the table?

Calibrating reward points

We also want to offer a suggestion for how to compute “reward points”. The goal is to bring the categories of “publications” and “software contributions” onto a common evaluative dimension. This gets a bit complicated, as we also propose bonus points for publications with certain quality criteria, so not every publication gets the same number of points. For the moment, imagine a publication of good quality (neither a quickly churned-out low-quality publication nor an outstanding, seminal contribution). What is the “paper equivalent” of a software contribution? Note that these bonus points are meant to be incremental to an existing paper that describes the software.

Here is our suggestion, while being aware that it is easy to find counter-examples that do not fit the system. But we are happy if our system is an incremental improvement over the status quo (which is to ignore software contributions and to count the number of papers without any quality weighting):

Research software prototype | Paper equivalents (of good quality)
Simple script (a few hundred lines) with reuse potential, completely done by the applicant | 0.25
A well-developed CRAN package: occasional co-developer with a minor contribution | 0.5
A well-developed CRAN package: active co-developer with a major contribution | 1
A well-developed CRAN package: main developer | 2
Critical infrastructure: regular co-developer | 2
Critical infrastructure (e.g., lavaan): main developer | 5
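
To make the calibration concrete, the mapping above could be encoded as a simple lookup table. The sketch below is only an illustration; the object names and the example portfolio are ours and not part of the draft.

# Minimal sketch (not part of the draft): the calibration above as a lookup table.
paper_equivalents <- data.frame(
  prototype = c(
    "Simple script with reuse potential, completely done by applicant",
    "Well-developed CRAN package: occasional co-developer (minor contribution)",
    "Well-developed CRAN package: active co-developer (major contribution)",
    "Well-developed CRAN package: main developer",
    "Critical infrastructure: regular co-developer",
    "Critical infrastructure (e.g., lavaan): main developer"
  ),
  paper_equivalent = c(0.25, 0.5, 1, 2, 2, 5)
)

# Hypothetical applicant: main developer of one well-developed CRAN package,
# occasional co-developer of another.
sum(paper_equivalents$paper_equivalent[c(4, 2)])
#> 2.5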

How to comment?

If you have comments, you can …

Thanks for your help!

5 thoughts on “Research Software in Academic Hiring and Promotion: A proposal for how to assess it”

  1. Thanks for this great draft!
    I have a few unconnected thoughts (and apologies in case I have missed how they are already addressed in the draft – I couldn’t find Appendix A so it may all be addressed already):

    1) While R may currently be the predominantly used language in psychology, other languages might take its place in the future or at least become more common. I think it’s valid to have R-based examples, but a language-agnostic general operationalization could potentially improve the draft. For example, using a term like “Software available in language-appropriate software repositories or package indices/registries” would not only include CRAN for R, but also make room for different registries (e.g., PyPI for Python, npm for JS, JuliaPackages for Julia, GitHub for Go, etc.) and would allow for new package indices in the future.

    2) I believe that the number of downloads of a package is a very intuitive indication of scientific impact, but I want to note that this index can be highly inflated by continuous integration testing, which can increase the number of downloads significantly. In our project we performed an alternative calculation based on https://popcon.debian.org (for the Debian operating system, users can opt in to submit information on which packages they install to a popularity contest, from which one could extrapolate). In any case, getting an accurate count of usage/users is a difficult problem – as an open source software maintainer myself, I would also dislike having to incorporate any usage tracking into my software to pacify hiring committees (I don’t have a solution, but wanted to voice this concern (: ). And lastly, different packages have different installation patterns – a Python package that is reinstalled in every virtual environment of a researcher will get more downloads than a system package for a compute cluster that’s only upgraded once a year.

    3) A different metric for impact may be a discipline-specific Technology Readiness Level (https://en.wikipedia.org/wiki/Technology_readiness_level), although I am not aware of a psychology-specific one.

    4) I stumbled over the point “Date of most recent major release” – I know quite a few projects (in Python; I’m not well-versed in the R universe) that took decades to arrive at their first major release (this is if we’re talking about semantic versioning, https://semver.org 2.0.0, with a major release being for example 2.0.0, a minor one being 0.2.0, and a patch release being 0.0.2). For example, SciPy 1.0 was released almost 20 years after its initial release, MNE-Python (https://github.com/mne-tools/mne-python/) just recently released 1.0 after 13 years, pymer4 hasn’t reached a major release yet (https://github.com/ejolly/pymer4), etc. Focusing only on major releases would in those cases paint a misleading picture (or sabotage proper semantic versioning), and a minor release may be better suited to capture a package’s maintenance.

    Thanks for the great proposal, I hope my comments contain some relevant thoughts. 🙂

    1. I agree with the observation about “major vs. minor software release”. Actually, good practice is to not do major releases very often, as they might break your users’ code.
      I would replace “Date of most recent major release” with “Date of most recent minor or major release”.

  2. This is an excellent framework. I love the idea of getting credit for all of the work we do. That said, I have one large and one smaller concern (right now).

    1. Large. This could be really useful in departments where software development is common, but that’s the minority of Psychology departments in my experience. In my department of 60+ full-time faculty, there are perhaps 5 of us who would ever want to use this for our own research or, maybe more importantly, would be able to navigate assessing someone else with it — there just aren’t enough people who are familiar enough with any of it. So I could see the contributions being minimized or eliminated, just because some people don’t understand them. Similar to what happens with a lot of community outreach that ends up just lumped into service.

    2. Smaller. Some of the criteria are more fine-grained than those we use to assess articles (for better or for worse…). For example, there are conventions that first and last author are places of prestige for articles, so we often have “lead” and “non-lead” author distinctions. Any time there are more options, as there are here, it’s more difficult for people to judge. Maybe something as simple as “major” and “minor” contributor? Or something like grants (PI, co-I, key personnel)? But something simpler and possibly more aligned with other systems we already use.

  3. Great idea. “cranlogs” (as well as paper citations) can be manipulated, e.g., if one writes a script that downloads the package many times every day. This is not very difficult. So “Downloads or users per month” is a very nice metric, but one should be aware that it can be manipulated.

    The vision should be that software development is recognized in general. For example, Michel Lang has done much more for the scientific community in R than many others (https://www.statistik.tu-dortmund.de/lang.html), but he will never become a professor if he does not publish more, possibly unnecessary, papers.

    Best regards,
    Philipp
