
10 non-trivial things GitHub & friends can do for science


This year GitHub, FigShare, and Mozilla ScienceLab introduced the minting of DOIs for software. It’s worth mentioning that Zenodo is part of the game as well, with less marketing rumble. The latest blog post from Arfon Smith (GitHub) will probably boost the number of software publications, so that software may become more citable, may become part of the scientific tradition, and may help address open reproducibility issues.

So it’s one of the best things this year that key players like GitHub, FigShare, Mozilla ScienceLab, and Zenodo allow software to be labelled with DOIs. The recognition of this topic has gained a new dimension. That’s great!

Interests matter

However, interests matter, especially commercial interests. GitHub has recognized that the amount of scientific software hosted on GitHub has reached a critical mass. Researchers are thus a target group that may become valuable customers, if they are not already. So Arfon Smith, known for co-founding Zooniverse and thus familiar to researchers and the interested public alike, is taking care of GitHub’s fitness for the scientific community. This happens in alliance with other key players, FigShare and Mozilla ScienceLab, accompanied by modest but noticeable marketing in the scientific community. Obviously there is no reason to complain, since commercial interests already support the sciences through various business models. In this case GitHub and FigShare / Zenodo should be seen as a new type of publisher, concentrating on publishable material other than plain text and reviewed papers. This type of business is welcome if it supports the scientific community. Time will tell whether these commercial interests address researchers’ needs and provide solutions for current problems in the sciences. Open Access journals and new publishing approaches as practised by PLOS and PeerJ came to life because of the problems encountered in the past. We shouldn’t repeat this painful and long-winded process by cementing half-baked solutions for the publication of software.

So it’s time to mention that the current solution has drawbacks which may become serious problems if they are not solved in due course.

Open issues

1. GitHub attaches DOIs to code copies via FigShare or Zenodo. GitHub should do this itself. So why is GitHub doing it via FigShare and Zenodo? Presumably GitHub wants to save the cost of minting DOIs. But more importantly, GitHub gets rid of the responsibilities associated with DOIs in the scientific world. Thus it can act freely, without any commitments to the scientific community. Sooner or later this will lead to serious problems, at least for researchers. Furthermore, DOIs should refer to code freezes and revisions in a completely different way, rather than just pointing to detached copies in FigShare and Zenodo, which so far are dead ends with no way back.

2. GitHub, FigShare, and Zenodo attach DOIs to code copies. Again, plain code copies! It would make no difference to zip up code from repositories or from file systems and then publish it as ‘data’ with a DOI. This is in no way different from a static data publication just because it is labelled ‘software’.

3. The connection between frozen code copies and the continuously developed software isn’t addressed sufficiently. It’s important to find the way back to the ‘original’ that has been copied, and from there on to new versions or eventually forks, branches, etc. This could probably be addressed much better if the DOI were minted for the original revision instead of for a copy. However, even for the current code-copy solution it could be addressed with proper metadata and linking. So far this is done insufficiently, e.g. by linking the main repository without version information.
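Such linking could be sketched with metadata along the lines of DataCite’s relatedIdentifiers. This is only an illustration of the idea, not an existing record: the DOI, repository URLs, and version tag below are invented.

```python
# A sketch of the linking metadata a code-copy DOI could carry, loosely
# following DataCite's relatedIdentifiers; all values below are invented.
metadata = {
    "identifier": "10.5281/zenodo.EXAMPLE",
    "relatedIdentifiers": [
        {   # back-link to the exact revision that was copied
            "relatedIdentifier": "https://github.com/example/fidgit/tree/v0.0.3",
            "relatedIdentifierType": "URL",
            "relationType": "IsDerivedFrom",
        },
        {   # forward-link to the living repository
            "relatedIdentifier": "https://github.com/example/fidgit",
            "relatedIdentifierType": "URL",
            "relationType": "IsSupplementTo",
        },
    ],
}

def backlinks(md):
    """Return the URLs a reader can follow back to the original code."""
    return [r["relatedIdentifier"] for r in md["relatedIdentifiers"]]

print(backlinks(metadata))
```

With version-aware links of this kind, a reader resolving the DOI could reach both the exact frozen revision and the living repository, instead of a dead-end copy.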

4. DOIs on software imply valuable publications. That isn’t the case here. A DOI is just an identifier enabling citability. The quality issue is not addressed so far.

5. This means that the topic of software publications in the sense of valuable publications is not addressed either. But this topic in particular requires a lot of effort. It’s about breaking with traditional processes and expanding them, so that it becomes possible to publish software adequately and seriously in a scientific context. Minting a DOI for a code copy without any quality control is not a serious publication.

6. The software or code is not citable. Check samples, e.g. Dynsim, Nimbus, and Scythe, and ask yourself whether the code has been copied directly from GitHub just to mint a DOI. Then check the ‘Cite this’ and ‘Export’ sections to see what information is provided at all. You may even use DataCite’s crossref to check whether the software is citable. You will notice that the version or revision of the software is missing, along with other information. Using such snippets for citations in papers will most likely produce problems.
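A minimal sketch of such a check, assuming a DataCite-style metadata record reduced to a plain dictionary. The field names follow the DataCite schema; the record content is invented:

```python
# Hypothetical check: does a DataCite-style metadata record carry enough
# information to cite a specific software version?
REQUIRED_FIELDS = ("creators", "titles", "publisher", "publicationYear", "version")

def missing_citation_fields(record):
    """Return the fields that are absent or empty in `record`."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

# A record as typically produced for a GitHub code copy today (invented):
record = {
    "creators": [{"name": "Doe, Jane"}],
    "titles": [{"title": "fidgit"}],
    "publisher": "figshare",
    "publicationYear": 2014,
    # "version" is missing -- exactly the problem described above.
}
print(missing_citation_fields(record))  # → ['version']
```

Such a check could run automatically at minting time and refuse a DOI until the version information is present.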


7. It isn’t clear which metadata are set automatically. Publishers and library experts of the DOI world should check which information should be set automatically and which manually, and especially which fields should be set in addition to the limited fields already offered. In addition, a comprehensive guideline, clear recommendations, or best practices would help, so that each DOI-fication of code can proceed with guidance. Normally DOIs are minted by publishers, so that submitters don’t have to care much about the DOI world and library aspects. This is different in the process of transferring GitHub repository copies to FigShare or Zenodo, where the metadata is curated by the submitters themselves or set automatically.

8. Without proper metadata, detached copies – with almost no connection to their living originals and not embedded in their ecosystems – mean that software can’t cite software. Referring to third-party libraries and mentioning the dependencies needed to enable others to run the code is just half of the game. This information has an indispensable value, so that not only citations in papers are counted but also how often software is used by other software. Just check CRAN with its information on reverse dependencies, imports, and linking, which makes it easy to see which packages are used often by other packages and thus matter in their ecosystems. Debian provides this information in RDF along with its packages. So citations of software, and of other citable items within software, may lead to important metrics in the future.
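The idea of counting software-to-software use can be sketched with an invented dependency map in the spirit of CRAN’s Imports fields (all package names below are made up):

```python
from collections import Counter

# Hypothetical dependency declarations, in the spirit of CRAN's
# Imports/Depends fields; all package names are invented.
declared_deps = {
    "pkgA": ["matrixlib", "plotcore"],
    "pkgB": ["matrixlib"],
    "pkgC": ["plotcore", "matrixlib"],
}

def reverse_dependencies(deps):
    """Map each package to the packages that declare it as a dependency."""
    rev = {}
    for pkg, uses in deps.items():
        for dep in uses:
            rev.setdefault(dep, []).append(pkg)
    return rev

rev = reverse_dependencies(declared_deps)
usage = Counter({dep: len(users) for dep, users in rev.items()})
print(usage.most_common())  # matrixlib is used by 3 packages, plotcore by 2
```

This is exactly the inversion CRAN performs to display reverse dependencies; with versioned DOIs in the dependency declarations, the same counting would yield citation-like usage metrics for software.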

9. As mentioned before, metrics aren’t addressed properly, e.g. how software is used by other software and referenced in papers. It’s not yet clear which metrics may be used for evaluation purposes. But available solutions and new ideas are just waiting to be implemented so that experience can be gained in the field of software publications. Software could then be acknowledged in researchers’ publication lists similar to high-ranking paper publications.
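One hypothetical metric along these lines would combine paper citations with software-to-software use under adjustable weights. All numbers and the weighting itself are invented for illustration:

```python
# Hypothetical impact metric combining paper citations with
# software-to-software use; the counts and weights are invented.
citations = {"fidgit": {"papers": 12, "software": 5}}

def impact(name, w_paper=1.0, w_software=1.0):
    """Weighted sum of paper citations and uses by other software."""
    c = citations[name]
    return w_paper * c["papers"] + w_software * c["software"]

print(impact("fidgit"))                  # equal weights
print(impact("fidgit", w_software=2.0))  # value software reuse more
```

Which weighting is appropriate for evaluation purposes is precisely the open question; the sketch only shows that such metrics become computable once software citations are recorded.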

10. Software DOIs are not used to their full potential. DOIs follow a well-defined format. This format just has to be extended a bit to follow the footprints of software and its tree of life, e.g. 10.6084/figshare.fidgit.0-0-3 or 10.4321/98765.fidgit.0-0-4. Citations in other software and in papers could then be looked up by simple searches with wildcards to gain detailed insights.
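A sketch of how such versioned DOIs could be parsed and searched with wildcards, assuming the invented prefix/name.version pattern used in the examples above:

```python
import fnmatch
import re

# Hypothetical versioned DOIs following the pattern sketched above,
# e.g. 10.6084/figshare.fidgit.0-0-3 (prefix/name.version).
dois = [
    "10.6084/figshare.fidgit.0-0-3",
    "10.6084/figshare.fidgit.0-0-4",
    "10.4321/98765.fidgit.0-0-4",
]

def parse_versioned_doi(doi):
    """Split a versioned DOI into (prefix, name, version); None if no match."""
    m = re.fullmatch(r"(10\.\d+)/([\w.]+?)\.(\d+(?:-\d+)*)", doi)
    return m.groups() if m else None

# Wildcard lookup: every registered 0-0-x revision of 'fidgit',
# regardless of the registrant.
matches = fnmatch.filter(dois, "10.*/*fidgit.0-0-*")
print(matches)
```

A resolver aware of this convention could answer questions like “which revisions of fidgit have ever been cited?” with a single wildcard query.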

Keep it up!

However, in summary, the work done so far by GitHub, FigShare, Mozilla ScienceLab, and Zenodo is awesome! It has been an important and overdue step. But it’s just the first step on a long road. The current state is half-baked and shouldn’t be cemented. Please keep it up!

Also, you may cross-check how others are doing at minting DOIs for software, e.g. Purdue’s HUBzero-based nanoHUB, which has been minting DOIs for software for a while.

Special thanks go to Kaitlin and Arfon for their input at the BD2K and CW14 workshops, and to Jure for his inspiring post.

6 Comments

  1. Stacy Konkiel May 20, 2014 at 4:41 pm

    You’ve hit the nail on the head with some of the challenges that still lie ahead of us w/r/t code citability. Specifically, the issue of granular DOIs that pair with different software versions. The only implementation of granular DOIs that I know of is Dryad’s, where you can theoretically link to a particular data file within a larger dataset. However, if you follow a granular DOI to their system, you’re taken to the same landing page as if you’d followed the non-granular DOI (for the entire dataset). Software versioning is different, though, and in some ways I’d imagine easier to assign granular DOIs to.

    Anyway, the GitHub/Mozilla Sci/Figshare partnership is a valuable one that’s laying the groundwork for tackling some of these larger challenges you identify. I’m looking forward to seeing what they–and Sciforge–uncover.

  2. Robert Forkel July 30, 2014 at 3:39 pm

    The way I see it, your issues 1-3 are rather advantages. I was actually quite happy when I realized that ZENODO archives snapshots of GitHub repositories. I wouldn’t want to depend on GitHub for this.

    I also don’t think that the simple fact of minting DOIs will make a private company act more responsibly with regard to users who might not be the best customers in terms of revenue.

    And thirdly, why should anything other than a plain copy of the code of a repository release be labelled with a DOI? I don’t think these archives are meant to quick-start new developers on a software package.

    On a side note, I’ve written about how we use ZENODO’s integration with GitHub in the CLLD project: http://clld.org/2014/07/28/citing-clld-databases.html

    • Martin Hammitzsch July 30, 2014 at 7:04 pm

      Robert, many thanks for your valuable comments. Also your post on ‘Citing CLLD Databases and Reproducible Research’ is excellent. It addresses essential aspects and tools, and furthermore demonstrates state-of-the-art best practices scientists developing software should master nowadays to publish their results – in this case the software they developed.

      Regarding your comments, I agree, the archive snapshots of GitHub repositories are a huge step forward. However, there is room for improvement. Regarding revenue, you refer to a business model. GitHub gave up a business model by handing over the responsibilities for archive snapshots and DOIs. From my limited point of view there are other beneficial alternatives. GitHub offers plans. So why not offer a plan for scientists or their institutions that covers minting DOIs, generating the required metadata, and integrating DOI-related services? Every single Open Access publication incurs costs that research institutions pay. So they could pay for such plans as well. Of course, this would require different thinking in the sciences. But we are in the middle of this shift right now. Offering such a business model would push the transformation of science far beyond what the majority of scientists could imagine. GitHub would become a publisher – a provider serving software repositories in a social landscape would become a software publisher! Just spin the wheel further in this direction and you will find endless options that may shape the future of science.

      This leads us to the comment that archives are not meant to quick-start new developers on a software package. I agree. These are archives. But now it becomes tricky. What is archived? Is an Open Access paper an archive? And if so, an archive of what? Knowledge? What’s the difference between a paper and an archived paper? … OK, I’ll stop.

      However, I expect an Open Access paper to help me follow and reproduce the presented findings. Sorry for being so demanding. But that’s what I expect of software publications as well. Moreover, similar to an Open Access paper, I want to resolve a DOI, hit a landing page with relevant information, and get the findings so that I can work with them – in this case not the paper but the software, or rather the code with further supplements and a bit more.

      So, resolving your DOI 10.5281/zenodo.11040 redirects me to a landing page maintained by Zenodo https://zenodo.org/record/11040 – Zenodo is great! But I want to be redirected to a landing page as maintained by GitHub https://github.com/clld/wals3/tree/v2014.2 – that’s the landing page I want! I don’t say it necessarily must be GitHub. No, I mean a landing page like the one provided by GitHub. However, that page lacks the required metadata, services, and integration into the infrastructure behind DOIs, so DOIs cannot redirect there. The resulting solution has thus become a workaround, solving some of the issues while introducing new ones, e.g. taking the software out of its ecosystem and removing the history with its valuable comments. This is partly remedied by providing the URL of the release on GitHub. Zenodo displays this URL under ‘Related publications and datasets: Supplement to:’ on the website and includes it in the DataCite, MARC, and MARCXML metadata under ‘isSupplementTo’. That’s odd, at least to me.

      Let’s push things forward. If Zenodo maintained a Git repository to clone and freeze the release from GitHub, the cloned repository would be part of the publication. We would get rid of the zip archive and the ‘isSupplementTo’ information. Furthermore, Zenodo could then present the repository content in a similar way. I’m not talking about reimplementing GitHub. GitHub is excellent at what it does. Zenodo is excellent at what it does. But regarding the publication and preservation of software, or rather source code, Zenodo could do better by maintaining a Git repository and some software on top of it to browse the repository on their website. Again, spinning the wheel further would mean that Zenodo might implement functionalities we know from paper publications. Zenodo could integrate the tools required for review. Imagine, reviews for scientific software! Incredible. For example, if I had been able to review your software publication, I could have suggested separating the data from the software into two distinct publications, a data publication and a software publication :-)

      Of course, I understand the advantage of bundling the data with the software. Furthermore, the comment in your post on data quality, considering formats, metadata, and validation, is relevant to scientists developing software that handles data. The overall workflow presented in your post is great and deserves a DOI of its own.

      • Robert Forkel July 31, 2014 at 8:04 pm

        Ok. I see where you are going. And I agree with most of it. We basically have different perspectives: you are the visionary, I am the make-do-with-what-we-have type. Of course both roles are important.

        Where I disagree is on the idea of separating data and software. I think this idea is similar to the one about separating semantic markup and layout. It often isn’t as clear-cut as expected. In the WALS example, there are quite a few things hidden in the software which would have to go into the database to make this separation possible (chapter texts being the main example). But the software also serves as a description of the database. E.g. in the database there are two tables, language and walslanguage, which have a one-to-one relation, i.e. one supplements the other (see http://clld.readthedocs.org/en/latest/resources.html#models). This could of course be explained somewhere (and it is, e.g. http://docs.sqlalchemy.org/en/rel_0_9/orm/inheritance.html#joined-table-inheritance), but the code serves as proof and documentation of this relation.
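        The one-to-one language/walslanguage split described here can be sketched as plain joined SQL tables. This is a simplified sketch, not the actual WALS schema; the columns and values are reduced to invented essentials:

```python
import sqlite3

# Sketch of the one-to-one language/walslanguage relation: each
# walslanguage row supplements exactly one language row via its key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE language (pk INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE walslanguage (
    pk INTEGER PRIMARY KEY REFERENCES language(pk),
    ascii_name TEXT
);
INSERT INTO language VALUES (1, 'German');
INSERT INTO walslanguage VALUES (1, 'german');
""")
# The join reconstructs the full record from both tables.
row = con.execute("""
SELECT l.name, w.ascii_name
FROM language l JOIN walslanguage w ON w.pk = l.pk
""").fetchone()
print(row)  # → ('German', 'german')
```

        As the comment argues, the join condition itself is the piece of knowledge that lives in the code rather than in the data, which is why a clean data/software separation is hard here.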
