News

GitLab and large files - data sharing and versioning for the DataPLANT community

DataPLANT needs a solid technical base for collaboration within projects and between (inter)national research groups. This can be achieved through a framework that supports data versioning and sharing. The starting point is the Annotated Research Context (ARC), which was presented at the Kick-Off of Task Area 2 "Software / Services". A widely used tool - well beyond its original purpose of maintaining code in collaborative software projects - is the versioning software Git. Since an ARC consists of multiple file formats, including large files of raw data from various sources, the chosen framework needs to handle large files as well. Git was originally created with source code in mind, so the plain version is not well suited in this regard: it is implemented as a distributed version control system (DVCS), is not centralized by default, does not implement an inherent repository hierarchy, and all clones contain the full history by default. Git offers sparse clones and sparse checkouts (the dedicated sparse-checkout command was introduced with Git 2.25.0, released at the beginning of this year), but it still performs poorly with larger files. A possible solution for large repositories is therefore to store smaller (text) files with Git and larger files outside of Git; the versioning is handled by storing references to the externally stored (large) files in Git.
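
As a rough illustration of this pointer idea, here is a minimal conceptual sketch in Python - not how Git LFS, Git-annex or DVC actually implement it; the file names and the store location are made up for the example:

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical paths for this sketch only
LARGE_FILE = Path("assays/raw/measurement_01.raw")   # large raw data file
EXTERNAL_STORE = Path(".external-store")             # content store outside of Git

def replace_with_pointer(path: Path, store: Path) -> Path:
    """Move a large file into an external, content-addressed store and leave
    behind a small pointer file that can be committed to Git instead."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    size = path.stat().st_size
    store.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(store / digest))       # file content lives outside Git
    pointer = path.parent / (path.name + ".pointer")  # small text file, tracked by Git
    pointer.write_text(f"sha256:{digest}\nsize:{size}\n")
    return pointer

if __name__ == "__main__":
    print(replace_with_pointer(LARGE_FILE, EXTERNAL_STORE))
```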

Several implementations are available for this purpose: Git Large File Storage (LFS), Git-annex / DataLad, and Data Version Control (DVC). Git LFS is developed and maintained by GitHub and written in Go. It uses Git's smudge filter to replace the pointer file with the actual file content on checkout, so it works transparently to the user (Git LFS needs to be installed for that to work, though). LFS uses reflinks where possible, otherwise deep copies. It stores the pointer files in Git and the file contents in a special LFS storage, and it requires a dedicated server for managing the LFS objects.
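
For illustration, such an LFS pointer is just a small text file of "key value" lines (version, oid and size). A minimal sketch for reading one might look like this; the file path is hypothetical, and the key names follow the published LFS pointer format:

```python
from pathlib import Path

def parse_lfs_pointer(path: Path) -> dict:
    """Parse a Git LFS pointer file into a dict.

    Pointer files are plain text with one 'key value' pair per line,
    e.g. 'version <spec url>', 'oid sha256:<hash>' and 'size <bytes>'.
    """
    fields = {}
    for line in path.read_text().splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields

# Hypothetical example: a pointer left in the working tree when LFS is not installed
pointer = parse_lfs_pointer(Path("assays/raw/measurement_01.raw"))
print(pointer.get("oid"), pointer.get("size"))
```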

Git-annex (homepage: https://git-annex.branchable.com/) is used, for example, by DataLad, which is popular in the neuroscience community. It is written in Haskell (Git-annex) and Python (DataLad). It uses symlinks by default but also supports hardlinks, reflinks or copying of data. It maintains file information in a dedicated annex branch and directly supports a large number of different storage systems. DVC is a popular framework in the machine learning community and is written in Python. It uses reflinks by default but also supports symlinks, hardlinks or copying. It stores file information in .dvc files (YAML format) and directly supports S3, GCS, SFTP, HDFS or the local filesystem as a backend.
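
To give an idea of what such file information looks like, here is a small sketch that reads a .dvc-style pointer with PyYAML. The exact field names can vary between DVC versions, so treat the keys and values below as illustrative placeholders:

```python
import yaml  # PyYAML, assumed to be installed

# Illustrative .dvc content: an 'outs' list with hash, size and path entries.
# The hash value is a placeholder, not real data.
example = """
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 104857600
  path: measurement_01.raw
"""

meta = yaml.safe_load(example)
for out in meta["outs"]:
    print(out["path"], out.get("md5"), out.get("size"))
```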

There are several Free and/or Open Source Git collaboration platforms, such as GitLab (Community Edition) and Gitea (a community-driven fork of Gogs), as well as non-free or service-only offerings such as GitHub (cloud service, with an on-premises enterprise product available), GitLab (cloud service, with an on-premises Enterprise Edition) or Atlassian Bitbucket (cloud service only). GitLab - most relevant to the purposes of DataPLANT - is one of the "Big Three" players (GitHub, GitLab, Bitbucket). It can handle large numbers of repositories and users and provides many useful collaboration features such as an issue tracker, a wiki and an online editor. It incorporates a well-established, integrated CI/CD system, although the Community Edition provides a significantly reduced feature set.

Persistent person identifiers for long-term research data

In a highly networked and collaborative scientific field such as plant research, the goal is to jointly use and federate services for data management. With well-acknowledged data publications in mind, a persistent link between research objects, such as published Annotated Research Contexts (ARCs), and persons needs to be established and maintained. Of central importance for this objective are persistent identifiers for researchers. Because of the high turnover within this group of individuals, agreement among all stakeholders in the science enterprise on a uniform, internationally recognized, cross-institutional system would be a considerable relief, since switching between institutions would no longer require changes in the database. Such an internationally recognized identifier should be stable and unique per individual. In the course of the work of the Baden-Württemberg-based Science Data Center BioDATEN, which has personnel overlaps with DataPLANT, initial discussions were held in this direction.

As a first step, BioDATEN has joined the Memorandum of Understanding (MoU) of DINI and issued a recommendation for ORCID. In this context, it is recommended to identify persons in research information systems, repositories and research data management via the ORCID iD, and repositories via re3data. These considerations are also pending for DataPLANT and are to be discussed in the context of standardization (TA 1) or in the Scientific Board. The ORCID iD already enjoys broad adoption and acceptance in the community, although it does not have to remain the exclusive identifier.
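
One practical property of ORCID iDs is that the final character is a check digit (ISO 7064 MOD 11-2 over the preceding 15 digits), which lets local systems catch typos when identifiers are entered. A small validation sketch; the example iD at the end is used for illustration only:

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check digit for the first 15 digits
    of an ORCID iD (the 16th character is the check digit, 0-9 or X)."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Validate an ORCID iD written in the form 0000-0000-0000-0000."""
    digits = orcid.replace("-", "")
    if len(digits) != 16:
        return False
    return orcid_check_digit(digits[:15]) == digits[15].upper()

# Example iD for illustration purposes
print(is_valid_orcid("0000-0002-1825-0097"))
```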

The backend storage infrastructure is evolving

The storage system bwSFS (Storage-for-Science) forms the geo-redundant, distributed technical backbone for basic storage services, research data management and data sharing. It contributes to the storage infrastructure for the DataPLANT community alongside the de.NBI infrastructure services. The central storage components of bwSFS are located at the Tübingen and Freiburg computing centers. In order to reasonably manage the intended broad user base of the system - besides DataPLANT, the local communities of the participating universities and the Science Data Center BioDATEN are served as well - and to achieve seamless integration with the envisioned DataPLANT services, federated management of project, user and group data is necessary. Already in the implementation phase of the software and services, which involves the subject sciences, it has become apparent that the existing methods for identity management are not sufficient. Compared to HPC services, storage services require a much deeper integration with existing infrastructures and more flexible user management, which should also include ORCID, for example.

To support research data management, InvenioRDM, which already includes a convenient user interface and an OAI-PMH interface, has been chosen within bwSFS. All central institutions and projects involved in the RDM process were included in this decision at an early stage. In Freiburg, a GitLab instance will be used in the context of TA 4 for versioning, collaboration and sharing of data from ongoing projects. Established services of the university libraries will be used for DOI allocation in Invenio, and ORCID will be used for the persistent identification of researchers. In this way, resources for RDM are pooled in order to improve support for the disciplines in implementing specific RDM requirements and to ensure better advice for researchers. To enforce the FAIR and Open Access principles, DMPs will be used to support standardization in TA 1 with guidelines for metadata management, archiving and licensing models.
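
Because InvenioRDM exposes its records via OAI-PMH, downstream services could harvest metadata with plain HTTP requests. A minimal sketch follows; the endpoint URL is hypothetical, while the ListRecords verb and the oai_dc metadata prefix are standard OAI-PMH:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical OAI-PMH endpoint of an InvenioRDM instance
BASE_URL = "https://invenio.example.org/oai2d"

def list_records(base_url: str, metadata_prefix: str = "oai_dc"):
    """Fetch one page of records via the standard OAI-PMH ListRecords verb."""
    query = urllib.parse.urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    with urllib.request.urlopen(f"{base_url}?{query}") as response:
        return ET.fromstring(response.read())

if __name__ == "__main__":
    root = list_records(BASE_URL)
    ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
    for record in root.findall(".//oai:record", ns):
        identifier = record.find(".//oai:identifier", ns)
        print(identifier.text if identifier is not None else "(no identifier)")
```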