Participation in the virtual HeFDI-Plenary at Philipps-University Marburg
- Published on Thursday, 17 December 2020
The Hessian Research Data Infrastructures (HeFDI) hosted its first virtual Plenary on December 17 last year, and DataPLANT was on board. The event focused on cross-site networking of groups working on research data management (RDM) and on illustrating how the handling of data is changing. In particular, the inspiring keynote by Prof. Dr. Iris Pigeot from BIPS Bremen, entitled "Data Science and Data Sharing - Mission Impossible without Intelligent Research Data Management?", underlined the urgency of well-conceived data management. In the subsequent parallel poster sessions, various projects and infrastructure services presented themselves. The poster introducing DataPLANT was presented in a breakout session together with the NFDI4BioDiversity poster. A common challenge raised by the participants in the discussion was the recruitment and training of qualified personnel, especially data stewards. The event successfully encouraged networking between different consortia, and we are looking forward to the next one.
GitLab and large files - data sharing and versioning for the DataPLANT community
- Published on Friday, 11 December 2020
DataPLANT needs a solid technical base for collaboration within projects and between (inter)national research groups. This can be achieved through a framework that supports data versioning and sharing. The starting point is the Annotated Research Context (ARC), which was presented at the kick-off of Task Area 2 "Software / Services". A widely used platform - well beyond its original purpose of maintaining code in collaborative software projects - is the versioning software Git. As the ARC consists of multiple file formats, including large files of raw data from various inputs, the framework needs to deal with large files as well. Git was originally created with source code in mind, and the plain version is not well suited in this regard because it is implemented as a distributed version control system (DVCS): it is not centralized by default, does not implement an inherent repository hierarchy, and every clone contains the full history by default. Git supports shallow and partial clones as well as sparse checkouts, but it still performs poorly with larger files.
One option for large repositories is the git sparse-checkout command, which was introduced in Git 2.25.0, released at the beginning of this year. A more general approach is to store smaller (text) files in Git and larger files outside of Git. Versioning is then handled by storing references to the externally stored (large) files in Git.
There are several implementations available for this purpose: Git Large File Storage (LFS), Git-annex / DataLad, and Data Version Control (DVC). Git LFS is developed and maintained by GitHub and written in the Go language. It uses Git's smudge filter to replace the pointer file with the actual file content on checkout. This works transparently for the user, although Git LFS needs to be installed for that to work. LFS uses reflinks (if possible) or deep copies. It stores the pointer files in Git and the file contents in a special LFS storage, and it requires a dedicated server for managing LFS objects.
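To make the pointer-file idea more concrete, the following minimal Python sketch builds and parses the small text file that Git LFS commits in place of a large file. The three-line layout (version, oid, size) follows the published Git LFS pointer specification; the function names make_pointer and parse_pointer are hypothetical helpers chosen for this illustration and are not part of Git LFS itself.

```python
import hashlib

# Version line used by Git LFS v1 pointer files.
LFS_SPEC = "https://git-lfs.github.com/spec/v1"


def make_pointer(content: bytes) -> str:
    """Return the pointer text that would stand in for `content` in Git."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {LFS_SPEC}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_pointer(text: str) -> dict:
    """Parse pointer text into its version, SHA-256 digest and size in bytes."""
    # Each non-empty line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    if fields.get("version") != LFS_SPEC:
        raise ValueError("not a Git LFS v1 pointer")
    algorithm, _, digest = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unexpected hash algorithm: {algorithm}")
    return {
        "version": fields["version"],
        "oid": digest,
        "size": int(fields["size"]),
    }


if __name__ == "__main__":
    pointer = make_pointer(b"raw measurement data")
    print(pointer, end="")
    print(parse_pointer(pointer)["size"])
```

Only this small pointer is versioned in the repository; the content identified by the digest lives in the separate LFS storage and is swapped in by the filter on checkout.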
Git-annex (homepage: https://git-annex.branchable.com/) is used, for example, by DataLad, which is popular in the neuroscience community. Git-annex is written in Haskell and DataLad in Python. Git-annex deploys symlinks by default but also supports hardlinks, reflinks or copying of data. It maintains file information in a dedicated annex branch and directly supports a large number of different storage systems. DVC is a popular framework in the machine learning community and is written in Python. It uses reflinks by default but can also use symlinks, hardlinks or copying. It stores file information in .dvc files (YAML format) and directly supports S3, GCS, SFTP, HDFS or a filesystem as a backend.
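The symlink mode mentioned above can be pictured with a short Python sketch: the content of a large file is moved into a store named after its hash, and a symlink is left in its place. This only illustrates the general technique. The store layout, the function names sha256_of and store_and_link, and the use of absolute symlinks are assumptions made for this example and do not reproduce the actual on-disk formats of Git-annex or DVC.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_and_link(path: Path, store: Path) -> Path:
    """Move `path` into `store` under its hash and leave a symlink behind.

    Hypothetical layout: one flat directory with one file per digest.
    """
    target = store / sha256_of(path)
    store.mkdir(parents=True, exist_ok=True)
    if target.exists():
        path.unlink()  # identical content is already stored
    else:
        shutil.move(str(path), str(target))
    # The working copy now only holds a link to the stored content.
    os.symlink(target.resolve(), path)
    return target
```

Because the stored name is derived from the content, two identical files end up sharing one stored copy, and the lightweight link is all that remains in the working tree.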
There are a couple of Free and/or Open Source Software Git collaboration platforms, such as GitLab (Community Edition) and Gitea (a community-driven fork of Gogs), as well as non-free or service-only offerings such as GitHub (cloud service, on-premises enterprise product available), GitLab (cloud service, on-premises Enterprise Edition) or Atlassian Bitbucket (cloud service only). GitLab, the most relevant for the purposes of DataPLANT, is one of the "Big Three" players (GitHub, GitLab, Bitbucket). It can handle a large number of repositories and users, and it provides many useful collaboration features such as an issue tracker, a wiki and an online editor. It incorporates a well-established integrated CI/CD system, but the Community Edition provides only a significantly reduced feature set.
Persistent person identifiers for long-term research data
- Published on Tuesday, 24 November 2020
In a strongly networked and highly collaborative scientific field such as plant research, the goal is to jointly use and federate services for data management. With the goal of well-acknowledged data publication in mind, a persistent link between research objects, such as published Annotated Research Contexts (ARC), and persons needs to be established and maintained. Persistent identifiers for researchers are of central importance for this objective. Because of the high turnover within this group of individuals, agreement among all stakeholders in the science enterprise on a uniform, internationally recognized, cross-institutional system would be a considerable relief, since switching between institutions would no longer require changes in the database. Such an internationally recognized identifier should be stable and unique for each individual. Initial discussions in this direction were held in the course of the work of the Baden-Württemberg-based Science Data Center BioDATEN, which has personnel overlaps with DataPLANT.
As a first step, BioDATEN has joined the Memorandum of Understanding (MoU) of DINI and made a recommendation for ORCID. Accordingly, it is recommended to identify persons in research information systems, repositories and research data management via the ORCID ID, and repositories via re3data. These considerations are also pending for DataPLANT and are to be discussed in the context of standardization (TA 1) or in the Scientific Board. The ORCID ID is already widely used and accepted in the community. It does not have to remain the exclusive identifier.