News

DataPLANT participates in the CfP for the "E-Science-Tage 2021: Share Your Research Data"

DataPLANT submitted three proposals following the Call for Papers of the "E-Science-Tage 2021: Share Your Research Data", scheduled for the beginning of March in Heidelberg. The consortium plans to participate in the workshop suggested by the NFDI directorate to present the fundamental plant research community as part of the future research data management landscape. Such an integrated RDM landscape and its services enable reproducible research, the linking of interdisciplinary expertise, and the sharing of research results for comparison and integration of different analyses and metadata studies, unlocking the immense additional knowledge to be gained from them. Additionally, we proposed a short paper on the DataPLANT data steward model as a core element of a holistic strategy for managing research data in the field of plant research. Research groups will profit from direct support in their daily tasks, ranging from data organization to the selection of the proper tools, workflows and standards. Data stewards play a pivotal role between service providers, individual researchers, groups and the wider community, and they help bridge the gap between researchers and technical systems. The coordinated deployment of data stewards supports adherence to good scientific practice across the research community.

Following the kick-off of Task Area 2 "Software / Services", we started to draft an outline for the Annotated Research Context (ARC) as a starting point for experiments. An ARC captures the complete research cycle in a structured way, meeting the FAIR requirements while trying to mimic the way an individual researcher works. ARCs are self-contained and include assay/measurement data, workflows and computation results, accompanied by metadata, in one package. Their structure allows full user control over all metadata and facilitates usability, access, publication and sharing of the research. Thereby, ARCs are a practical implementation of existing standards, leveraging the advantages of the ISA model, Research Object Crates and the Common Workflow Language. The ARC concept relies on a structure that partitions assays, workflows and results for granular reuse and development. Assays cover biological, experimental and instrumental data, including its self-contained description using the ISA model. Similarly, workflows describe all digital steps of a study and contain application code, scripts and/or any other executable description of an analysis, providing the highest degree of flexibility for the scientists.
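As a rough illustration, the assay/workflow/result partitioning could translate into a directory layout like the following sketch (the folder and file names here are our assumptions for illustration, not a normative specification):

```shell
# Sketch of a partitioned ARC-style layout; all names are hypothetical examples
set -e
arc="$(mktemp -d)/my-arc"
mkdir -p "$arc/assays/assay1"        # biological/experimental/instrumental data + ISA description
mkdir -p "$arc/workflows/analysis"   # application code, scripts, executable analysis descriptions
mkdir -p "$arc/runs/run1"            # computation results produced by the workflows
touch "$arc/isa.investigation.xlsx"  # top-level ISA metadata for the whole package
ls "$arc"
```

Because data, workflows, results and metadata all live in one self-contained tree, the whole package can be shared, published or versioned as a unit.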


Participation in the virtual HeFDI-Plenary at Philipps-University Marburg

The Hessian Research Data Infrastructures (HeFDI) hosted their first virtual plenary on December 17 last year, and DataPLANT was on board. The primary focus of the event was both cross-site networking of RDM-related groups and illustrating the ongoing change in the way research data is handled. In particular, the inspiring keynote by Prof. Dr. Iris Pigeot from BIPS Bremen, entitled "Data Science and Data Sharing - Mission Impossible without Intelligent Research Data Management?", underscored the urgency of well-conceived data management. In the subsequent parallel poster sessions, various projects and infrastructure services presented themselves. The poster introducing DataPLANT was presented in a breakout session together with the NFDI4BioDiversity poster. One of the common challenges identified in the discussion was the acquisition and training of qualified personnel, especially data stewards. The event provided a welcome opportunity for networking between the different consortia, and we are looking forward to the next one.


Gitlab and large files - data sharing and versioning for the DataPLANT community

DataPLANT needs a solid technical basis for collaboration within projects and between (inter)national research groups. This can be achieved through a framework that supports data versioning and sharing. The starting point is the Annotated Research Context (ARC), which was presented at the kick-off of Task Area 2 "Software / Services". A widely used tool, well beyond its original purpose of maintaining code in collaborative software projects, is the version control software Git. As an ARC consists of multiple file formats, including large files of raw data from various inputs, the chosen framework needs to handle large files as well. Git was originally created with source code in mind, and the plain version is not well suited to this task: it is implemented as a distributed version control system (DVCS), is not centralized by default, does not implement an inherent repository hierarchy, and every clone contains the full history by default. Git does offer partial remedies such as shallow clones and sparse checkouts (the dedicated sparse-checkout command was introduced with Git 2.25.0, released in January 2020), but it still performs poorly with large files. A common solution is therefore to store smaller (text) files with Git, and larger files outside of Git: the versioning is handled by storing references to the externally stored (large) files in Git.
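A minimal local sketch of the sparse-checkout mechanism (requires Git 2.25 or newer; the repository layout and all names are hypothetical examples):

```shell
# Create a small "origin" repository with an ARC-like layout (names are hypothetical)
set -e
tmp="$(mktemp -d)"
git init -q "$tmp/arc-origin"
cd "$tmp/arc-origin"
mkdir -p assays/assay1 workflows
echo "raw measurement" > assays/assay1/raw.txt
echo "analysis step"   > workflows/run.txt
git add .
git -c user.name=demo -c user.email=demo@example.org commit -qm "initial layout"

# Clone sparsely, then restrict the working tree to a single assay
git clone -q --sparse "$tmp/arc-origin" "$tmp/arc-clone"
cd "$tmp/arc-clone"
git sparse-checkout set assays/assay1
ls   # only top-level files and the selected assay are checked out; workflows/ is not
```

Note that sparse checkout only limits what is materialized in the working tree; the full history is still transferred, which is why large binary data additionally needs to be kept outside of Git.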

Several implementations are available for this purpose: Git Large File Storage (LFS), Git-annex / DataLad and Data Version Control (DVC). Git LFS is developed and maintained by GitHub and written in the Go language. It uses Git's smudge filter to replace the pointer file with the actual file content, so it works transparently for the user (although Git LFS needs to be installed for that to work). LFS uses reflinks (where possible) or deep copies. It stores the pointer files in Git and the file contents in a special LFS storage, which requires a dedicated server for managing the LFS objects.
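The pointer file that Git LFS commits in place of the actual content is a small text file roughly of the following shape (the hash and size here are placeholder values, not real data):

```
version https://git-lfs.github.com/spec/v1
oid sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
size 268435456
```

On checkout, the smudge filter replaces this pointer with the real content fetched from the LFS storage.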

Git-annex (homepage: https://git-annex.branchable.com/) is used, for example, by DataLad, which is popular in the neuroscience community; the two are written in Haskell (Git-annex) and Python (DataLad). Git-annex deploys symlinks by default but also supports hardlinks, reflinks or copying of data. It maintains file information in a dedicated annex branch and directly supports a large number of different storage systems. DVC is a popular framework in the machine-learning community and is written in Python. It uses reflinks by default but can also use symlinks, hardlinks or copying. It stores file information in .dvc files (YAML format) and directly supports S3, GCS, SFTP, HDFS or a plain filesystem as a backend.
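For comparison, a DVC pointer file (e.g. a hypothetical `measurement.raw.dvc`) is a small YAML document along these lines (hash, size and path are placeholder values):

```yaml
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e   # content hash of the large file
  size: 1048576                           # file size in bytes
  path: measurement.raw                   # the large file itself stays out of Git
```

Only this small file is committed to Git; the raw data itself is pushed to the configured backend (e.g. an S3 bucket) with `dvc push`.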

There are a couple of free and/or open-source Git collaboration platforms, such as GitLab (Community Edition) and Gitea (a community-driven fork of Gogs), as well as non-free or service-only offerings such as GitHub (cloud service; an on-premises enterprise product is available), GitLab (cloud service; on-premises Enterprise Edition) and Atlassian Bitbucket (cloud service only). GitLab, the most relevant to the purposes of DataPLANT, is one of the "Big Three" players (GitHub, GitLab, Bitbucket). It can handle large numbers of repositories and users, and it provides many useful collaboration features such as an issue tracker, a wiki and an online editor. It also incorporates a well-established integrated CI/CD system, although the Community Edition provides only a significantly reduced feature set.