Preliminary report - Discussion and considerations on ARC-based data versioning and sharing

16 Sep 2021

In DataPLANT, a working group has formed to define a specification of the Annotated Research Context (ARC) and the backend infrastructure needed for actual data versioning and sharing. This group looks into code repositories because typical processes found in software development align well with the goals of research data management (RDM).

Following examples in other disciplines, DataPLANT explores the adaptation of Git in an RDM context. This news item provides a short outline of the current discussion on, and status of, the efforts of the DataPLANT consortium, within the framework of the National Research Data Infrastructure, to provide the technical basis for the corresponding backend infrastructure for the plant research community. The approach focuses not only on the raw data gathered during experiments, but also considers analysis routines and derived data products within the context of RDM.

In DataPLANT, ARCs are Git repositories plus extras. The ARC builds on and implements existing standards such as ISA for administrative and experimental metadata and CWL for analysis and workflow metadata. ARCs are designed to represent digital objects that fulfill all FAIR principles and are therefore referred to as FAIR Digital Objects (FDOs). To simplify the use of the versioning framework Git for non-software developers, the ARC Commander (an end-user tool developed within DataPLANT) wraps the repository interaction.

The ARC Commander is a CLI tool for ARC operations that integrates Git operations, metadata maintenance, and workflow ease of use:

- arc init implies git init plus extra ease-of-use defaults: it creates default files and folders such as assays/, workflows/, runs/, and isa.*.xlsx.
- arc push/pull are equivalent to git push/pull and initiate an explicit upload/download of local changes.
- arc update is equivalent to the command sequence [git lfs track +] git add + git commit. It auto-detects large files, auto-adds files, implements ARC structure sanity checks, and generates semi-automatic commit messages.
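To make the arc update equivalence concrete, here is a dry-run sketch of the large-file detection step. The size threshold and the idea of printing the commands instead of running them are our assumptions for illustration, not the actual ARC Commander implementation.

```shell
#!/bin/sh
# Dry-run sketch (assumption, not ARC Commander source) of `arc update`:
# files in assays/ above a size threshold are routed through `git lfs track`
# before the usual `git add` + `git commit`.

arc_update_dry_run() {
  arc_root=${1:-.}
  threshold_mb=10   # hypothetical cut-off for "large"

  # Every file above the threshold would be tracked via LFS first.
  find "$arc_root/assays" -type f -size +"${threshold_mb}M" 2>/dev/null |
    while IFS= read -r f; do
      printf 'git lfs track "%s"\n' "$f"
    done

  # Then the regular staging and (semi-automatic) commit follow.
  printf 'git add -A\n'
  printf 'git commit -m "arc update (auto)"\n'
}
```

Running this against an ARC root prints the Git commands that a real update would execute, which makes the [git lfs track +] git add + git commit decomposition visible.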

The ARC Commander sticks with the “Git repository + convenience layer” approach and reuses Git mechanisms where reasonable: git config, git hooks (e.g. a sanity check in a pre-commit hook), .gitattributes for repository-level metadata, git tag for releases, etc.
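A minimal sketch of such a pre-commit sanity check, assuming the top-level layout named in this report (assays/, workflows/, runs/, isa.*.xlsx); the checks the ARC Commander actually installs may differ:

```shell
#!/bin/sh
# Hypothetical .git/hooks/pre-commit sketch: reject commits when the basic
# ARC layout is missing. Directory and file names follow the report; the
# real hook shipped by the ARC Commander is not reproduced here.

check_arc_structure() {
  root=${1:-.}
  # The three standard ARC directories must exist.
  for d in assays workflows runs; do
    [ -d "$root/$d" ] || {
      echo "pre-commit: missing $d/ directory" >&2
      return 1
    }
  done
  # At least one ISA metadata workbook (isa.*.xlsx) is expected at the root.
  ls "$root"/isa.*.xlsx >/dev/null 2>&1 || {
    echo "pre-commit: no isa.*.xlsx metadata file found" >&2
    return 1
  }
  return 0
}
```

A non-zero exit from the hook aborts the commit, so a structurally broken ARC never enters the history.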

The Git mechanisms discussed for collaboration and for re-using assays and workflows are git-submodule, git-subtree, git-subrepo, datalad, ... (there are more approaches out there). The design choices follow the principles of compatibility (play well with other Git-based tools), robustness (avoid “I can’t work because someone else screwed up.”), and “make the lives of the operators harder rather than those of the users”.

For DataPLANT's purpose, Git submodules and git-subtree have the wrong granularity: they include entire repositories as subdirectories (not only parts; a workaround could be sparse checkouts plus symlinking) and require continued cleverness in maintaining the repository structure (the existence of submodules has to be accounted for in basically all Git commands). This approach is the most Git-like and maximally decentralized, but it also makes it very hard to understand dependencies and to ensure completeness. Another option would be datalad, which manages a tree of nested Git repositories and is based on git-annex. It is very powerful, but also brittle: nothing can be done without the datalad tool. The approach is in principle decentralized, but requires a central registry to work well, i.e. to ensure data persistence, archiving, and indexing. There is typically one registry per community (e.g. GIN, NFDI Neuroscience, ...)

Thus DataPLANT opts for a modified git-subrepo approach: an import fetches the imported subdirectory and adds it to the local repository (“copy”), which is easily implemented using Git plumbing commands (git filter-branch and similar). For an export, the history of the subdirectory is isolated and grafted onto the original history. This requires low-level Git object manipulation, but is otherwise straightforward. It enables push/pull-style collaboration while avoiding complexity for users, retaining compatibility with the standard tools, and keeping the history easy to understand.

The primary use case are the (large) data files in assays/. Applying the same design choices as before, git-lfs comes out as the (initial) choice. It offers minimal complexity and is well supported and widely used. From the implementation perspective, LFS just works, and "git lfs track" can be automated during arc update. Zero configuration is needed against GitLab/GitHub/Gitea or SourceTree/VS Code/gitk. It is well supported in UIs and thus adds to general transparency; moreover, the data is kept joined with its metadata.

A couple of preliminary experiments were run and are currently being extended to gather more (performance and usability) experience. The results obtained on the test and production instances of the Git repositories in Freiburg (with a suboptimal hardware setup) and Tübingen look promising, but a couple of issues still have to be tackled. To track provenance and the complete history, the workflow results (runs) will be committed via Git into the ARC.
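The zero-configuration LFS integration boils down to one .gitattributes line per tracked pattern. A minimal sketch of what the automated tracking step writes (the assays/** pattern is an illustrative assumption, not a documented ARC default):

```shell
#!/bin/sh
# Sketch of the effect of `git lfs track <pattern>`: each tracked pattern
# becomes one .gitattributes line that routes matching files through the
# LFS filter. The chosen pattern below is an assumption for illustration.

track_pattern() {
  pattern=$1
  attrs=${2:-.gitattributes}
  # This is the exact attribute line git-lfs writes for a tracked pattern.
  printf '%s filter=lfs diff=lfs merge=lfs -text\n' "$pattern" >> "$attrs"
}
```

Because .gitattributes is a plain versioned file, hosting platforms and UIs pick the tracking rules up without any server-side configuration, which is what makes LFS work out of the box against GitLab/GitHub/Gitea.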