Example 1: aims at creating a computational model of carbon and water flow within a whole plant architecture
Example 2: aims at generating data management plan with minimal effort and making the data as open as possible
Example 1: carbon and water flow in plants
Example 2: data management plan
Example 1: Industry, politicians and students can also use the data for different purposes.
Example 2: The data acquired in the project can be used by a wide range of people with different purposes.
Information in this section is only used in the DMP metadata and does not appear in the document.
Data officers are also known as data stewards and curators.
software that legally remains the property of the organization, group, or individual who created it.
Action Number: | $_FUNDINGPROGRAMME |
Action Acronym: | $_PROJECT |
Action Title: | $_PROJECT |
Creation Date: | $_CREATIONDATE |
Modification Date: | $_MODIFICATIONDATE |
DMP version: | $_DMPVERSION |
#if$_EU The $_PROJECT is part of the Open Data Initiative (ODI) of the EU. #endif$_EU To best profit from open data, it is necessary not only to store the data but to make it Findable, Accessible, Interoperable, and Reusable (FAIR). #if$_PROTECT We support open and FAIR data; however, we also consider the need to protect individual data sets. #endif$_PROTECT
The aim of this document is to provide guidelines on the principles of data management in the $_PROJECT and to specify which types of data will be stored. This will be achieved by using the responses to the EU questionnaire on Data Management Plans (DMP) as the DMP document.
The detailed DMP states how data will be handled during and after the project. The $_PROJECT DMP is prepared according to the Horizon 2020 and Horizon Europe online manuals. #if$_UPDATE It will be updated and its validity checked several times during the $_PROJECT project. At the very least, this will happen in month $_UPDATEMONTH. #endif$_UPDATE
What is the purpose of the data collection/generation and its relation to the objectives of the project?
The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process is absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand underlying principles, but also to trace the provenance of data and analyses. Stakeholders must also be informed about the provenance of the data. It is therefore necessary to ensure that the data are well generated and well annotated with metadata using open standards, as laid out in the next section.
What types and formats of data will the project generate/collect?
The $_PROJECT will collect and/or generate the following types of raw data: $_GENETIC, $_GENOMIC, $_TRANSCRIPTOMIC, $_RNASEQ, $_METABOLOMIC, $_PROTEOMIC, $_PHENOTYPIC, $_TARGETED, $_IMAGE, $_MODELS, $_CODE, $_EXCEL, $_CLONED-DNA data, which are related to $_STUDYOBJECT. In addition, the raw data will also be processed and modified using analytical pipelines, which may yield different results or include ad hoc data analysis parts. #if$_DATAPLANT These pipelines will be tracked in the DataPLANT ARC. #endif$_DATAPLANT Therefore, care will be taken to document and archive these resources (including the analytical pipelines) as well#if$_DATAPLANT, relying on the expertise of the DataPLANT consortium#endif$_DATAPLANT.
Will you re-use any existing data and how?
The project builds on existing data sets and relies on them. #if$_RNASEQ|$_GENOMIC For example, without a proper genomic reference it is very difficult to analyze next-generation sequencing (NGS) data sets. #endif$_RNASEQ|$_GENOMIC It is also important to include existing data sets on the expression and metabolic behavior of the $_STUDYOBJECT, and on existing background knowledge#if$_PARTNERS of the partners#endif$_PARTNERS. Genomic references can be gathered from reference databases for genomes and sequences, such as the US National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ). Furthermore, prior 'unstructured' data in the form of publications and the data contained therein will be used for decision making.
What is the origin of the data?
Public data will be extracted as described in the previous paragraph. For the $_PROJECT, specific data sets will be generated by the consortium partners.
Data of different types or representing different domains will be generated using unique approaches. For example:
Genetic data will be generated from targeted crosses and breeding experiments, and will include recombination frequencies and crossover events that position genetic markers and quantitative trait loci that can be associated with physical genomic markers/variants.
Genomic data will be created from sequencing data, which will be processed to identify genes, regulatory elements, transposable elements, and physical markers such as SNPs, microsatellites and structural variants.
The origin and assembly of cloned DNA will include (a) the source of the original vector sequence, with an Addgene reference where available, and the source of the insert DNA (e.g., amplified by PCR from a given sample or obtained from an existing library), (b) the cloning strategy (e.g., restriction endonuclease digestion/ligation, PCR, TOPO cloning, Gibson assembly, LR recombination), and (c) the verified DNA sequence of the final recombinant vector.
Methods of transcriptomics data collection will be selected from microarrays, quantitative PCR, Northern blotting, RNA immunoprecipitation, and fluorescence in situ hybridization; RNA-Seq data will be collected using separate methods, as described below.
RNA sequencing data will be generated using short-read or long-read platforms, either in house or outsourced to academic facilities or commercial services, and the raw data will be processed using established bioinformatics pipelines.
Metabolomic data will be generated by coupled chromatography and mass spectrometry using targeted or untargeted approaches.
Proteomic data will be generated using coupled chromatography and mass spectrometry for the analysis of protein abundance and protein identification, as well as additional techniques for structural analysis, the identification of post-translational modifications and the characterization of protein interactions.
Phenotypic data will be generated using phenotyping platforms and corresponding ontologies, including the number/size of organs (leaves, flowers, buds, etc.), whole-plant size, stem/root architecture (number of lateral branches/roots, etc.), organ structures/morphologies, and quantitative metrics such as color, turgor, and health/nutrition indicators, among others.
Targeted assay data (e.g., glucose and fructose concentrations or production/utilization rates) will be generated using specific equipment and methods that are fully documented in the laboratory notebook.
Image data will be generated by equipment such as cameras, scanners, and microscopes combined with software. Original images, which contain metadata such as EXIF photo information, will be archived.
Model data will be generated by using software simulations. The complete workflow, which includes the environment, runtime, parameters, and results, will be documented and archived.
Computer code will be produced by programmers.
Excel data will be generated by data analysts by using MS Office or open-source software.
#if$_PREVIOUSPROJECTS Data from previous projects such as $_PREVIOUSPROJECTS will be considered. #endif$_PREVIOUSPROJECTS
What is the expected size of the data?
We expect to generate $_RAWDATA GB of raw data and up to $_DERIVEDDATA GB of processed data.
To whom might it be useful ('data utility')?
The data will initially benefit the $_PROJECT partners, but will also be made available to selected stakeholders closely involved in the project, and then to the scientific community working on $_STUDYOBJECT. $_DATAUTILITY In addition, the general public interested in $_STUDYOBJECT can also use the data after publication. The data will be disseminated according to the $_PROJECT's dissemination and communication plan#if$_DATAPLANT, which makes use of the DataPLANT platform among other means#endif$_DATAPLANT.
Are the data produced and/or used in the project discoverable with metadata, identifiable and locatable by means of a standard identification mechanism (e.g. persistent and unique identifiers such as Digital Object Identifiers)?
All datasets will be associated with unique identifiers and will be annotated with metadata. We will use the Investigation, Study, Assay (ISA) specification for metadata creation. The $_PROJECT will rely on community standards plus additional recommendations applicable in the plant sciences, such as the #if$_PHENOTYPIC #if$_MIAPPE MIAPPE (Minimum Information About a Plant Phenotyping Experiment),#endif$_MIAPPE #endif$_PHENOTYPIC #if$_GENOMIC|$_GENETIC #if$_MIXS MIxS (Minimum Information about any (X) Sequence),#endif$_MIXS #if$_MIGSEU MigsEu (Minimum Information about a Genome Sequence: Eukaryote),#endif$_MIGSEU #if$_MIGSORG MigsOrg (Minimum Information about a Genome Sequence: Organelle),#endif$_MIGSORG #if$_MIMS MIMS (Minimum Information about a Metagenomic/Environmental Sequence),#endif$_MIMS #if$_MIMARKSSPECIMEN MIMARKSSpecimen (Minimum Information about a Marker Gene Sequence: Specimen),#endif$_MIMARKSSPECIMEN #if$_MIMARKSSURVEY MIMARKSSurvey (Minimum Information about a Marker Gene Sequence: Survey),#endif$_MIMARKSSURVEY #if$_MISAG MISAG (Minimum Information about a Single Amplified Genome),#endif$_MISAG #if$_MIMAG MIMAG (Minimum Information about a Metagenome-Assembled Genome),#endif$_MIMAG #endif$_GENOMIC|$_GENETIC #if$_TRANSCRIPTOMIC #if$_MINSEQE MINSEQE (Minimum Information about a high-throughput SEQuencing Experiment),#endif$_MINSEQE #endif$_TRANSCRIPTOMIC #if$_TRANSCRIPTOMIC #if$_MIAME MIAME (Minimum Information About a Microarray Experiment),#endif$_MIAME #endif$_TRANSCRIPTOMIC #if$_IMAGE #if$_REMBI REMBI (Recommended Metadata for Biological Images),#endif$_REMBI #endif$_IMAGE #if$_PROTEOMIC #if$_MIAPE MIAPE (Minimum Information About a Proteomics Experiment),#endif$_MIAPE #if$_MIMIX MIMIx (Minimum Information about a Molecular Interaction Experiment),#endif$_MIMIX #endif$_PROTEOMIC Unlike cross-domain minimal sets such as Dublin Core, which mostly define the submitter and the general type of data, these specific standards enable reuse by other researchers by defining properties of the plant (see the preceding section). However, minimal cross-domain annotations such as #if$_DUBLINCORE Dublin Core,#endif$_DUBLINCORE #if$_MARC21 MARC 21,#endif$_MARC21 also remain part of the $_PROJECT. #if$_DATAPLANT The core integration with DataPLANT will also allow individual releases to be tagged with a Digital Object Identifier (DOI). #endif$_DATAPLANT #if$_OTHERSTANDARDS Other standards such as $_OTHERSTANDARDINPUT are also adhered to. #endif$_OTHERSTANDARDS
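To illustrate what ISA-structured metadata can look like in practice, the following minimal sketch (not part of the template itself) builds an Investigation/Study/Assay record as plain Python and serializes it to JSON; all identifiers, titles and values are hypothetical placeholders.

    # Minimal sketch: an ISA-style metadata record expressed as plain Python
    # and serialized to JSON. All identifiers and values are placeholders.
    import json

    isa_record = {
        "investigation": {
            "identifier": "PROJ-I-001",          # hypothetical investigation ID
            "title": "Carbon and water flow in plants",
            "studies": [
                {
                    "identifier": "PROJ-S-001",   # hypothetical study ID
                    "description": "Drought stress time course",
                    "assays": [
                        {
                            "identifier": "PROJ-A-001",
                            "measurement_type": "transcription profiling",
                            "technology_type": "RNA-Seq",
                        }
                    ],
                }
            ],
        }
    }

    with open("isa_metadata.json", "w") as handle:
        json.dump(isa_record, handle, indent=2)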
What naming conventions do you follow?
Data variables will be allocated standard names. For example, genes, proteins and metabolites will be named according to approved nomenclature and conventions. These will also be linked to functional ontologies where possible. Datasets will also be named in a meaningful way to ensure readability by humans. Plant names will include traditional names, binomials, and all strain/cultivar/subspecies/variety identifiers.
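As an illustration of such a convention, the following sketch assembles a human-readable, sortable dataset name from organism, assay, date and version; the fields, separators and example values are assumptions for illustration only, not a prescribed scheme.

    # Illustrative only: one possible dataset naming convention combining
    # organism, assay type, acquisition date and version.
    from datetime import date

    def dataset_name(genus: str, species: str, cultivar: str,
                     assay: str, acquired: date, version: int = 1) -> str:
        """Build a readable, sortable dataset name such as
        'Solanum_lycopersicum_cv-MoneyMaker_rnaseq_2024-05-01_v1'."""
        parts = [genus, species, f"cv-{cultivar}", assay.lower(),
                 acquired.isoformat(), f"v{version}"]
        return "_".join(part.replace(" ", "-") for part in parts)

    print(dataset_name("Solanum", "lycopersicum", "MoneyMaker",
                       "RNASeq", date(2024, 5, 1)))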
Will search keywords be provided that optimize possibilities for re-use?
Keywords about the experiment and the consortium will be included, as well as an abstract about the data, where useful. In addition, certain keywords can be auto-generated from dense metadata and its underlying ontologies. #if$_DATAPLANT Here, DataPLANT strives to complement these with standardized DataPLANT ontology terms, which are provided where existing ontologies do not yet cover such variables. #endif$_DATAPLANT
Do you provide clear version numbers?
To maintain data integrity and facilitate reanalysis, data sets will be allocated version numbers where this is useful (raw data is considered immutable: it must not be changed and will not receive a version number). #if$_DATAPLANT This is automatically supported by the Git-based ARC infrastructure of DataPLANT. #endif$_DATAPLANT
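The following sketch illustrates one possible way to implement this policy: raw files are pinned by a checksum (treated as immutable), while derived data sets carry an explicit version number recorded in a small manifest. The file names and manifest layout are hypothetical.

    # Sketch under the assumptions stated above: immutable raw files are pinned
    # by a SHA-256 checksum, derived data sets carry an explicit version number.
    import hashlib
    import json
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Return the SHA-256 checksum used to pin an immutable raw file."""
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    raw_file = Path("raw/reads_sample1.fastq.gz")  # placeholder path
    manifest = {
        str(raw_file): {"sha256": sha256(raw_file) if raw_file.exists() else None},
        "derived/counts_table.tsv": {"version": "1.2.0"},  # bumped on every re-analysis
    }
    Path("manifest.json").write_text(json.dumps(manifest, indent=2))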
What metadata will be created? In case metadata standards do not exist in your discipline, please outline what type of metadata will be created and how.
We will use the Investigation, Study, Assay (ISA) specification for metadata creation. #if$_RNASEQ|$_GENOMIC For specific data (e.g., RNA-Seq or genomic data), we use metadata templates from the end-point repositories. #if$_MINSEQE The Minimum Information about a high-throughput SEQuencing Experiment (MINSEQE) will also be used. #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC The following metadata/minimum information standards will be used to collect metadata: #if$_GENOMIC|$_GENETIC #if$_MIXS MIxS (Minimum Information about any (X) Sequence),#endif$_MIXS #if$_MIGSEU MigsEu (Minimum Information about a Genome Sequence: Eukaryote),#endif$_MIGSEU #if$_MIGSORG MigsOrg (Minimum Information about a Genome Sequence: Organelle),#endif$_MIGSORG #if$_MIMS MIMS (Minimum Information about a Metagenomic/Environmental Sequence),#endif$_MIMS #if$_MIMARKSSPECIMEN MIMARKSSpecimen (Minimum Information about a Marker Gene Sequence: Specimen),#endif$_MIMARKSSPECIMEN #if$_MIMARKSSURVEY MIMARKSSurvey (Minimum Information about a Marker Gene Sequence: Survey),#endif$_MIMARKSSURVEY #if$_MISAG MISAG (Minimum Information about a Single Amplified Genome),#endif$_MISAG #if$_MIMAG MIMAG (Minimum Information about a Metagenome-Assembled Genome),#endif$_MIMAG #endif$_GENOMIC|$_GENETIC #if$_TRANSCRIPTOMIC #if$_MINSEQE MINSEQE (Minimum Information about a high-throughput SEQuencing Experiment),#endif$_MINSEQE #endif$_TRANSCRIPTOMIC #if$_TRANSCRIPTOMIC #if$_MIAME MIAME (Minimum Information About a Microarray Experiment),#endif$_MIAME #endif$_TRANSCRIPTOMIC #if$_IMAGE #if$_REMBI REMBI (Recommended Metadata for Biological Images),#endif$_REMBI #endif$_IMAGE #if$_PROTEOMIC #if$_MIAPE MIAPE (Minimum Information About a Proteomics Experiment),#endif$_MIAPE #if$_MIMIX MIMIx (Minimum Information about a Molecular Interaction Experiment),#endif$_MIMIX #endif$_PROTEOMIC #if$_METABOLOMIC #if$_METABOLIGHTS MetaboLights-compliant submission standards will be used for metabolomic data where this is accepted by the consortium partners. #issuewarning Some metabolomics partners consider MetaboLights not an accepted standard. #endissuewarning #endif$_METABOLIGHTS #endif$_METABOLOMIC As part of the plant research community, we use #if$_MIAPPE MIAPPE for phenotyping data in the broadest sense, but we will also rely on #endif$_MIAPPE specific SOPs for additional annotations#if$_DATAPLANT that consider advanced DataPLANT annotation and ontologies#endif$_DATAPLANT.
Which data produced and/or used in the project will be made openly available as the default? If certain datasets cannot be shared (or need to be shared under restrictions), explain why, clearly separating legal and contractual reasons from voluntary restrictions.
By default, all data sets from the $_PROJECT will be shared with the community and made openly available. However, before the data are released, all partners will be given the opportunity to check for potential IP issues (according to the consortium agreement and background IP rights). #if$_INDUSTRY This applies in particular to data pertaining to the industry partners. #endif$_INDUSTRY IP protection will be prioritized for datasets that offer the potential for exploitation.
Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if relevant provisions are made in the consortium agreement and are in line with the reasons for opting out.
How will the data be made accessible (e.g. by deposition in a repository)?
Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. In addition, wherever possible, data will be deposited in international, discipline-specific repositories that use specialized technologies:
#if$_GENETIC For genetic data: #if$_GENBANK NCBI-GenBank,#endif$_GENBANK
#if$_SRA NCBI-SRA,#endif$_SRA #if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS
EBI-ArrayExpress,#endif$_ARRAYEXPRESS #if$_GEO NCBI-GEO,#endif$_GEO #endif$_GENETIC
#if$_TRANSCRIPTOMIC For Transcriptomic data: #if$_SRA NCBI-SRA,#endif$_SRA
#if$_GEO NCBI-GEO,#endif$_GEO #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS
#endif$_TRANSCRIPTOMIC
#if$_IMAGE For image data: #if$_BIOIMAGE EBI-BioImage Archive,#endif$_BIOIMAGE
#if$_IDR IDR (Image Data Resource),#endif$_IDR #endif$_IMAGE
#if$_METABOLOMIC For metabolomic data: #if$_METABOLIGHTS
EBI-MetaboLights,#endif$_METABOLIGHTS #if$_METAWORKBENCH Metabolomics
Workbench,#endif$_METAWORKBENCH #if$_INTACT IntAct (molecular interactions),#endif$_INTACT
#endif$_METABOLOMIC
#if$_PROTEOMIC For proteomics data: #if$_PRIDE EBI-PRIDE,#endif$_PRIDE #if$_PDB
PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI ChEBI (Chemical Entities of
Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC
#if$_PHENOTYPIC For phenotypic data: #if$_edal e!DAL-PGP (Plant Genomics &
Phenomics Research Data Repository) #endif$_edal #endif$_PHENOTYPIC
Unstructured and less standardized data (e.g., experimental phenotypic measurements) will be annotated with metadata and, once complete, allocated a digital object identifier (DOI). #if$_DATAPLANT Whole datasets will also be wrapped into an ARC with allocated DOIs. The ARC and the converters provided by DataPLANT will ensure that the upload into the end-point repositories is fast and easy. #endif$_DATAPLANT
What methods or software tools are needed to access the data?
#if$_PROPRIETARY The $_PROJECT relies on the tool(s) $_PROPRIETARY. #endif$_PROPRIETARY
#if!$_PROPRIETARY No specialized software will be needed to access the data, just a modern browser. Access will be possible through web interfaces. For data processing after obtaining raw data, typical open-source software can be used. #endif!$_PROPRIETARY
#if$_DATAPLANT DataPLANT offers tools such as the open-source Swate plugin for Excel, the ARC Commander, and DataPLAN. #endif$_DATAPLANT
Is documentation about the software needed to access the data included?
#if$_DATAPLANT DataPLANT resources are well described, and their setup is documented on the GitHub project pages. #endif$_DATAPLANT All external software documentation will be duplicated locally and stored alongside the software.
Is it possible to include the relevant software (e.g. in open-source code)?
As stated above, the $_PROJECT will use publicly available open-source and well-documented certified software #if$_PROPRIETARY except for $_PROPRIETARY #endif$_PROPRIETARY.
Where will the data and associated metadata, documentation and code be deposited? Preference should be given to certified repositories that support open access, where possible.
As noted above, specialized repositories will be used for common data types. Unstructured and less standardized data (e.g., experimental phenotypic measurements) will be annotated with metadata and, once complete, allocated a digital object identifier (DOI). #if$_DATAPLANT Whole datasets will also be wrapped into an ARC with allocated DOIs. #endif$_DATAPLANT
Have you explored appropriate arrangements with the identified repository?
Submission is free of charge, and it is the goal (at least of ENA) to obtain as much data as possible. Therefore, special arrangements are neither necessary nor useful, and catch-all repositories are not required#if$_DATAPLANT; this has been confirmed for data associated with DataPLANT#endif$_DATAPLANT. #issuewarning If no data management platform such as DataPLANT is used, you need to find an appropriate repository to store or archive your data after publication. #endissuewarning
If there are restrictions on use, how will access be provided?
There are no restrictions beyond the IP screening described above, which is in line with European open data policies.
Is there a need for a data access committee?
There is no need for a data access committee.
Are there well described conditions for access (i.e. a machine-readable license)?
Yes; where possible, a machine-readable license such as CC REL will be used for data not submitted to specialized repositories such as ENA.
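As an illustration of such a machine-readable license, the sketch below emits a minimal CC REL statement in Turtle for a dataset landing page; the dataset URI is a placeholder and the exact properties used would be a project decision.

    # Illustrative CC REL statement in Turtle, built as a plain string.
    # The dataset URI below is a hypothetical placeholder.
    DATASET_URI = "https://example.org/datasets/proj-s-001"  # placeholder
    LICENSE_URI = "https://creativecommons.org/licenses/by/4.0/"

    ccrel_ttl = f"""@prefix cc:    <http://creativecommons.org/ns#> .
    @prefix xhtml: <http://www.w3.org/1999/xhtml/vocab#> .

    <{DATASET_URI}>
        xhtml:license <{LICENSE_URI}> ;
        cc:attributionName "The project consortium" .
    """

    with open("dataset_license.ttl", "w") as handle:
        handle.write(ccrel_ttl)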
How will the identity of the person accessing the data be ascertained?
Where data are shared only within the consortium, for example if the datasets are not yet finished or are undergoing IP checks, the data will be hosted internally and a username and password will be required for access (see GDPR rules). When the data are made public in EU or US repositories, completely anonymous access is normally allowed. This is the case for ENA as well, and such access is in line with GDPR requirements.
#if$_DATAPLANT Currently, data management relies on the annotated research context (ARC). It is password protected, so before any data or samples can be obtained, user authentication is required. #endif$_DATAPLANT
Are the data produced in the project interoperable, that is allowing data exchange and re-use between researchers, institutions, organizations, countries, etc. (i.e. adhering to standards for formats, as much as possible compliant with available (open) software applications, and in particular facilitating re-combinations with different datasets from different origins)?
Whenever possible, data will be stored in common and openly defined formats, including all the metadata necessary to interpret and analyze the data in a biological context. By default, no proprietary formats will be used. However, Microsoft Excel files (according to ISO/IEC 29500-1:2016) might be used as intermediates by the consortium#if$_DATAPLANT and by some ARC components#endif$_DATAPLANT. In addition, documents might be edited in word-processor formats, but will be shared as PDF.
What data and metadata vocabularies, standards or methodologies will you follow to make your data interoperable?
As noted above, we foresee using minimal standards such as #if$_RNASEQ|$_GENOMIC #if$_MINSEQE MINSEQE for sequencing data and #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC MetaboLights-compatible forms for metabolites#if$_MIAPPE, and MIAPPE for phenotyping-like data#endif$_MIAPPE. These minimal information standards will allow the integration of data across projects and their reuse according to established and tested protocols. We will also use ontological terms to enrich the data sets, relying on free and open ontologies where possible. Additional ontology terms might be created and canonized during the $_PROJECT.
Will you be using standard vocabularies for all data types present in your data set, to allow inter-disciplinary interoperability?
Open ontologies will be used where they are mature. As stated above, some ontologies and controlled vocabularies might need to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the advanced ontologies developed in DataPLANT. #endif$_DATAPLANT
In case it is unavoidable that you use uncommon or generate project specific ontologies or vocabularies, will you provide mappings to more commonly used ontologies?
Common and open ontologies will be used, so this question does not apply.
How will the data be licensed to permit the widest re-use possible?
Open licenses, such as Creative Commons (CC), will be used whenever possible.
When will the data be made available for re-use? If an embargo is sought to give time to publish or seek patents, specify why and how long this will apply, bearing in mind that research data should be made available as soon as possible.
#if$_early Some raw data will be made public as soon as it is collected and processed. #endif$_early #if$_beforepublication Relevant processed datasets will be made public when the research findings are published. #endif$_beforepublication #if$_endofproject At the end of the project, all data without an embargo period will be published. #endif$_endofproject #if$_embargo Data subject to an embargo period will not be publicly accessible until the end of the embargo period. #endif$_embargo #if$_request Data will be made available upon request, allowing controlled sharing while ensuring responsible use. #endif$_request #if$_ipissue IP issues will be checked before publication. #endif$_ipissue All consortium partners will be encouraged to make data available before publication, openly and/or under pre-publication agreements#if$_GENOMIC, such as those initiated in Fort Lauderdale and set forth by the Toronto International Data Release Workshop#endif$_GENOMIC. This will be implemented as soon as IP-related checks are complete.
Are the data produced and/or used in the project usable by third parties, in particular after the end of the project? If the re-use of some data is restricted, explain why.
There will be no restrictions once the data are made public.
How long is it intended that the data remains re-usable?
The data will be made available for many years#if$_DATAPLANT, and ideally indefinitely after the end of the project#endif$_DATAPLANT. Data submitted to repositories as detailed above (e.g., ENA or PRIDE) will be subject to those repositories' own data storage regulations.
Are data quality assurance processes described?
The data will be checked and curated. #if$_DATAPLANT Furthermore, data will be quality controlled (QC) using automatic procedures as well as manual curation. #endif$_DATAPLANT
What are the costs for making data FAIR in your project?
The $_PROJECT will bear the costs of data curation, #if$_DATAPLANT ARC consistency checks, #endif$_DATAPLANT and data maintenance/security before transfer to public repositories. Subsequent costs are then borne by the operators of these repositories.
Additionally, costs for post-publication storage are incurred by the end-point repositories (e.g., ENA); they are not charged to the $_PROJECT or its members but are covered by the operating budgets of these repositories.
How will these be covered? Note that costs related to open access to research data are eligible as part of the Horizon 2020 or Horizon Europe grant (if compliant with the Grant Agreement conditions).
The costs borne by the $_PROJECT are covered by the project funding. Pre-existing structures#if$_DATAPLANT, such as the structures, tools, and knowledge laid down in the DataPLANT consortium,#endif$_DATAPLANT will also be used.
Who will be responsible for data management in your project?
The responsible person will be $_DATAOFFICER of the $_PROJECT.
Are the resources for long term preservation discussed (costs and potential value, who decides and how/what data will be kept and for how long)?
The data officer #if$_PARTNERS or $_PARTNERS #endif$_PARTNERS will ultimately decide on the strategy to preserve data that are not submitted to end-point subject-area repositories #if$_DATAPLANT or ARCs in DataPLANT #endif$_DATAPLANT when the project ends. This will be in line with EU guidelines, institute policies, and data sharing based on EU and international standards.
What provisions are in place for data security (including data recovery as well as secure storage and transfer of sensitive data)?
Online platforms will be protected by vulnerability scanning, two-factor authentication and daily automatic backups allowing immediate recovery. All partners holding confidential project data are required to use secure platforms with automatic backups and off-site secure copies. #if$_DATAPLANT Where the DataHUB and ARCs of DataPLANT are used, data security will be enforced. This comprises secure storage and username/password protection, with credentials generally transferred via separate secure media. #endif$_DATAPLANT
Is the data safely stored in certified repositories for long term preservation and curation?
Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. In addition, wherever possible, data will be deposited in international, discipline-specific repositories that use specialized technologies:
#if$_GENETIC For genetic data: #if$_GENBANK NCBI-GenBank,#endif$_GENBANK
#if$_SRA NCBI-SRA,#endif$_SRA #if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS
EBI-ArrayExpress,#endif$_ARRAYEXPRESS #if$_GEO NCBI-GEO,#endif$_GEO #endif$_GENETIC
#if$_TRANSCRIPTOMIC For Transcriptomic data: #if$_SRA NCBI-SRA,#endif$_SRA
#if$_GEO NCBI-GEO,#endif$_GEO #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS
#endif$_TRANSCRIPTOMIC
#if$_IMAGE For image data: #if$_BIOIMAGE EBI-BioImage Archive,#endif$_BIOIMAGE
#if$_IDR IDR (Image Data Resource),#endif$_IDR #endif$_IMAGE
#if$_METABOLOMIC For metabolomic data: #if$_METABOLIGHTS
EBI-MetaboLights,#endif$_METABOLIGHTS #if$_METAWORKBENCH Metabolomics
Workbench,#endif$_METAWORKBENCH #if$_INTACT IntAct (molecular interactions),#endif$_INTACT
#endif$_METABOLOMIC
#if$_PROTEOMIC For proteomics data: #if$_PRIDE EBI-PRIDE,#endif$_PRIDE #if$_PDB
PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI ChEBI (Chemical Entities of
Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC
#if$_PHENOTYPIC For phenotypic data: #if$_edal e!DAL-PGP (Plant Genomics &
Phenomics Research Data Repository) #endif$_edal #endif$_PHENOTYPIC
Are there any ethical or legal issues that can have an impact on data sharing? These can also be discussed in the context of an ethics review. If relevant, include references to ethics deliverables and ethics chapter in the Description of the Action (DoA).
At the moment, we do not anticipate ethical or legal issues with data sharing. In terms of ethics, since this is plant data, there is no need for an ethics committee to deal with the data, although we will diligently follow the Nagoya Protocol on access and benefit sharing. #issuewarning You have to check and document any due diligence here; at the moment it is still unresolved whether the Nagoya Protocol (see Nagoya Protocol) will also cover digital sequence information. In any case, if you use material not originating from your (partner) country and characterize it physically or biochemically (e.g., metabolites, proteome, RNA-Seq), this might represent a Nagoya-relevant action, unless the material comes from, e.g., the US (not a party) or Ireland (not signed; still contact them), although other laws might apply. #endissuewarning
Is informed consent for data sharing and long term preservation included in questionnaires dealing with personal data?
The only personal data that will potentially be stored are the submitter name and affiliation in the metadata. In addition, personal data will be collected for dissemination and communication activities using specific methods and procedures developed by the $_PROJECT partners to adhere to data protection rules. #issuewarning You need to inform the persons concerned, and preferably obtain WRITTEN consent, that you store e-mail addresses and names or even pseudonyms such as Twitter handles. We are sorry about these issues; we did not invent them. #endissuewarning
Do you make use of other national/funder/sectorial/departmental procedures for data management? If yes, which ones?
Yes, the $_PROJECT will use common Research Data Management (RDM) tools such as #if$_DATAPLANT|$_NFDI resources developed by the NFDI of Germany,#endif$_DATAPLANT|$_NFDI #if$_FRENCH infrastructure developed by INRAE in France,#endif$_FRENCH #if$_EOSC and cloud services developed by the EOSC (European Open Science Cloud)#endif$_EOSC.
#if$_DATAPLANT
ARC Annotated Research Context
#endif$_DATAPLANT
CC Creative Commons
CC REL Creative Commons Rights Expression Language
DDBJ DNA Data Bank of Japan
DMP Data Management Plan
DoA Description of Action
DOI Digital Object Identifier
EBI European Bioinformatics Institute
ENA European Nucleotide Archive
EU European Union
FAIR Findable Accessible Interoperable Reusable
GDPR General Data Protection Regulation (of the EU)
IP Intellectual Property
ISO International Organization for Standardization
MIAMET Minimum Information About a Metabolomics Experiment
MIAPPE Minimum Information About a Plant Phenotyping Experiment
MinSEQe Minimum Information about a high-throughput Sequencing Experiment
NCBI National Center for Biotechnology Information
NFDI National Research Data Infrastructure (of Germany)
NGS Next Generation Sequencing
RDM Research Data Management
RNASeq RNA Sequencing
SOP Standard Operating Procedures
SRA Sequence Read Archive
#if$_DATAPLANT
SWATE Swate Workflow Annotation Tool for Excel
#endif$_DATAPLANT
ONP Oxford Nanopore
qRTPCR quantitative real time polymerase chain reaction
WP Work Package
Action Number: | $_FUNDINGPROGRAMME |
Action Acronym: | $_PROJECT |
Action Title: | $_PROJECT |
Creation Date: | $_CREATIONDATE |
Modification Date: | $_MODIFICATIONDATE |
#if$_EU The $_PROJECT is part of the Open Data Initiative (ODI) of the EU. #endif$_EU To best profit from open data, it is necessary not only to store the data but to make it Findable, Accessible, Interoperable, and Reusable (FAIR). #if$_PROTECT We support open and FAIR data; however, we also consider the need to protect individual data sets. #endif$_PROTECT
The aim of this document is to provide guidelines on the principles of data management in the $_PROJECT and to specify which data will be stored. This will be achieved by using the responses to the EU questionnaire on Data Management Plans (DMP) as the DMP document.
The detailed DMP states how data will be handled during and after the project. The $_PROJECT DMP is prepared according to the Horizon Europe online manual. #if$_UPDATE It will be updated and its validity checked several times during the $_PROJECT project. At the very least, this will happen in month $_UPDATEMONTH. #endif$_UPDATE
Will you re-use any existing data and what will you re-use it for? State the reasons if re-use of any existing data has been considered but discarded.
The project builds on existing data sets and relies on them. #if$_RNASEQ For instance, without a proper genomic reference it is very difficult to analyze NGS data sets. #endif$_RNASEQ It is also important to include existing data sets on the expression and metabolic behaviour of $_STUDYOBJECT, as well as on existing characterization and background knowledge#if$_PARTNERS of the partners#endif$_PARTNERS. Genomic references can simply be gathered from reference databases for genomes and sequences, such as the National Center for Biotechnology Information (NCBI, US), the European Bioinformatics Institute (EBI, EU), and the DNA Data Bank of Japan (DDBJ, JP). Furthermore, prior 'unstructured' data in the form of publications and the data contained therein will be used for decision making.
What types and formats of data will the project generate or re-use?
The $_PROJECT will collect and/or generate the following types of raw data: $_PHENOTYPIC, $_GENETIC, $_IMAGE, $_RNASEQ, $_GENOMIC, $_METABOLOMIC, $_PROTEOMIC, $_TARGETED, $_MODELS, $_CODE, $_EXCEL, $_CLONED-DNA data, which are related to $_STUDYOBJECT. In addition, the raw data will also be processed and modified using analytical pipelines, which may yield different results or include ad hoc data analysis parts. #if$_DATAPLANT These pipelines will be tracked in the DataPLANT ARC. #endif$_DATAPLANT Therefore, care will be taken to document and archive these resources (including the analytical pipelines) as well#if$_DATAPLANT, relying on the expertise of the DataPLANT consortium#endif$_DATAPLANT.
What is the purpose of the data generation or re-use and its relation to the objectives of the project?
The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process is absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand underlying principles, but also to trace the provenance of data and analyses. Stakeholders must also be informed about the provenance of the data. It is therefore necessary to ensure that the data are well generated and well annotated with metadata using open standards, as laid out in the next section.
What is the expected size of the data that you intend to generate or re-use?
We expect to generate raw data in the range of $_RAWDATA GB. The size of the derived data will be about $_DERIVEDDATA GB.
What is the origin/provenance of the data, either generated or re-used?
Public data will be extracted as described in the previous paragraph. For the $_PROJECT, specific data sets will be generated by the consortium partners.
Data of different types or representing different domains will be generated using unique approaches. For example:
Genetic data will be generated from targeted crosses and breeding experiments, and will include recombination frequencies and crossover events that position genetic markers and quantitative trait loci that can be associated with physical genomic markers/variants.
Genomic data will be created from sequencing data, which will be processed to identify genes, regulatory elements, transposable elements, and physical markers such as SNPs, microsatellites and structural variants.
The origin and assembly of cloned DNA will include (a) the source of the original vector sequence, with an Addgene reference where available, and the source of the insert DNA (e.g., amplified by PCR from a given sample or obtained from an existing library), (b) the cloning strategy (e.g., restriction endonuclease digestion/ligation, PCR, TOPO cloning, Gibson assembly, LR recombination), and (c) the verified DNA sequence of the final recombinant vector.
Methods of transcriptomics data collection will be selected from microarrays, quantitative PCR, Northern blotting, RNA immunoprecipitation, and fluorescence in situ hybridization; RNA-Seq data will be collected using separate methods, as described below.
RNA sequencing data will be generated using short-read or long-read platforms, either in house or outsourced to academic facilities or commercial services, and the raw data will be processed using established bioinformatics pipelines.
Metabolomic data will be generated by coupled chromatography and mass spectrometry using targeted or untargeted approaches.
Proteomic data will be generated using coupled chromatography and mass spectrometry for the analysis of protein abundance and protein identification, as well as additional techniques for structural analysis, the identification of post-translational modifications and the characterization of protein interactions.
Phenotypic data will be generated using phenotyping platforms and corresponding ontologies, including the number/size of organs (leaves, flowers, buds, etc.), whole-plant size, stem/root architecture (number of lateral branches/roots, etc.), organ structures/morphologies, and quantitative metrics such as color, turgor, and health/nutrition indicators, among others.
Targeted assay data (e.g., glucose and fructose concentrations or production/utilization rates) will be generated using specific equipment and methods that are fully documented in the laboratory notebook.
Image data will be generated by equipment such as cameras, scanners, and microscopes combined with software. Original images, which contain metadata such as EXIF photo information, will be archived.
Model data will be generated by using software simulations. The complete workflow, which includes the environment, runtime, parameters, and results, will be documented and archived.
Computer code will be produced by programmers.
Excel data will be generated by data analysts by using MS Office or open-source software.
#if$_PREVIOUSPROJECTS Data from previous projects such as $_PREVIOUSPROJECTS will be considered. #endif$_PREVIOUSPROJECTS
To whom might it be useful ('data utility'), outside your project?
The data will initially benefit the $_PROJECT partners, but will also be made available to selected stakeholders closely involved in the project, and then to the scientific community working on $_STUDYOBJECT. $_DATAUTILITY In addition, the general public interested in $_STUDYOBJECT can also use the data after publication. The data will be disseminated according to the $_PROJECT's dissemination and communication plan#if$_DATAPLANT, which makes use of the DataPLANT platform among other means#endif$_DATAPLANT.
Will data be identified by a persistent identifier?
All data sets will receive unique identifiers, and they will be annotated with metadata.
Will rich metadata be provided to allow discovery? What metadata will be created? What disciplinary or general standards will be followed? In case metadata standards do not exist in your discipline, please outline what type of metadata will be created and how.
All datasets will be associated with unique identifiers and will be annotated with metadata. We will use the Investigation, Study, Assay (ISA) specification for metadata creation. The $_PROJECT will rely on community standards plus additional recommendations applicable in the plant sciences, such as the #if$_PHENOTYPIC #if$_MIAPPE MIAPPE (Minimum Information About a Plant Phenotyping Experiment),#endif$_MIAPPE #endif$_PHENOTYPIC #if$_GENOMIC|$_GENETIC #if$_MIXS MIxS (Minimum Information about any (X) Sequence),#endif$_MIXS #if$_MIGSEU MigsEu (Minimum Information about a Genome Sequence: Eukaryote),#endif$_MIGSEU #if$_MIGSORG MigsOrg (Minimum Information about a Genome Sequence: Organelle),#endif$_MIGSORG #if$_MIMS MIMS (Minimum Information about a Metagenomic/Environmental Sequence),#endif$_MIMS #if$_MIMARKSSPECIMEN MIMARKSSpecimen (Minimum Information about a Marker Gene Sequence: Specimen),#endif$_MIMARKSSPECIMEN #if$_MIMARKSSURVEY MIMARKSSurvey (Minimum Information about a Marker Gene Sequence: Survey),#endif$_MIMARKSSURVEY #if$_MISAG MISAG (Minimum Information about a Single Amplified Genome),#endif$_MISAG #if$_MIMAG MIMAG (Minimum Information about a Metagenome-Assembled Genome),#endif$_MIMAG #endif$_GENOMIC|$_GENETIC #if$_TRANSCRIPTOMIC #if$_MINSEQE MINSEQE (Minimum Information about a high-throughput SEQuencing Experiment),#endif$_MINSEQE #endif$_TRANSCRIPTOMIC #if$_TRANSCRIPTOMIC #if$_MIAME MIAME (Minimum Information About a Microarray Experiment),#endif$_MIAME #endif$_TRANSCRIPTOMIC #if$_IMAGE #if$_REMBI REMBI (Recommended Metadata for Biological Images),#endif$_REMBI #endif$_IMAGE #if$_PROTEOMIC #if$_MIAPE MIAPE (Minimum Information About a Proteomics Experiment),#endif$_MIAPE #if$_MIMIX MIMIx (Minimum Information about a Molecular Interaction Experiment),#endif$_MIMIX #endif$_PROTEOMIC Unlike cross-domain minimal sets such as Dublin Core, which mostly define the submitter and the general type of data, these specific standards enable reuse by other researchers by defining properties of the plant (see the preceding section). However, minimal cross-domain annotations such as #if$_DUBLINCORE Dublin Core,#endif$_DUBLINCORE #if$_MARC21 MARC 21,#endif$_MARC21 also remain part of the $_PROJECT. #if$_DATAPLANT The core integration with DataPLANT will also allow individual releases to be tagged with a Digital Object Identifier (DOI). #endif$_DATAPLANT #if$_OTHERSTANDARDS Other standards such as $_OTHERSTANDARDINPUT are also adhered to. #endif$_OTHERSTANDARDS
Will search keywords be provided in the metadata to optimize the possibility for discovery and then potential re-use?
Keywords about the experiment and the general consortium will be included, as well as an abstract about the data, where useful. In addition, certain keywords can be auto-generated from dense metadata and its underlying ontologies. #if$_DATAPLANT Here, DataPLANT strives to complement these with standardized DataPLANT ontology terms, which are supplemented where existing ontologies do not yet include the variables. #endif$_DATAPLANT
Will metadata be offered in such a way that it can be harvested and indexed?
To maintain data integrity and to enable re-analysis, data sets will be given version numbers where this is useful (raw data is considered immutable: it must not be changed and will not receive a version number). #if$_DATAPLANT This is automatically supported by the Git-based ARC infrastructure of DataPLANT. #endif$_DATAPLANT Data variables will be allocated standard names. For example, genes, proteins and metabolites will be named according to approved nomenclature and conventions. These will also be linked to functional ontologies where possible. Datasets will also be named in a meaningful way to ensure readability by humans. Plant names will include traditional names, binomials, and all strain/cultivar/subspecies/variety identifiers.
Repository
Will the data be deposited in a trusted repository?
Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. In addition, wherever possible, data will be deposited in international, discipline-specific repositories that use specialized technologies:
#if$_GENETIC For genetic data: #if$_GENBANK NCBI-GenBank,#endif$_GENBANK
#if$_SRA NCBI-SRA,#endif$_SRA #if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS
EBI-ArrayExpress,#endif$_ARRAYEXPRESS #if$_GEO NCBI-GEO,#endif$_GEO #endif$_GENETIC
#if$_TRANSCRIPTOMIC For Transcriptomic data: #if$_SRA NCBI-SRA,#endif$_SRA
#if$_GEO NCBI-GEO,#endif$_GEO #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS
#endif$_TRANSCRIPTOMIC
#if$_IMAGE For image data: #if$_BIOIMAGE EBI-BioImage Archive,#endif$_BIOIMAGE
#if$_IDR IDR (Image Data Resource),#endif$_IDR #endif$_IMAGE
#if$_METABOLOMIC For metabolomic data: #if$_METABOLIGHTS
EBI-MetaboLights,#endif$_METABOLIGHTS #if$_METAWORKBENCH Metabolomics
Workbench,#endif$_METAWORKBENCH #if$_INTACT IntAct (molecular interactions),#endif$_INTACT
#endif$_METABOLOMIC
#if$_PROTEOMIC For proteomics data: #if$_PRIDE EBI-PRIDE,#endif$_PRIDE #if$_PDB
PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI ChEBI (Chemical Entities of
Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC
#if$_PHENOTYPIC For phenotypic data: #if$_edal e!DAL-PGP (Plant Genomics &
Phenomics Research Data Repository) #endif$_edal #endif$_PHENOTYPIC
Have you explored appropriate arrangements with the identified repository where your data will be deposited?
Submission is free of charge, and it is the goal (at least of ENA) to obtain as much data as possible. Therefore, special arrangements are neither necessary nor useful, and catch-all repositories are not required. #if$_DATAPLANT For DataPLANT, this has been agreed upon, as the omics repositories of the International Nucleotide Sequence Database Collaboration (INSDC) will be used. #endif$_DATAPLANT #issuewarning If no data management platform such as DataPLANT is used, you need to find an appropriate repository to store or archive your data after publication. #endissuewarning
Does the repository ensure that the data is assigned an identifier? Will the repository resolve the identifier to a digital object?
Data will be stored in the following repositories:
#if$_GENETIC For genetic data: #if$_GENBANK NCBI-GenBank,#endif$_GENBANK
#if$_SRA NCBI-SRA,#endif$_SRA #if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS
EBI-ArrayExpress,#endif$_ARRAYEXPRESS #if$_GEO NCBI-GEO,#endif$_GEO #endif$_GENETIC
#if$_TRANSCRIPTOMIC For Transcriptomic data: #if$_SRA NCBI-SRA,#endif$_SRA
#if$_GEO NCBI-GEO,#endif$_GEO #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS
#endif$_TRANSCRIPTOMIC
#if$_IMAGE For image data: #if$_BIOIMAGE EBI-BioImage Archive,#endif$_BIOIMAGE
#if$_IDR IDR (Image Data Resource),#endif$_IDR #endif$_IMAGE
#if$_METABOLOMIC For metabolomic data: #if$_METABOLIGHTS
EBI-MetaboLights,#endif$_METABOLIGHTS #if$_METAWORKBENCH Metabolomics
Workbench,#endif$_METAWORKBENCH #if$_INTACT IntAct (molecular interactions),#endif$_INTACT
#endif$_METABOLOMIC
#if$_PROTEOMIC For proteomics data: #if$_PRIDE EBI-PRIDE,#endif$_PRIDE #if$_PDB
PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI ChEBI (Chemical Entities of
Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC
#if$_PHENOTYPIC For phenotypic data: #if$_edal e!DAL-PGP (Plant Genomics &
Phenomics Research Data Repository) #endif$_edal #endif$_PHENOTYPIC
Data:
Will all data be made openly available? If certain datasets cannot be shared (or need to be shared under restricted access conditions), explain why, clearly separating legal and contractual reasons from intentional restrictions. Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if opening their data goes against their legitimate interests or other constraints as per the Grant Agreement.
By default, all data sets from the $_PROJECT will be shared with the community and made openly available. However, before the data are released, all partners will be given the opportunity to check for potential IP issues (according to the consortium agreement and background IP rights). #if$_INDUSTRY This applies in particular to data pertaining to the industry partners. #endif$_INDUSTRY IP protection will be prioritized for datasets that offer the potential for exploitation.
Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if relevant provisions are made in the consortium agreement and are in line with the reasons for opting out.
If an embargo is applied to give time to publish or seek protection of the intellectual property (e.g. patents), specify why and how long this will apply, bearing in mind that research data should be made available as soon as possible.
#if$_early Some raw data will be made public as soon as it is collected and processed. #endif$_early #if$_beforepublication Relevant processed datasets will be made public when the research findings are published. #endif$_beforepublication #if$_endofproject At the end of the project, all data without an embargo period will be published. #endif$_endofproject #if$_embargo Data subject to an embargo period will not be publicly accessible until the end of the embargo period. #endif$_embargo #if$_request Data will be made available upon request, allowing controlled sharing while ensuring responsible use. #endif$_request #if$_ipissue IP issues will be checked before publication. #endif$_ipissue All consortium partners will be encouraged to make data available before publication, openly and/or under pre-publication agreements#if$_GENOMIC, such as those initiated in Fort Lauderdale and set forth by the Toronto International Data Release Workshop#endif$_GENOMIC. This will be implemented as soon as IP-related checks are complete.
Will the data be accessible through a free and standardized access protocol?
#if$_DATAPLANT DataPLANT stores data in the ARC, which is a Git repository. The DataHUB shares data and metadata as a GitLab instance. The Git and web protocols are open and freely accessible. In addition, #endif$_DATAPLANT Zenodo and the end-point repositories will also be used for access. In general, web-based access protocols are free and standardized.
If there are restrictions on use, how will access be provided to the data, both during and after the end of the project?
There are no restrictions beyond the aforementioned IP checks, which are in line with, e.g., European open data policies.
How will the identity of the person accessing the data be ascertained?
Where data are shared only within the consortium, for example if the data are not yet finished or are undergoing IP checks, the data will be hosted internally and a username and password will be required for access (see also our GDPR rules). When the data are made public in EU or US repositories, completely anonymous access is normally allowed. This is the case for ENA as well, and such access is in line with GDPR requirements. #if$_DATAPLANT Currently, data management relies on the annotated research context (ARC). It is password protected, so user authentication is required before any data or samples can be obtained. #endif$_DATAPLANT
Is there a need for a data access committee (e.g. to evaluate/approve access requests to personal/sensitive data)?
There is no need for a data access committee.
Metadata:
Will metadata be made openly available and licenced under a public domain dedication CC0, as per the Grant Agreement? If not, please clarify why. Will metadata contain information to enable the user to access the data?
Yes; where possible, a machine-readable license such as CC REL will be used for data not submitted to specialized repositories such as ENA.
How long will the data remain available and findable? Will metadata be guaranteed to remain available after data is no longer available?
The data will be made available for many years#if$_DATAPLANT, and ideally indefinitely after the end of the project#endif$_DATAPLANT. In any case, data submitted to repositories as detailed above (e.g., ENA or PRIDE) will be subject to those repositories' own data storage regulations.
Will documentation or reference about any software be needed to access or read the data be included? Will it be possible to include the relevant software (e.g. in open source code)?
#if$_PROPRIETARY The $_PROJECT relies on the tool(s) $_PROPRIETARY. #endif$_PROPRIETARY #if!$_PROPRIETARY No specialized software will be needed to access the data, usually just a modern browser. Access will be possible through web interfaces. For data processing after obtaining raw data, typical open-source software can be used. #endif!$_PROPRIETARY #if$_DATAPLANT DataPLANT offers tools such as the open-source Swate plugin for Excel, the ARC Commander, and the DataPLAN DMP tool, which make interaction with the data more convenient. #endif$_DATAPLANT #if$_DATAPLANT DataPLANT resources are well described, and their setup is documented on the GitHub project pages. #endif$_DATAPLANT As stated above, we use publicly available, open-source and well-documented certified software#if$_PROPRIETARY, except for $_PROPRIETARY#endif$_PROPRIETARY.
What data and metadata vocabularies, standards, formats or methodologies will you follow to make your data interoperable to allow data exchange and re-use within and across disciplines? Will you follow community-endorsed interoperability best practices? Which ones?
As noted above, we foresee using minimal standards such as the #if$_PHENOTYPIC #if$_MIAPPE MIAPPE (Minimum Information About a Plant Phenotyping Experiment),#endif$_MIAPPE #endif$_PHENOTYPIC #if$_GENOMIC|$_GENETIC #if$_MIXS MIxS (Minimum Information about any (X) Sequence),#endif$_MIXS #if$_MIGSEU MigsEu (Minimum Information about a Genome Sequence: Eukaryote),#endif$_MIGSEU #if$_MIGSORG MigsOrg (Minimum Information about a Genome Sequence: Organelle),#endif$_MIGSORG #if$_MIMS MIMS (Minimum Information about a Metagenomic/Environmental Sequence),#endif$_MIMS #if$_MIMARKSSPECIMEN MIMARKSSpecimen (Minimum Information about a Marker Gene Sequence: Specimen),#endif$_MIMARKSSPECIMEN #if$_MIMARKSSURVEY MIMARKSSurvey (Minimum Information about a Marker Gene Sequence: Survey),#endif$_MIMARKSSURVEY #if$_MISAG MISAG (Minimum Information about a Single Amplified Genome),#endif$_MISAG #if$_MIMAG MIMAG (Minimum Information about a Metagenome-Assembled Genome),#endif$_MIMAG #endif$_GENOMIC|$_GENETIC #if$_TRANSCRIPTOMIC #if$_MINSEQE MINSEQE (Minimum Information about a high-throughput SEQuencing Experiment),#endif$_MINSEQE #endif$_TRANSCRIPTOMIC #if$_TRANSCRIPTOMIC #if$_MIAME MIAME (Minimum Information About a Microarray Experiment),#endif$_MIAME #endif$_TRANSCRIPTOMIC #if$_IMAGE #if$_REMBI REMBI (Recommended Metadata for Biological Images),#endif$_REMBI #endif$_IMAGE #if$_PROTEOMIC #if$_MIAPE MIAPE (Minimum Information About a Proteomics Experiment),#endif$_MIAPE #if$_MIMIX MIMIx (Minimum Information about a Molecular Interaction Experiment),#endif$_MIMIX #endif$_PROTEOMIC Unlike cross-domain minimal sets such as Dublin Core, which mostly define the submitter and the general type of data, these specific standards enable reuse by other researchers by defining properties of the plant (see the preceding section). However, minimal cross-domain annotations such as #if$_DUBLINCORE Dublin Core,#endif$_DUBLINCORE #if$_MARC21 MARC 21,#endif$_MARC21 also remain part of the $_PROJECT. #if$_DATAPLANT The core integration with DataPLANT will also allow individual releases to be tagged with a Digital Object Identifier (DOI). #endif$_DATAPLANT #if$_OTHERSTANDARDS Other standards such as $_OTHERSTANDARDINPUT are also adhered to. #endif$_OTHERSTANDARDS
Whenever possible, data will be stored in common and openly defined formats, including all the metadata necessary to interpret and analyze the data in a biological context. By default, no proprietary formats will be used. However, Microsoft Excel files (according to ISO/IEC 29500-1:2016) might be used as intermediates by the consortium#if$_DATAPLANT and by some ARC components#endif$_DATAPLANT. In addition, documents might be edited in word-processor formats, but will be shared as PDF. Open ontologies will be used where they are mature. As stated above, some ontologies and controlled vocabularies might need to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the advanced ontologies developed in DataPLANT. #endif$_DATAPLANT
In case it is unavoidable that you use uncommon or generate project specific ontologies or vocabularies, will you provide mappings to more commonly used ontologies? Will you openly publish the generated ontologies or vocabularies to allow reusing, refining or extending them?
Common and open ontologies will be used. In fact, open biomedical ontologies will be used where they are mature. As stated in the previous question, ontologies and controlled vocabularies might sometimes have to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the DataPLANT biology ontology (DPBO) developed in DataPLANT. #endif$_DATAPLANT Ontology databases such as the OBO Foundry will be used to publish ontologies. #if$_DATAPLANT The DPBO is also published on GitHub at https://github.com/nfdi4plants/nfdi4plants_ontology. #endif$_DATAPLANT
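As an illustration only, the following minimal Python sketch shows one way project-specific vocabulary terms could be mapped to more commonly used ontology identifiers before publication; the terms and ontology IDs shown are hypothetical placeholders, not actual $_PROJECT vocabulary.

    # Minimal sketch (hypothetical terms and IDs): map project-specific vocabulary
    # to commonly used ontology identifiers before data publication.
    PROJECT_TO_ONTOLOGY = {
        # project-specific term -> (ontology, identifier, preferred label)
        "leaf_area_scan": ("PO", "PO:0025034", "leaf"),                # placeholder mapping
        "rosette_diameter_mm": ("TO", "TO:0000207", "plant height"),   # placeholder mapping
    }

    def map_term(term):
        """Return the ontology mapping for a project-specific term, or None if unmapped."""
        return PROJECT_TO_ONTOLOGY.get(term)

    # Terms without a mapping would be candidates for publishing as new ontology terms.
    unmapped = [t for t in ("leaf_area_scan", "new_custom_assay") if map_term(t) is None]
    print("Terms still requiring an ontology mapping:", unmapped)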
Will your data include qualified references to other data (e.g. other data from your project, or datasets from previous research)?
References to other data will be made in the form of DOIs and ontology terms.
How will you provide documentation needed to validate data analysis and facilitate data re-use (e.g. readme files with information on methodology, codebooks, data cleaning, analyses, variable definitions, units of measurement, etc.)?
The documentation will be provided in the form of ISA (Investigation Study Assay) and CWL (Common Workflow Language). #if$_DATAPLANT Here, the $_PROJECT will build on the ARC container, which includes all the data, metadata, and documentation. #endif$_DATAPLANT
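To make this concrete, the following minimal Python sketch records ISA-style documentation as machine-readable JSON next to the data; the field names and file paths are illustrative assumptions and do not follow the full ISA-JSON schema.

    # Minimal sketch: store ISA-style (Investigation / Study / Assay) documentation
    # as JSON next to the data. Field names and paths are illustrative only.
    import json

    documentation = {
        "investigation": {"identifier": "$_PROJECT", "title": "$_PROJECTAIM"},
        "studies": [{
            "identifier": "study-001",
            "assays": [{
                "identifier": "assay-001",
                "measurement_type": "transcription profiling",
                "technology_type": "RNA-Seq",
                "protocol_ref": "protocols/rnaseq_library_prep.md",
                "workflow_ref": "workflows/rnaseq_quantification.cwl",  # CWL description of the analysis
            }],
        }],
    }

    with open("isa_documentation.json", "w", encoding="utf-8") as handle:
        json.dump(documentation, handle, indent=2)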
Will your data be made freely available in the public domain to permit the widest re-use possible? Will your data be licensed using standard reuse licenses, in line with the obligations set out in the Grant Agreement?
Yes, our data will be made freely available in the public domain to permit the widest re-use possible. Open licenses, such as Creative Commons (CC), will be used whenever possible.
Will the data produced in the project be useable by third parties, in particular after the end of the project?
There will be no restrictions once the data is made public.
Will the provenance of the data be thoroughly documented using the appropriate standards? Describe all relevant data quality assurance processes.
The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process is absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand principles; stakeholders must also be informed about the provenance of the data and of its analysis. It is therefore necessary to ensure that the data are well generated and also well annotated with metadata using open standards, as laid out in the next section.
Describe all relevant data quality assurance processes. Further to the FAIR principles, DMPs should also address research outputs other than data, and should carefully consider aspects related to the allocation of resources, data security and ethical aspects.
The data will be checked and curated using a data collection protocol, personnel training, data cleaning, data analysis, and quality control. #if$_DATAPLANT Furthermore, data will be analyzed for quality control (QC) problems using automatic procedures as well as by manual curation. #endif$_DATAPLANT All data quality assurance processes, including the data collection protocol, data cleaning procedures, data analysis techniques, and quality control measures, will be documented. This documentation will be kept for future reference and will be made available to stakeholders upon request.
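As a sketch of what such a documented QC step could look like (assuming tabular data with hypothetical column names and thresholds), a simple automated check might be run and its result kept as a written record:

    # Minimal sketch: run simple automated quality-control checks on a tabular
    # data set and keep a written QC record for future reference.
    import csv, json, datetime

    def qc_check(path, required_columns=("sample_id", "value")):
        with open(path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
        missing_cols = [c for c in required_columns if rows and c not in rows[0]]
        empty_cells = sum(1 for row in rows for v in row.values() if v == "")
        return {
            "file": path,
            "checked_at": datetime.datetime.now().isoformat(timespec="seconds"),
            "n_rows": len(rows),
            "missing_required_columns": missing_cols,
            "empty_cells": empty_cells,
            "passed": not missing_cols and empty_cells == 0,
        }

    report = qc_check("assay_results.csv")  # hypothetical file name
    with open("qc_report.json", "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)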
In addition to the management of data, beneficiaries should also consider and plan for the management of other research outputs that may be generated or re-used throughout their projects. Such outputs can be either digital (e.g. software, workflows, protocols, models, etc.) or physical (e.g. new materials, antibodies, reagents, samples, etc.).
In the current data management plan, any digital output, including but not limited to software, workflows, protocols, models, documents, templates, and notebooks, is treated as data. Therefore, all aforementioned digital objects are already described in detail. For the non-digital objects, the data management plan will be closely connected to the digitalisation of the physical objects. #if$_DATAPLANT $_PROJECT will build a workflow which connects the ARC with an electronic lab notebook in order to also manage the physical objects. #endif$_DATAPLANT
Beneficiaries should consider which of the questions pertaining to FAIR data above, can apply to the management of other research outputs, and should strive to provide sufficient detail on how their research outputs will be managed and shared, or made available for re-use, in line with the FAIR principles.
Open licenses, such as Creative Commons CC, will be used whenever possible even on the other digital objects.
What will the costs be for making data or other research outputs FAIR in your project (e.g. direct and indirect costs related to storage, archiving, re-use, security, etc.)?
The $_PROJECT will bear the costs of data curation, #if$_DATAPLANT ARC consistency checks, #endif$_DATAPLANT and data maintenance/security before transfer to public repositories. Subsequent costs are then borne by the operators of these repositories.
Additionally, costs for post-publication storage are incurred by the end-point repositories (e.g. ENA); these are covered by the operating budgets of those repositories rather than charged to the $_PROJECT or its members.
How will these be covered? Note that costs related to research data/output management are eligible as part of the Horizon Europe grant (if compliant with the Grant Agreement conditions)
The costs borne by the $_PROJECT are covered by the project funding. Pre-existing structures#if$_DATAPLANT, such as the tools and knowledge laid down in the DataPLANT consortium,#endif$_DATAPLANT will also be used.
Who will be responsible for data management in your project?
The responsible person will be $_DATAOFFICER of the $_PROJECT.
How will long term preservation be ensured? Discuss the necessary resources to accomplish this (costs and potential value, who decides and how, what data will be kept and for how long)?
The data officer #if$_PARTNERS or $_PARTNERS #endif$_PARTNERS will ultimately decide on the strategy to preserve data that are not submitted to end-point subject-area repositories #if$_DATAPLANT or ARCs in DataPLANT #endif$_DATAPLANT when the project ends. This will be in line with EU guidelines, institute policies, and data sharing based on EU and international standards.
What provisions are or will be in place for data security (including data recovery as well as secure storage/archiving and transfer of sensitive data)?
Online platforms will be protected by vulnerability scanning, two-factor authentication and daily automatic backups allowing immediate recovery. All partners holding confidential project data are required to use secure platforms with automatic backups and secure offsite copies. #if$_DATAPLANT Once DataHUB and ARCs have been generated in DataPLANT, data security will be enforced. This comprises secure storage; passwords and usernames are generally transferred via separate secure media.#endif$_DATAPLANT
Will the data be safely stored in trusted repositories for long term preservation and curation?
Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. Besides this, it will be ensured that data that can be stored in international, discipline-related repositories using specialized technologies are deposited there:
#if$_GENETIC For genetic data: #if$_GENBANK NCBI-GenBank,#endif$_GENBANK
#if$_SRA NCBI-SRA,#endif$_SRA #if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS
EBI-ArrayExpress,#endif$_ARRAYEXPRESS #if$_GEO NCBI-GEO,#endif$_GEO #endif$_GENETIC
#if$_TRANSCRIPTOMIC For Transcriptomic data: #if$_SRA NCBI-SRA,#endif$_SRA
#if$_GEO NCBI-GEO,#endif$_GEO #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS
#endif$_TRANSCRIPTOMIC
#if$_IMAGE For image data: #if$_BIOIMAGE EBI-BioImage Archive,#endif$_BIOIMAGE
#if$_IDR IDR (Image Data Resource),#endif$_IDR #endif$_IMAGE
#if$_METABOLOMIC For metabolomic data: #if$_METABOLIGHTS
EBI-MetaboLights,#endif$_METABOLIGHTS #if$_METAWORKBENCH Metabolomics
Workbench,#endif$_METAWORKBENCH #if$_INTACT Intact (Molecular interactions),#endif$_INTACT
#endif$_METABOLOMIC
#if$_PROTEOMIC For proteomics data: #if$_PRIDE EBI-PRIDE,#endif$_PRIDE #if$_PDB
PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI Chebi (Chemical Entities of
Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC
#if$_PHENOTYPIC For phenotypic data: #if$_edal e!DAL-PGP (Plant Genomics &
Phenomics Research Data Repository) #endif$_edal #endif$_PHENOTYPIC
Are there, or could there be, any ethics or legal issues that can have an impact on data sharing? These can also be discussed in the context of the ethics review. If relevant, include references to ethics deliverables and ethics chapter in the Description of the Action (DoA).
At the moment, we do not anticipate ethical or legal issues with data sharing. In terms of ethics, since this is plant data, there is no need for an ethics committee; however, due diligence regarding plant resource benefit sharing is considered. #issuewarning You have to check and enter any due-diligence measures here; at the moment it is still open whether the Nagoya Protocol (🡺 see Nagoya Protocol) will also cover sequence information. In any case, if you use material that does not originate from your (partner) country and characterize it physically, e.g. metabolites, proteome, or biochemically via RNASeq, this might represent a Nagoya-relevant action unless the material comes from, e.g., the US (non-party) or Ireland (not signed; still contact them), but other laws might apply. #endissuewarning
Will informed consent for data sharing and long term preservation be included in questionnaires dealing with personal data?
The only personal data that will potentially be stored is the submitter's name and affiliation in the metadata of the data. In addition, personal data will be collected for dissemination and communication activities using specific methods and procedures developed by the $_PROJECT partners to adhere to data protection. #issuewarning You need to inform the persons concerned and preferably obtain WRITTEN consent before storing e-mails and names or even pseudonyms such as Twitter handles; we are very sorry about these issues, we did not invent them. #endissuewarning
Do you, or will you, make use of other national/funder/sectorial/departmental procedures for data management? If yes, which ones (please list and briefly describe them)?
Yes, the $_PROJECT will use common Research Data Management (RDM) tools such as #if$_DATAPLANT|$_NFDI resources developed by the NFDI of Germany,#endif$_DATAPLANT|$_NFDI #if$_FRENCH infrastructure developed by INRAe from France, #endif$_FRENCH #if$_EOSC and cloud services developed by EOSC (European Open Science Cloud)#endif$_EOSC.
#if$_DATAPLANT
ARC Annotated Research Context
#endif$_DATAPLANT
CC Creative Commons
CC REL Creative Commons Rights Expression Language
DDBJ DNA Data Bank of Japan
DMP Data Management Plan
DoA Description of Action
DOI Digital Object Identifier
EBI European Bioinformatics Institute
ENA European Nucleotide Archive
EU European Union
FAIR Findable, Accessible, Interoperable, Reusable
GDPR General Data Protection Regulation (of the EU)
IP Intellectual Property
ISO International Organization for Standardization
MIAMET Minimal Information about Metabolite experiment
MIAPPE Minimal Information about Plant Phenotyping Experiment
MinSEQe Minimum Information about a high-throughput Sequencing Experiment
NCBI National Center for Biotechnology Information
NFDI National Research Data Infrastructure (of Germany)
NGS Next Generation Sequencing
RDM Research Data Management
RNASeq RNA Sequencing
SOP Standard Operating Procedures
SRA Sequence Read Archive
#if$_DATAPLANT
SWATE Swate Workflow Annotation Tool for Excel
#endif$_DATAPLANT
ONP Oxford Nanopore
qRTPCR quantitative real time polymerase chain reaction
WP Work Package
#if$_EU The $_PROJECT is part of the Open Data Initiative (ODI) of the EU. #endif$_EU To best profit from open data, it is necessary not only to store data but to make data Findable, Accessible, Interoperable and Reusable (FAIR). #if$_PROTECT While supporting open and FAIR data, we also consider the need to protect individual data sets. #endif$_PROTECT
The aim of this document is to provide guidelines on the principles of data management in the $_PROJECT and to specify which data will be stored; this is achieved by using the responses to the DFG Data Management Plan (DMP) checklist to generate a DMP document.
The detailed DMP states how data will be handled during and after the project. The $_PROJECT DMP is prepared according to the DFG data management checklist. #if$_UPDATE It will be updated/its validity checked during the $_PROJECT project several times. At the very least, this will happen at month $_UPDATEMONTH. #endif$_UPDATE
1.2 How does your project generate new data?
Data of different types or of different domains will be generated differently. For example:
Methods of transcriptomics data collection will be selected from microarrays, quantitative PCR, Northern blotting, RNA immunoprecipitation, and fluorescence in situ hybridization. RNA-Seq data will be collected using separate methods.
RNA sequencing will be generated using short-read or long-read platforms, either in house or outsourced to academic facilities or commercial services, and the raw data will be processed using established bioinformatics pipelines.
Metabolomic data will be generated using chromatography coupled to mass spectrometry and, mostly, enzymatic assay platforms.
Proteomic data will be generated using an EU platform which is in line with community standards.
Image data will be generated using equipment (cameras, scanners, and microscopes) or software. Original images, which contain metadata such as EXIF photo information, will be archived.
Genomic data will be created from sequencing data. The sequencing data will be collected with Next Generation Sequencing (NGS) equipment#if$_PARTNERS or obtained from partners#endif$_PARTNERS. The sequencing data will then be processed to obtain the genomic data.
Genetic data will be generated by using Next Generation Sequencing (NGS) equipment.
Targeted assays (e.g. glucose and fructose content) will be generated using specific equipment or experiments. The procedure is fully documented in the lab book.
Model data will be generated by software simulations. The complete workflow, which includes the environment, runtime, parameters and results, will be documented and archived (see the sketch after this list).
The code data will be generated by programmers.
The Excel data will be generated by experimentalists or data analysts by using Office or open-source software.
The cloned DNA data will be generated by using a sequencing tool.
Phenotypic data will be generated using phenotyping platforms.
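For the model data mentioned above, a minimal Python sketch of how a simulation run could be documented and archived in a single record is shown here; the model, parameter names and output values are hypothetical.

    # Minimal sketch: archive one model simulation run (environment, runtime,
    # parameters, results) as a single JSON record. Parameters are hypothetical.
    import json, platform, sys, time

    def run_simulation(params):
        # Placeholder for the actual model; returns hypothetical results.
        return {"biomass_g": 42.0 * params["growth_rate"]}

    params = {"growth_rate": 0.8, "days": 30}
    start = time.time()
    results = run_simulation(params)
    record = {
        "environment": {"python": sys.version, "platform": platform.platform()},
        "runtime_seconds": round(time.time() - start, 3),
        "parameters": params,
        "results": results,
    }
    with open("simulation_run_001.json", "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)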
The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process is absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand principles; stakeholders must also be informed about the provenance of the data and of its analysis. It is therefore necessary to ensure that the data are well generated and also well annotated with metadata using open standards, as laid out in the next section.
Public data will be extracted as described in paragraph 1.3. For the $_PROJECT, specific data sets will be generated by the consortium partners.
1.3 Is existing data reused?
The project builds on existing data sets and relies on them. #if$_RNASEQ For instance, without a proper genomic reference it is very difficult to analyze NGS data sets.#endif$_RNASEQ It is also important to include existing data sets on the expression and metabolic behaviour of $_STUDYOBJECT, as well as existing characterization and background knowledge#if$_PARTNERS of the partners#endif$_PARTNERS. Genomic references can simply be gathered from reference databases for genomes/sequences, like the National Center for Biotechnology Information: NCBI (US); European Bioinformatics Institute: EBI (EU); DNA Data Bank of Japan: DDBJ (JP). Furthermore, prior 'unstructured' data in the form of publications and the data contained therein will be used for decision making.
1.4 Which data types (in terms of data formats like image data, text data or measurement data) arise in your project and in what way are they further processed?
We foresee that the following data about $_STUDYOBJECT will be collected and generated at the very least: $_PHENOTYPIC, $_GENETIC, $_GENOMIC, $_METABOLOMIC, $_RNASEQ, $_IMAGE, $_PROTEOMIC, $_TARGETED, $_MODELS, $_CODE, $_EXCEL, $_CLONED-DNA and result data. Furthermore, data derived from the original raw data sets will also be collected. This is important, as different analytical pipelines might yield different results or include ad-hoc data analysis parts#if$_DATAPLANT and these pipelines will be tracked in the DataPLANT ARC#endif$_DATAPLANT. Therefore, specific care will be taken, to document and archive these resources (including the analytic pipelines) as well#if$_DATAPLANT relying on the vast expertise in the DataPLANT consortium #endif$_DATAPLANT.
1.5 To what extent do these arise or what is the anticipated data volume?
We expect to generate raw data in the range of $_RAWDATA GB of data. The size of the derived data will be about $_DERIVEDDATA GB.
All datasets will be associated with unique identifiers and will be annotated with metadata. We will use the Investigation, Study, Assay (ISA) specification for metadata creation. The $_PROJECT will rely on community standards plus additional recommendations applicable in the plant sciences, such as the #if$_PHENOTYPIC #if$_MIAPPE MIAPPE (Minimum Information About a Plant Phenotyping Experiment),#endif$_MIAPPE #endif$_PHENOTYPIC #if$_GENOMIC|$_GENETIC #if$_MIXS MIxS (Minimum Information about any (X) Sequence),#endif$_MIXS #if$_MIGSEU MigsEu (Minimum Information about a Genome Sequence: Eukaryote),#endif$_MIGSEU #if$_MIGSORG MigsOrg (Minimum Information about a Genome Sequence: Organelle),#endif$_MIGSORG #if$_MIMS MIMS (Minimum Information about a Metagenome or Environmental Sequence),#endif$_MIMS #if$_MIMARKSSPECIMEN MIMARKSSpecimen (Minimal Information about a Marker Specimen: Specimen),#endif$_MIMARKSSPECIMEN #if$_MIMARKSSURVEY MIMARKSSurvey (Minimal Information about a Marker Specimen: Survey),#endif$_MIMARKSSURVEY #if$_MISAG MISAG (Minimum Information about a Single Amplified Genome),#endif$_MISAG #if$_MIMAG MIMAG (Minimum Information about a Metagenome-Assembled Genome),#endif$_MIMAG #endif$_GENOMIC|$_GENETIC #if$_TRANSCRIPTOMIC #if$_MINSEQE MINSEQE (Minimum Information about a high-throughput SEQuencing Experiment),#endif$_MINSEQE #endif$_TRANSCRIPTOMIC #if$_TRANSCRIPTOMIC #if$_MIAME MIAME (Minimum Information About a Microarray Experiment),#endif$_MIAME #endif$_TRANSCRIPTOMIC #if$_IMAGE #if$_REMBI REMBI (Recommended Metadata for Biological Images),#endif$_REMBI #endif$_IMAGE #if$_PROTEOMIC #if$_MIAPE MIAPE (Minimum Information About a Proteomics Experiment),#endif$_MIAPE #if$_MIMIX MIMix (Minimum Information about a Molecular Interaction experiment),#endif$_MIMIX #endif$_PROTEOMIC Unlike cross-domain minimal sets such as Dublin Core, which mostly define the submitter and the general type of data, these specific standards allow reusability by other researchers by defining properties of the plant (see the preceding section). However, minimal cross-domain annotations #if$_DUBLINCORE Dublin Core,#endif$_DUBLINCORE #if$_MARC21 MARC 21,#endif$_MARC21 also remain part of the $_PROJECT. #if$_DATAPLANT The core integration with DataPLANT will also allow individual releases to be tagged with a Digital Object Identifier (DOI). #endif$_DATAPLANT #if$_OTHERSTANDARDS Other standards such as $_OTHERSTANDARDINPUT are also adhered to. #endif$_OTHERSTANDARDS
Open ontologies will be used where they are mature. As stated above, some ontologies and controlled vocabularies might need to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the advanced ontologies developed in DataPLANT. #endif$_DATAPLANT Keywords about the experiment and the general consortium will be included, as well as an abstract about the data, where useful. In addition, certain keywords can be auto-generated from dense metadata and its underlying ontologies. #if$_DATAPLANT Here, DataPLANT strives to complement these with standardized DataPLANT ontologies that are supplemented where the ontology does not yet include the variables. #endif$_DATAPLANT
In fact, open biomedical ontologies will be used where they are mature. As stated in the previous question, sometimes ontologies and controlled vocabularies might have to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the advanced ontologies developed in DataPLANT. #endif$_DATAPLANT
The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process is absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand principles; stakeholders must also be informed about the provenance of the data and of its analysis. It is therefore necessary to ensure that the data are well generated and also well annotated with metadata using open standards. Data variables will be allocated standard names. For example, genes, proteins and metabolites will be named according to approved nomenclature and conventions. These will also be linked to functional ontologies where possible. Datasets will also be named in a meaningful way to ensure readability by humans. Plant names will include traditional names, binomials, and all strain/cultivar/subspecies/variety identifiers.
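As an illustration of such a naming convention (the exact pattern is an assumption, not a fixed $_PROJECT rule), a small Python helper could assemble human-readable data set names from standard components:

    # Minimal sketch: build human-readable data set names from standard components
    # (species binomial, cultivar, assay, date). The pattern is illustrative only.
    import re
    from datetime import date

    def dataset_name(binomial, cultivar, assay, when):
        parts = [binomial, cultivar, assay, when.isoformat()]
        return "_".join(re.sub(r"[^A-Za-z0-9.-]+", "-", p).strip("-") for p in parts)

    print(dataset_name("Solanum lycopersicum", "Moneymaker", "RNA-Seq leaf", date(2024, 5, 2)))
    # -> Solanum-lycopersicum_Moneymaker_RNA-Seq-leaf_2024-05-02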
To maintain data integrity and to be able to re-analyze data, data sets will get version numbers where this is useful (e.g. raw data must not be changed, will not get a version number, and is considered immutable). #if$_DATAPLANT This is automatically supported by the ARC Git DataPLANT infrastructure. #endif$_DATAPLANT
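One simple way to enforce this immutability, sketched below under the assumption that checksums are recorded once at acquisition (paths and values are placeholders), is to verify raw data against stored checksums before any re-analysis, while derived data sets carry explicit version numbers in their names:

    # Minimal sketch: verify that raw data is unchanged (immutable) via checksums;
    # derived data sets carry explicit version numbers in their file names.
    import hashlib

    def sha256(path):
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    RAW_CHECKSUMS = {"raw/reads_sample1.fastq.gz": "<checksum recorded once at acquisition>"}

    def verify_raw(path):
        return sha256(path) == RAW_CHECKSUMS[path]

    derived_name = "derived/expression_matrix_v2.tsv"  # version number in the name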
As mentioned above, we foresee using e.g. #if$_RNASEQ|$_GENOMIC #if$_MINSEQE MinSEQe for sequencing data and #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC Metabolights-compatible forms for metabolites#if$_MIAPPE as well as MIAPPE for phenotyping-like data#endif$_MIAPPE. The latter will thus allow the integration of data across projects and safeguard the reuse of established and tested protocols. Additionally, we will use ontology terms to enrich the data sets, relying on free and open ontologies. Further ontology terms might be created and canonized during the $_PROJECT.
The data will be checked and curated throughout the project period. #if$_DATAPLANT Furthermore, data will be analyzed for quality control (QC) problems using automatic procedures as well as by manual curation. #endif$_DATAPLANT PhD students and lab professionals will be responsible for the first-hand quality control. Afterwards, the data will be checked and annotated by $_DATAOFFICER. #if$_RNASEQ|$_GENOMIC FastQC will be run on the base-called reads. #endif$_RNASEQ|$_GENOMIC Before publication, the data will be controlled again.
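A minimal sketch of how such a FastQC run could be scripted is given below; it assumes the fastqc executable is installed and on the PATH and that read files live under a raw/ directory, both of which are assumptions rather than $_PROJECT specifics.

    # Minimal sketch: run FastQC on base-called reads from Python.
    import glob, pathlib, subprocess

    out_dir = pathlib.Path("qc/fastqc")
    out_dir.mkdir(parents=True, exist_ok=True)
    fastq_files = sorted(glob.glob("raw/*.fastq.gz"))
    if fastq_files:
        subprocess.run(["fastqc", "-o", str(out_dir), *fastq_files], check=True)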
The $_PROJECT will use common Research Data Management (RDM) tools such as #if$_DATAPLANT|$_NFDI resources developed by the NFDI of Germany,#endif$_DATAPLANT|$_NFDI #if$_FRENCH infrastructure developed by INRAe from France, #endif$_FRENCH #if$_EOSC and cloud services developed by EOSC (European Open Science Cloud)#endif$_EOSC.
#if$_PROPRIETARY The $_PROJECT relies on the tool(s) $_PROPRIETARY. #endif$_PROPRIETARY
#if!$_PROPRIETARY No specialized software will be needed to access the data, usually just a modern browser. Access will be possible through web interfaces. For data processing after obtaining raw data, typical open-source software can be used. As no proprietary software is needed, no documentation needs to be provided. #endif!$_PROPRIETARY
#if$_DATAPLANT However, DataPLANT resources are well described, and their setup is documented on their GitHub project pages. #endif$_DATAPLANT
#if$_DATAPLANT DataPLANT offers tools such as the open-source SWATE plugin for Excel, the ARC commander, and the DMP tool which, while not strictly required, make the interaction with data more convenient. #endif$_DATAPLANT
As stated above, here we use publicly available open-source and well-documented certified software#if$_PROPRIETARY except for $_PROPRIETARY#endif$_PROPRIETARY. Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. Besides this, it will be ensured that data that can be stored in international, discipline-related repositories using specialized technologies are deposited there: #if$_GENETIC #if$_GENBANK NCBI-GenBank,#endif$_GENBANK #if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS #endif$_GENETIC #if$_TRANSCRIPTOMIC|$_GENETIC #if$_SRA NCBI-SRA,#endif$_SRA #if$_GEO NCBI-GEO,#endif$_GEO #endif$_TRANSCRIPTOMIC|$_GENETIC #if$_TRANSCRIPTOMIC #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS #endif$_TRANSCRIPTOMIC #if$_IMAGE #if$_BIOIMAGE EBI-BioImage Archive,#endif$_BIOIMAGE #if$_IDR IDR (Image Data Resource),#endif$_IDR #endif$_IMAGE #if$_METABOLOMIC #if$_METABOLIGHTS EBI-MetaboLights,#endif$_METABOLIGHTS #if$_METAWORKBENCH Metabolomics Workbench,#endif$_METAWORKBENCH #if$_INTACT Intact (Molecular interactions),#endif$_INTACT #endif$_METABOLOMIC #if$_PROTEOMIC #if$_PRIDE EBI-PRIDE,#endif$_PRIDE #if$_PDB PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI Chebi (Chemical Entities of Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC #if$_PHENOTYPIC #if$_edal e!DAL-PGP (Plant Genomics & Phenomics Research Data Repository) #endif$_edal #endif$_PHENOTYPIC #if$_OTHEREP and $_OTHEREP will also be used to store data and the data will be processed there as well.#endif$_OTHEREP
Data will be made available for many years#if$_DATAPLANT and potentially indefinitely after the end of the project#endif$_DATAPLANT.
In any case, data submitted to international, discipline-related repositories that use specialized technologies (as detailed above), e.g. ENA/PRIDE, would be subject to local data storage regulations.
#if$_DATAPLANT In DataPLANT, data management relies on the Annotated Research Context (ARC). It is password protected, so before any data can be obtained or samples generated, an authentication needs to take place. #endif$_DATAPLANT
In case data is only shared within the consortium, because the data is not yet finished or still under IP checks, the data is hosted internally and a username and password are required (see also our GDPR rules). In case data is made public in final EU or US repositories, completely anonymous access is normally allowed. This is the case for ENA as well, and both are in line with GDPR requirements.
There will be no restrictions once the data is made public.
At the moment, we do not anticipate ethical or legal issues with data sharing. In terms of ethics, since this is plant data, there is no need for an ethics committee, however, diligence for plant resource benefit sharing is considered. #issuewarning you have to check here and enter any due diligence here at the moment we are awaiting if Nagoya (🡺see Nagoya protocol) gets also part of sequence information. In any case if you use material not from your (partner) country and characterize this physically e.g., metabolites, proteome, biochemically RNASeq etc. this might represent a Nagoya relevant action unless this is from e.g. US (non partner), Ireland (not signed still contact them) etc but other laws might apply…. #endissuewarning
The only personal data that will potentially be stored is the submitter name and affiliation in the metadata for data. In addition, personal data will be collected for dissemination and communication activities using specific methods and procedures developed by the $_PROJECT partners to adhere to data protection. #issuewarning you need to inform and better get WRITTEN consent that you store emails and names or even pseudonyms such as twitter handles, we are very sorry about these issues we didn’t invent them #endissuewarning
Once data is transferred to the $_PROJECT platform#if$_DATAPLANT and ARCs have been generated in DataPLANT#endif$_DATAPLANT, data security will be enforced. This comprises secure storage; passwords and usernames are generally transferred via separate secure media.
Open licenses, such as Creative Commons (CC), will be used whenever possible.
Whenever possible, data will be stored in common and openly defined formats including all the necessary metadata to interpret and analyze data in a biological context. By default, no proprietary formats will be used; however, Microsoft Excel files (according to ISO/IEC 29500-1:2016) might be used as intermediates by the consortium#if$_DATAPLANT and by some ARC components#endif$_DATAPLANT. In addition, text documents might be edited in word processors but will be shared as PDF.
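Where Excel intermediates are used, they can be exported to an open format before sharing; the sketch below assumes the pandas package (with an xlsx reader) is available, and the file and sheet names are placeholders.

    # Minimal sketch: export a Microsoft Excel intermediate to open CSV before sharing.
    from pathlib import Path
    import pandas as pd

    Path("shared").mkdir(exist_ok=True)
    frame = pd.read_excel("intermediate/measurements.xlsx", sheet_name=0)
    frame.to_csv("shared/measurements.csv", index=False)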
The data will be useful for the $_PROJECT partners, the scientific community working on $_STUDYOBJECT, and the general public interested in $_STUDYOBJECT. Hence, the $_PROJECT also strives to collect the data that has been disseminated and potentially to advertise it#if$_DATAPLANT, e.g. through the DataPLANT platform or other means,#endif$_DATAPLANT if it is not already included in a publication, which is the most likely form of dissemination.
By default, all data sets from the $_PROJECT will be shared with the community and made openly available. This will, however, happen only after partners have had the opportunity to check for IP protection (according to agreements and background rights). #if$_INDUSTRY This applies in particular to data pertaining to the industry. #endif$_INDUSTRY In addition, all partners strive for IP protection of data sets where applicable; this will be assessed and due diligence will be applied.
Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if relevant provisions are made in the consortium agreement and are in line with the reasons for opting out.
#if$_DATAPLANT As the $_PROJECT is closely aligned with DataPLANT, the ARC converter and DataHUB will be used to find the end-point repositories and to upload the data to them automatically. #endif$_DATAPLANT
Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. Besides this, it will be ensured that data that can be stored in international, discipline-related repositories using specialized technologies are deposited there:
#if$_GENETIC For genetic data: #if$_GENBANK NCBI-GenBank,#endif$_GENBANK
#if$_SRA NCBI-SRA,#endif$_SRA #if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS
EBI-ArrayExpress,#endif$_ARRAYEXPRESS #if$_GEO NCBI-GEO,#endif$_GEO #endif$_GENETIC
#if$_TRANSCRIPTOMIC For Transcriptomic data: #if$_SRA NCBI-SRA,#endif$_SRA
#if$_GEO NCBI-GEO,#endif$_GEO #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS
#endif$_TRANSCRIPTOMIC
#if$_IMAGE For image data: #if$_BIOIMAGE EBI-BioImage Archive,#endif$_BIOIMAGE
#if$_IDR IDR (Image Data Resource),#endif$_IDR #endif$_IMAGE
#if$_METABOLOMIC For metabolomic data: #if$_METABOLIGHTS
EBI-MetaboLights,#endif$_METABOLIGHTS #if$_METAWORKBENCH Metabolomics
Workbench,#endif$_METAWORKBENCH #if$_INTACT Intact (Molecular interactions),#endif$_INTACT
#endif$_METABOLOMIC
#if$_PROTEOMIC For proteomics data: #if$_PRIDE EBI-PRIDE,#endif$_PRIDE #if$_PDB
PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI Chebi (Chemical Entities of
Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC
#if$_PHENOTYPIC For phenotypic data: #if$_edal e!DAL-PGP (Plant Genomics &
Phenomics Research Data Repository) #endif$_edal #endif$_PHENOTYPIC
Submission is free of charge, and it is the goal (at least of ENA) to obtain as much data as possible. Therefore, special arrangements are neither necessary nor useful, and catch-all repositories are not required. #if$_DATAPLANT For DataPLANT, this has been agreed upon. #endif$_DATAPLANT #issuewarning If no data management platform such as DataPLANT is used, you need to find an appropriate repository to store or archive your data after publication. #endissuewarning
There are no restrictions, beyond the aforementioned IP checks, which are in line with e.g. European open data policies.
The $_PARTNERS decide on the preservation of data not submitted to end-point subject-area repositories #if$_DATAPLANT or ARCs in DataPLANT#endif$_DATAPLANT after the project ends. This will be in line with EU and institute policies and with data sharing based on EU and international standards.
#if$_early Some raw data is made public as soon as it is collected and processed.#endif$_early #if$_beforepublication Relevant processed datasets are made public when the research findings are published.#endif$_beforepublication #if$_endofproject At the end of the project, all data without embargo period will be published.#endif$_endofproject #if$_embargo Data, which is subject to an embargo period, is not publicly accessible until the end of embargo period.#endif$_embargo #if$_request Data is made available upon request, allowing controlled sharing while ensuring responsible use.#endif$_request #if$_ipissue IP issues will be checked before publication. #endif$_ipissue All consortium partners will be encouraged to make data available before publication, openly and/or under pre-publication agreements #if$_GENOMIC such as those started in Fort Lauderdale and set forth by the Toronto International Data Release Workshop. #endif$_GENOMIC This will be implemented as soon as IP-related checks are complete.
The responsible person will be $_DATAOFFICER as Data Officer. The data responsible(s) (data officer#if$_PARTNERS or $_PARTNERS#endif$_PARTNERS) decide on the preservation of data not submitted to end-point subject-area repositories #if$_DATAPLANT or ARCs in DataPLANT #endif$_DATAPLANT after the project ends. This will be in line with EU and institute policies, and with data sharing based on EU and international standards.
The costs comprise data curation, #if$_DATAPLANT ARC consistency checks, #endif$_DATAPLANT and maintenance on the $_PROJECT´s side.
Additionally, last-level costs for storage are incurred by the end-point repositories (e.g. ENA); these are covered by the operating budgets of those repositories rather than charged to the $_PROJECT or its members.
A large part of the cost is covered by the $_PROJECT #if$_DATAPLANT and the structures, tools and knowledge laid down in the DataPLANT consortium. #endif$_DATAPLANT
As applicable, $_DATAOFFICER, who is responsible for ongoing data maintenance, will also take care of it after the end of the $_PROJECT. #if$_DATAPLANT External data archives such as DataPLANT may provide such services in some cases. #endif$_DATAPLANT
#if$_DATAPLANT
ARC Annotated Research Context
#endif$_DATAPLANT
CC Creative Commons
CC REL Creative Commons Rights Expression Language
DDBJ DNA Data Bank of Japan
DMP Data Management Plan
DoA Description of Action
DOI Digital Object Identifier
EBI European Bioinformatics Institute
ENA European Nucleotide Archive
EU European Union
FAIR Findable, Accessible, Interoperable, Reusable
GDPR General Data Protection Regulation (of the EU)
IP Intellectual Property
ISO International Organization for Standardization
MIAMET Minimal Information about Metabolite experiment
MIAPPE Minimal Information about Plant Phenotyping Experiment
MinSEQe Minimum Information about a high-throughput Sequencing Experiment
NCBI National Center for Biotechnology Information
NFDI National Research Data Infrastructure (of Germany)
NGS Next Generation Sequencing
RDM Research Data Management
RNASeq RNA Sequencing
SOP Standard Operating Procedures
SRA Sequence Read Archive
#if$_DATAPLANT
SWATE Swate Workflow Annotation Tool for Excel
#endif$_DATAPLANT
ONP Oxford Nanopore
qRTPCR quantitative real time polymerase chain reaction
WP Work Package
This practical guide to data management in the $_PROJECT should be considered a minimum description, leaving flexibility to include additional domain-specific actions or to comply with national or local legislation.#if$_EU The $_PROJECT will follow the EU FAIR principles. #endif$_EU
The practical guide to data management in the $_PROJECT aims at providing a complete walkthrough for the researcher. The contents are customized based on the user input in the Data Management Plan Generator (DMPG). The practices in this guide are customized to fit the related legal, ethical, standardization and funding body requirements. The suitable practices cover all steps of the data management life cycle:
Data acquisition:
Data generation
Data should be generated by devices whose output is compatible with open formats. The $_STUDYOBJECT should be compliant with biodiversity protocols. The protocols used to collect $_PHENOTYPIC, $_GENETIC, $_GENOMIC, $_METABOLOMIC, $_RNASEQ data about $_STUDYOBJECT will be stored#if$_DATAPLANT in the assays folder of ARC repositories.#endif$_DATAPLANT#if!$_DATAPLANT in a FAIR data store. #endif!$_DATAPLANT
Data collection
The data collection process is conducted by experimental scientists and stewarded by $_DATAOFFICER.#if$_DATAPLANT An electronic lab notebook will be used to ensure that enough metadata is recorded and to guarantee that the data can be reused.#endif$_DATAPLANT
Data Organization
The data organization process is conducted by $_DATAOFFICER. The detailed organization method and procedure are reported to the PIs. #if$_DATAPLANT The data organization will profit from the knowledge base and database of DataPLANT; Elasticsearch will be used to find better ways to organize the data. #endif$_DATAPLANT
Annotation
Workflow documentation
The data collection process is conducted by experimental scientists and stewarded by $_DATAOFFICER.#if$_DATAPLANT An electronic lab notebook is used to ensure that enough metadata is recorded and to guarantee that the data can be reused. The workflow can be retrieved from the electronic lab notebook using the toolkits provided by DataPLANT, such as SWATE and the ARC commander. #endif$_DATAPLANT
Metadata completion
In case some metadata is still missing from the documentation provided by the experimental scientists and the data officer, it will be completed. #if$_DATAPLANT Raw data identifiers and parsers provided by DataPLANT will be used to extract metadata directly from the raw data files. The metadata collected from the raw data files can also be used to validate the previously collected metadata in case there are any mistakes. #endif$_DATAPLANT We foresee using #if$_RNASEQ|$_GENOMIC e.g. #if$_MINSEQE MinSEQe for sequencing data and #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC Metabolights-compatible forms for metabolites as well as MIAPPE for phenotyping-like data. The latter will thus allow the integration of data across projects and safeguard the reuse of established and tested protocols. Additionally, we will use ontology terms to enrich the data sets, relying on free and open ontologies. Further ontology terms might be created and canonized during the $_PROJECT.
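As an illustration (file names and the expected instrument identifier are hypothetical), missing or doubtful metadata can be cross-checked against information parsed directly from a raw data file, e.g. the instrument field of a FASTQ header:

    # Minimal sketch: validate recorded metadata against values parsed from a raw file.
    import gzip

    recorded = {"instrument": "M00123", "sample_id": "sample1"}

    with gzip.open("raw/sample1.fastq.gz", "rt") as handle:
        header = handle.readline().strip()  # e.g. "@M00123:45:000000000-ABCDE:1:1101:..."
    instrument_from_file = header.lstrip("@").split(":")[0]

    if instrument_from_file != recorded["instrument"]:
        print("Metadata mismatch:", instrument_from_file, "vs", recorded["instrument"])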
Maintenance:
Data storage
Raw data collected in previous steps are stored immediately using#if$_DATAPLANT the infrastructure of DataPLANT, where the ARC (Annotated Research Context) is used as a container to store the raw data as well as the metadata and workflows.#endif$_DATAPLANT #if!$_DATAPLANT a secure infrastructure.#endif!$_DATAPLANT
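A minimal sketch of such a container layout is shown below; the folder names follow the commonly used ARC top-level structure (studies, assays, workflows, runs) but should be checked against the current ARC specification rather than taken as definitive.

    # Minimal sketch: lay out an ARC-like directory skeleton that keeps raw data,
    # metadata and workflows together.
    from pathlib import Path

    arc_root = Path("my-arc")
    for sub in ("studies", "assays", "workflows", "runs"):
        (arc_root / sub).mkdir(parents=True, exist_ok=True)
    (arc_root / "README.md").touch()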
Data curation
#if$_DATAPLANT Data stored in the ARC is curated regularly whenever updates or revisions are needed.#endif$_DATAPLANT #if!$_DATAPLANT Data is curated regularly whenever updates or revisions are needed.#endif!$_DATAPLANT
Publication and sharing
Data publishing
Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. Besides this, it will be ensured that data that can be stored in international, discipline-related repositories using specialized technologies are deposited there: #if$_GENETIC #if$_GENBANK NCBI-GenBank,#endif$_GENBANK
#if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS
#endif$_GENETIC #if$_TRANSCRIPTOMIC|$_GENETIC #if$_SRA NCBI-SRA,#endif$_SRA #if$_GEO
NCBI-GEO,#endif$_GEO #endif$_TRANSCRIPTOMIC|$_GENETIC #if$_TRANSCRIPTOMIC #if$_ARRAYEXPRESS
EBI-ArrayExpress,#endif$_ARRAYEXPRESS #endif$_TRANSCRIPTOMIC #if$_IMAGE #if$_BIOIMAGE
EBI-BioImage Archive,#endif$_BIOIMAGE #if$_IDR IDR (Image Data Resource),#endif$_IDR
#endif$_IMAGE #if$_METABOLOMIC #if$_METABOLIGHTS EBI-MetaboLights,#endif$_METABOLIGHTS
#if$_METAWORKBENCH Metabolomics Workbench,#endif$_METAWORKBENCH #if$_INTACT Intact
(Molecular interactions),#endif$_INTACT #endif$_METABOLOMIC #if$_PROTEOMIC #if$_PRIDE
EBI-PRIDE,#endif$_PRIDE #if$_PDB PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI
Chebi (Chemical Entities of Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC #if$_PHENOTYPIC #if$_edal e!DAL-PGP (Plant Genomics & Phenomics Research Data Repository)
#endif$_edal #endif$_PHENOTYPIC
#if$_OTHEREP and $_OTHEREP will also be used to store data and the data will be
processed there as well.#endif$_OTHEREP
Data sharing
In case data is only shared within the consortium, because the data is not yet finished or still under IP checks, the data is hosted internally and a username and password are required (see also our GDPR rules). In case data is made public in final EU or US repositories, completely anonymous access is normally allowed. This is the case for ENA as well, and both are in line with GDPR requirements.
#if$_GENOMIC
#endif$_GENOMIC
#if$_RNASEQ
#endif$_RNASEQ
#if$_METABOLOMIC
#endif$_METABOLOMIC
#if$_PROTEOMIC
#endif$_PROTEOMIC
Project name: $_PROJECT
Research funder: Bundesministerium für Bildung und Forschung (Federal Ministry of Education and Research)
Funding programme: $_FUNDINGPROGRAMME
Funding reference (FKZ): $_DMPVERSION
Project coordinator: $_USERNAME
Contact person for data management: $_DATAOFFICER
Contact: $_EMAIL
Project description:
The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION#if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary#endif$_DATAPLANT#if!$_DATAPLANT through a standardized data management process is absolutely necessary#endif!$_DATAPLANT because the data are used not only to understand principles; stakeholders must also be informed about the provenance of the data and of its analysis. It is therefore necessary to ensure that the data are well generated and well annotated with metadata using open standards, as laid out in the next section.
The $_PROJECT will collect and/or generate the following types of raw data: $_PHENOTYPIC, $_GENETIC, $_IMAGE, $_RNASEQ, $_GENOMIC, $_METABOLOMIC, $_PROTEOMIC, $_TARGETED, $_MODELS, $_CODE, $_EXCEL, $_CLONED-DNA data related to $_STUDYOBJECT. In addition, the raw data will also be processed and modified using analytical pipelines, which may yield different results or include ad hoc data analysis parts. #if$_DATAPLANT These pipelines will be tracked in the DataPLANT ARC.#endif$_DATAPLANT Therefore, care will be taken to document and archive these resources (including the analytical pipelines) as well#if$_DATAPLANT, relying on the expertise in the DataPLANT consortium#endif$_DATAPLANT.
Creation date: $_CREATIONDATE
Modification date: $_MODIFICATIONDATE
Requirements to be observed:
#if$_EU The $_PROJECT is part of the Open Data Initiative (ODI) of the EU. #endif$_EU To best profit from open data, it is necessary not only to store the data but also to make them Findable, Accessible, Interoperable, and Reusable (FAIR). #if$_PROTECT We support open and FAIR data, but we also consider the need to protect individual data sets. #endif$_PROTECT
#if$_DATAPLANT By implementing DataPLANT, researchers can ensure that all relevant guidelines and requirements related to data management are met, leading to higher quality and reliability of the research data. #endif$_DATAPLANT
Data acquisition
Public data will be extracted as described in the previous paragraph. For the $_PROJECT, specific data sets will be generated by the consortium partners.
Data of different types or from different domains will be generated using distinct approaches. For example:
Genetic data will be generated through crosses and breeding experiments and comprise recombination frequencies and crossover events, which can position genetic markers and quantitative trait loci that can be associated with physical genomic markers/variants.
Genomic data will be created from sequence data, which are processed to identify genes, regulatory elements, transposable elements and physical markers such as SNPs, microsatellites and structural variants.
The origin and assembly of cloned DNA comprise (a) the source of the original vector sequence, with an AddGene reference where available, and the source of the insert DNA (e.g. amplification by PCR from a specific sample or from an existing library), (b) the cloning strategy (e.g. restriction endonuclease digestion/ligation, PCR, TOPO cloning, Gibson assembly, LR recombination), and (c) the verified DNA sequence of the final recombinant vector.
Methods of transcriptomics data collection will be selected from microarrays, quantitative PCR, Northern blotting, RNA immunoprecipitation and fluorescence in situ hybridization. RNA-Seq data will be collected using separate methods.
RNA sequencing will be generated using short-read or long-read platforms, either in house or outsourced to academic facilities or commercial services, and the raw data will be processed using established bioinformatics pipelines.
Metabolomic data will be generated by chromatography coupled to mass spectrometry using targeted or untargeted approaches.
Proteomic data will be generated by chromatography coupled to mass spectrometry for the analysis of protein abundance and identification, as well as by additional techniques for structural analysis, identification of post-translational modifications, and characterization of protein interactions.
Phenotypic data will be generated using phenotyping platforms and corresponding ontologies, including number/size of organs such as leaves, flowers, buds, etc., size of the whole plant, shoot/root architecture (number of lateral branches/roots, etc.), organ structures/morphologies, quantitative metrics such as colour, turgor, health/nutrient indicators, and others.
Targeted assay data (e.g. glucose and fructose concentrations or production/consumption rates) will be generated using specific equipment and methods that are fully documented in the lab book.
Image data will be generated by devices such as cameras, scanners and microscopes in combination with software. Original images containing metadata such as EXIF photo information will be archived.
Model data will be generated by software simulations. The complete workflow, including the environment, runtime, parameters and results, will be documented and archived.
Computer code will be written by programmers.
Excel spreadsheets will be created by filling in specific files containing field observations or other digital records.
Data from previous projects such as $_PREVIOUSPROJECTS will be taken into account. We expect to generate $_RAWDATA GB of raw data and up to $_DERIVEDDATA GB of processed data.
#if$_GENETIC
#if$_PREVIOUSPROJECTS
Data storage:
#if$_DATAPLANT In DataPLANT, data storage relies on the Annotated Research Context (ARC). It is password protected, so before any data can be obtained or samples generated, an authentication needs to take place. #endif$_DATAPLANT
Online platforms will be protected by vulnerability scanning, two-factor authentication and daily automatic backups allowing immediate recovery. All partners holding confidential project data use secure platforms with automatic backups and secure offsite copies. #if$_DATAPLANT Once DataHUB and ARCs have been generated in DataPLANT, data security will be enforced. This comprises secure storage; passwords and usernames are generally transferred via separate secure media. #endif$_DATAPLANT
The $_PROJECT bears the costs of data curation, #if$_DATAPLANT ARC consistency checks, #endif$_DATAPLANT and data maintenance/security before transfer to public repositories. Subsequent costs are then borne by the operators of these repositories.
Additionally, costs for post-publication storage are incurred by the end-point repositories (e.g. ENA); these are covered by the operating budgets of those repositories rather than by the $_PROJECT or its members.
#if$_GENETIC For genetic data: #if$_GENBANK NCBI-GenBank,#endif$_GENBANK #if$_SRA NCBI-SRA,#endif$_SRA #if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS #if$_GEO NCBI-GEO,#endif$_GEO #endif$_GENETIC
#if$_TRANSCRIPTOMIC For transcriptomic data: #if$_SRA NCBI-SRA,#endif$_SRA #if$_GEO NCBI-GEO,#endif$_GEO #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS #endif$_TRANSCRIPTOMIC
#if$_IMAGE For image data: #if$_BIOIMAGE EBI-BioImage Archive,#endif$_BIOIMAGE #if$_IDR IDR (Image Data Resource),#endif$_IDR #endif$_IMAGE
#if$_METABOLOMIC For metabolomic data: #if$_METABOLIGHTS EBI-MetaboLights,#endif$_METABOLIGHTS #if$_METAWORKBENCH Metabolomics Workbench,#endif$_METAWORKBENCH #if$_INTACT Intact (Molecular interactions),#endif$_INTACT #endif$_METABOLOMIC
#if$_PROTEOMIC For proteomics data: #if$_PRIDE EBI-PRIDE,#endif$_PRIDE #if$_PDB PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI Chebi (Chemical Entities of Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC
#if$_PHENOTYPIC For phenotypic data: #if$_edal e!DAL-PGP (Plant Genomics & Phenomics Research Data Repository) #endif$_edal #endif$_PHENOTYPIC
File naming will follow this standard:
Data variables will be allocated standard names. For example, genes, proteins and metabolites will be named according to approved nomenclature and conventions. These will also be linked to functional ontologies where possible. Data sets will also be named in a meaningful way to ensure readability by humans. Plant names will include traditional names, binomials, and all strain/cultivar/subspecies/variety identifiers.
Data documentation
We use the Investigation, Study, Assay (ISA) specification for metadata creation. #if$_RNASEQ|$_GENOMIC For specific data (e.g. RNASeq or genomic data) we use metadata templates of the end-point repositories. #if$_MINSEQE The Minimum Information about a high-throughput SEQuencing Experiment (MinSEQe) will also be used. #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC The following metadata/minimum information standards will be used to collect metadata: #if$_GENOMIC|$_GENETIC #if$_MIXS MIxS (Minimum Information about any (X) Sequence),#endif$_MIXS #if$_MIGSEU MigsEu (Minimum Information about a Genome Sequence: Eukaryote),#endif$_MIGSEU #if$_MIGSORG MigsOrg (Minimum Information about a Genome Sequence: Organelle),#endif$_MIGSORG #if$_MIMS MIMS (Minimum Information about a Metagenome or Environmental Sequence),#endif$_MIMS #if$_MIMARKSSPECIMEN MIMARKSSpecimen (Minimal Information about a Marker Specimen: Specimen),#endif$_MIMARKSSPECIMEN #if$_MIMARKSSURVEY MIMARKSSurvey (Minimal Information about a Marker Specimen: Survey),#endif$_MIMARKSSURVEY #if$_MISAG MISAG (Minimum Information about a Single Amplified Genome),#endif$_MISAG #if$_MIMAG MIMAG (Minimum Information about a Metagenome-Assembled Genome),#endif$_MIMAG #endif$_GENOMIC|$_GENETIC #if$_TRANSCRIPTOMIC #if$_MINSEQE MINSEQE (Minimum Information about a high-throughput SEQuencing Experiment),#endif$_MINSEQE #endif$_TRANSCRIPTOMIC #if$_TRANSCRIPTOMIC #if$_MIAME MIAME (Minimum Information About a Microarray Experiment),#endif$_MIAME #endif$_TRANSCRIPTOMIC #if$_IMAGE #if$_REMBI REMBI (Recommended Metadata for Biological Images),#endif$_REMBI #endif$_IMAGE #if$_PROTEOMIC #if$_MIAPE MIAPE (Minimum Information About a Proteomics Experiment),#endif$_MIAPE #if$_MIMIX MIMix (Minimum Information about a Molecular Interaction experiment),#endif$_MIMIX #endif$_PROTEOMIC #if$_METABOLOMIC #if$_METABOLIGHTS Metabolights submission-compliant standards will be used for metabolomic data where this is accepted by the consortium partners.#issuewarning Some metabolomics partners do not regard Metabolights as an accepted standard.#endissuewarning #endif$_METABOLIGHTS #endif$_METABOLOMIC As part of the plant research community, we use #if$_MIAPPE MIAPPE for phenotyping data in the broadest sense, but will also rely on #endif$_MIAPPE specific SOPs for additional annotations#if$_DATAPLANT, taking advanced DataPLANT annotations and ontologies into account. #endif$_DATAPLANT
In case some metadata is still missing, it will be documented by the experimental scientists and the data officer. #if$_DATAPLANT Raw data identifiers and parsers provided by DataPLANT will be used to extract metadata directly from the raw data files. The metadata collected from the raw data files can also be used to validate the previously collected metadata in case there are any mistakes. #endif$_DATAPLANT We foresee using #if$_RNASEQ|$_GENOMIC e.g. #if$_MINSEQE MinSEQe for sequencing data and #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC Metabolights-compatible forms for metabolites as well as MIAPPE for phenotyping-like data. The latter will allow the integration of data across projects and safeguard the reuse of established and tested protocols. Additionally, we will use ontology terms to enrich the data sets, relying on free and open ontologies. Further ontology terms might be created and canonized during the $_PROJECT.
Legitimacy
At the moment, we do not anticipate ethical or legal issues with data sharing. In terms of ethics, since this is plant data, there is no need for an ethics committee; however, due diligence regarding plant resource benefit sharing is considered. #issuewarning You have to check and enter any due-diligence measures here; at the moment it is still open whether the Nagoya Protocol (🡺 see Nagoya Protocol) will also cover sequence information. In any case, if you use material that does not originate from your (partner) country and characterize it physically, e.g. metabolites, proteome, or biochemically via RNASeq, this might represent a Nagoya-relevant action unless the material comes from, e.g., the US (non-party) or Ireland (not signed; still contact them), but other laws might apply. #endissuewarning
The only personal data that will potentially be stored is the submitter's name and affiliation in the metadata of the data. In addition, personal data will be collected for dissemination and communication activities using specific methods and procedures developed by the $_PROJECT partners to adhere to data protection. #issuewarning You need to inform the persons concerned and preferably obtain WRITTEN consent before storing e-mails and names or even pseudonyms such as Twitter handles; we are very sorry about these issues, we did not invent them. #endissuewarning
Data Sharing
If data are shared only within the consortium, e.g. while the data are not yet final or are
under IP review, the data will be hosted internally and a username and password will be
required (see also our GDPR rules). Once data are made public in the final EU or US
repositories, fully anonymous access is usually permitted; this is also the case for ENA,
and both comply with the GDPR requirements. There will be no restrictions once the data
have been made public.
#if$_early Some raw data will be made public immediately after acquisition and
processing.#endif$_early #if$_beforepublication Relevant processed data sets will be made
public when the research results are published.#endif$_beforepublication #if$_endofproject
At the end of the project, all data will be released without an embargo
period.#endif$_endofproject #if$_embargo Data subject to an embargo will not be publicly
accessible until the embargo period ends.#endif$_embargo #if$_request Data will be made
available upon request, which allows controlled sharing while ensuring responsible
use.#endif$_request #if$_ipissue IP issues will be reviewed before release. #endif$_ipissue
All consortium partners are encouraged to make data accessible prior to publication, openly
and/or under pre-publication agreements#if$_GENOMIC, such as those initiated in Fort
Lauderdale and laid down by the Toronto International Data Release Workshop#endif$_GENOMIC.
This will be implemented as soon as the IP-related reviews are complete.
The data will initially benefit the $_PROJECT partners, but also selected stakeholders
closely involved in the project, and subsequently the scientific community working on
$_STUDYOBJECT. $_DATAUTILITY In addition, members of the general public interested in
$_STUDYOBJECT can use the data after their release. The data will be disseminated according
to the dissemination and communication plan of the $_PROJECT#if$_DATAPLANT, which is
coordinated with the DataPLANT platform or other means#endif$_DATAPLANT.
Data preservation
We expect to generate raw data in the range of $_RAWDATA GB. The size of the derived data
will be approximately $_DERIVEDDATA GB.
#if$_DATAPLANT Since the $_PROJECT is closely aligned with DataPLANT, the ARC converter and
DataHUB will be used to identify the endpoint repositories and to upload the data to these
repositories automatically. #endif$_DATAPLANT
The data will be made available via the $_PROJECT platform using a user-friendly front end
that allows data visualization. In addition, it will be ensured that the data can be stored
in international, discipline-specific repositories that use specialized technologies. The
endpoint repositories are:
#if$_GENETIC #if$_GENBANK NCBI-GenBank,#endif$_GENBANK
#if$_ENA EBI-ENA,#endif$_ENA #if$_ARRAYEXPRESS EBI-ArrayExpress,#endif$_ARRAYEXPRESS
#endif$_GENETIC #if$_TRANSCRIPTOMIC|$_GENETIC #if$_SRA NCBI-SRA,#endif$_SRA #if$_GEO
NCBI-GEO,#endif$_GEO #endif$_TRANSCRIPTOMIC|$_GENETIC #if$_TRANSCRIPTOMIC #if$_ARRAYEXPRESS
EBI-ArrayExpress,#endif$_ARRAYEXPRESS #endif$_TRANSCRIPTOMIC #if$_IMAGE #if$_BIOIMAGE
EBI-BioImage Archive,#endif$_BIOIMAGE #if$_IDR IDR (Image Data Resource),#endif$_IDR
#endif$_IMAGE #if$_METABOLOMIC #if$_METABOLIGHTS EBI-MetaboLights,#endif$_METABOLIGHTS
#if$_METAWORKBENCH Metabolomics Workbench,#endif$_METAWORKBENCH #if$_INTACT IntAct
(Molecular interactions),#endif$_INTACT #endif$_METABOLOMIC #if$_PROTEOMIC #if$_PRIDE
EBI-PRIDE,#endif$_PRIDE #if$_PDB PDB (Protein Data Bank archive),#endif$_PDB #if$_CHEBI
ChEBI (Chemical Entities of Biological Interest),#endif$_CHEBI #endif$_PROTEOMIC #if$_PHENOTYPIC #if$_edal e!DAL-PGP (Plant Genomics & Phenomics Research Data Repository)
#endif$_edal #endif$_PHENOTYPIC
#if$_OTHEREP and $_OTHEREP will also be used to store data, and the data will also be
processed there.#endif$_OTHEREP
Submission is free of charge, and the aim (at least of ENA) is to receive as much data as
possible; therefore, prior arrangements are neither necessary nor useful. Catch-all
repositories are not required.
#if$_DATAPLANT For DataPLANT this has been agreed. #endif$_DATAPLANT #issuewarning If no
data management platform such as DataPLANT is used, you must identify a suitable repository
to store or archive your data after publication. #endissuewarning