Description
This document specifies the information required to make a PFR-generated genome assembly fit for purpose, so that downstream analysts can easily understand the source and meaning of the files available to them. The principal aims are better collaboration, easier publication and fewer errors through improved documentation and specification.
What does "fit for purpose" mean?
In this document, "fit for purpose" means that a downstream analyst, scientist or reviewer must be able to use the presented data within a reasonable timeframe and with an acceptable error model. Errors often creep in through misunderstandings, which are greatly reduced when all parties know what is expected.
What Does Good Look Like?
- Release process automated
- Standards followed
- Issues can be raised and captured to enable an iterative, collaborative process
- Additional effort not required for publication
Roles and Responsibilities
- SGL: Approval
- Assembly Steward / Gatekeeper: Quality assurance
- Bioinformatician: Data generation
Publication Levels
A release is the publication of a genome assembly for analytical use. Any publication that uses a genome assembly must refer to a specific release version. Releases may have different degrees of public access. These are defined as:
- PFR private (e.g. customer relationship)
- PFR internal (e.g. pre-releases or species that don't have publication approval)
- Open access
Naming Convention for a Release
Directories to store data in
Breeding plants:
- Full name: <species[_cultivar]>/<assembly_build>
Bacteria:
- Full name: <species>_<strain>/<assembly_build>
Fungi:
- Full name: <species>_<strain>/<assembly_build>
Vertebrate:
- Full name: <species>_<population of origin>/<assembly_build>
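The directory patterns above can be sketched as a small helper. This is a minimal illustration, not a PFR tool: the function name `release_dir`, the group labels and the `v2.1` build tag are assumptions made for the example, and the only rule it encodes is that the second name component is optional for breeding plants and required for the other groups.

```python
from pathlib import PurePosixPath

# Organism groups from the naming convention. The short labels are
# hypothetical keys chosen for this sketch. The value records whether the
# second name component (the placeholder after the species) is required.
SECOND_COMPONENT_REQUIRED = {
    "plant": False,      # <species[_cultivar]>
    "bacteria": True,
    "fungi": True,
    "vertebrate": True,  # <species>_<population of origin>
}


def release_dir(group, species, assembly_build, qualifier=None):
    """Return the relative release directory <name>/<assembly_build>."""
    if group not in SECOND_COMPONENT_REQUIRED:
        raise ValueError(f"unknown organism group: {group!r}")
    if qualifier is None and SECOND_COMPONENT_REQUIRED[group]:
        raise ValueError(f"{group} releases need a second name component")
    parts = [species] if qualifier is None else [species, qualifier]
    # Spaces are replaced so that the name is a single path component.
    name = "_".join(p.strip().replace(" ", "_") for p in parts)
    return PurePosixPath(name) / assembly_build


print(release_dir("plant", "Actinidia chinensis", "v2.1", "Russel"))
```

Returning a `PurePosixPath` keeps the sketch independent of any real file system, so it can be used to check names before a directory is created.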
Pre-Releases
Pre-releases follow the same naming standard. The assembly build tag for a pre-release is 'pre', and it can be combined with a version number (e.g. Actinidia_chinensis_Russel_v2.1_pre). A pre-release may contain only a subset of the data files of a full release.
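A pre-release label of this kind could be recognised with a regular expression. The sketch below assumes the layout seen in the single example (`<name>_v<version>_pre`, with the version part optional); the pattern and the function name `parse_pre_release` are illustrative assumptions, not part of the standard.

```python
import re

# Assumed layout, inferred from the one example in the text:
#   <name>[_v<version>]_pre
PRE_RELEASE = re.compile(
    r"^(?P<name>.+?)(?:_v(?P<version>\d+(?:\.\d+)*))?_pre$"
)


def parse_pre_release(label):
    """Return (name, version) for a pre-release label, or None otherwise.

    version is None when the label carries no version number.
    """
    match = PRE_RELEASE.match(label)
    if match is None:
        return None
    return match.group("name"), match.group("version")


print(parse_pre_release("Actinidia_chinensis_Russel_v2.1_pre"))
```

Labels without the trailing `pre` tag return `None`, which makes the function usable as a simple pre-release test.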
Release Platforms
- Ensembl (local/public for genome representation and visualisation/analyses)
- WebApollo (for user feedback on gene annotations)
- Directory structure on powerPlant (/output for assembly and /input for raw data used for the assembly)
- In-house BLAST server
- Gene models in BioView databases
- IGV (optional)
- Geneious archive (optional)
Genome Assembly Data
A release will comprise the following data:
- Assembly Sequences (contigs, scaffolds, super scaffolds/chromosomes)
- Assembly types (Genome, Transcriptome, Transcripts)
- Link to raw sequence data used for generating the assembly
- Features/Annotations
- Coordinate translations (where available/existing)
- Patches (where available/existing)
- Alignments (where available/existing)
- Indices (bwa, samtools, ...)
- Registry / metadata, including the naming of the assembly according to the terminology published by Chain et al. in "Genome Project Standards in a New Era of Sequencing"
- Release history? (assembly_build and release date)
- Metrics summaries, stored as files in output as well as in Git: assembly metrics from Assemblathon2's assemblathon_stats.pl in preference to faStats from cndsrc; assessment of correctness and completeness (DNASeq mapback rates, RNASeq mapback rates, BUSCO); accuracy check with REAPR, or possibly also Mark Fiers' Hagfish boxplots of the largest MPE library for chromosomes; contamination check using Kraken, Kaiju or TUIT
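Contiguity summaries produced by assembly statistics scripts typically include N50. As a point of reference for readers of the metrics files, here is a minimal sketch of how N50 is computed from a list of sequence lengths. It is not a replacement for the tools named above, and the function name `n50` is chosen for this example.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Total is 290; the two longest sequences (80 + 70 = 150) pass the halfway mark.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```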
Data Files
The data for an assembly will be stored in the following data file formats:
- FASTA: contigs, scaffolds, superscaffolds/chromosomes (bgzip compressed)
- FASTQ: raw read files (bgzip compressed)
- BAM: alignments and the corresponding index files
- AGP: coordinate system translation
- GFF3: features, liftovers
- YAML: data registry metadata file (see the data-registry doc)
- TAB: patches
- Common indices (bwa, bowtie2, blast, samtools)
- Patch files, patch scripts
- BSgenome packages (see the forge script)
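A release directory could be checked against the formats listed above with a simple suffix lookup. The sketch below is illustrative only: the file suffixes are assumptions (the standard names the formats, not the extensions), and `classify` and `summarise` are hypothetical helper names.

```python
from pathlib import PurePosixPath

# Assumed suffix-to-format mapping; adjust to the extensions actually used.
FORMAT_BY_SUFFIX = {
    ".fa.gz": "FASTA",
    ".fasta.gz": "FASTA",
    ".fq.gz": "FASTQ",
    ".fastq.gz": "FASTQ",
    ".bam": "BAM",
    ".bai": "BAM index",
    ".agp": "AGP",
    ".gff3": "GFF3",
    ".yaml": "YAML",
    ".yml": "YAML",
    ".tab": "TAB",
}


def classify(filename):
    """Return the format label for a file name, or None if unrecognised."""
    name = PurePosixPath(filename).name.lower()
    for suffix, label in FORMAT_BY_SUFFIX.items():
        if name.endswith(suffix):
            return label
    return None


def summarise(filenames):
    """Group recognised file names by format label."""
    found = {}
    for filename in filenames:
        label = classify(filename)
        if label is not None:
            found.setdefault(label, []).append(filename)
    return found


print(summarise(["scaffolds.fa.gz", "reads.bam", "reads.bam.bai", "notes.txt"]))
```

The summary can then be compared with whatever set of formats a release is expected to contain.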
The minimum requirement for data files is: