- Assembly Sequences (contigs, scaffolds, superscaffolds/chromosomes)
- Assembly types (Genome, Transcriptome, Transcripts)
- Link to Raw sequence data used for generating the assembly
- Features/Annotations
- Coordinate translations (where available/existing)
- Patches (where available/existing)
- Alignments (where available/existing)
- Indices (bwa, samtools, ...)
- Registry / metadata, including the naming of the assembly according to the terminology published by Chain et al. in "Genome Project Standards in a New Era of Sequencing"
- Release history? (assembly_build and release date)
- Metrics summaries as files in the output as well as in Git: assembly metrics from Assemblathon2's assemblathon_stats.pl in preference to faStats from cndsrc; assessment of correctness and completeness (DNASeq mapback rates, RNASeq mapback rates, BUSCO); accuracy check with REAPR, or possibly Mark Fiers' Hagfish boxplots of the largest MPE library for chromosomes; contamination check using Kraken, Kaiju or TUIT
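To illustrate the kind of number a metrics summary contains, here is a minimal Python sketch that computes N50 from sequence lengths. It is not a replacement for assemblathon_stats.pl and does not reproduce its output; the function names `fasta_lengths` and `n50` are hypothetical and chosen only for this example.

```python
def fasta_lengths(lines):
    """Collect one sequence length per FASTA record from an iterable of lines."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the length L such that sequences of length >= L cover at least
    half of the total assembly length. Returns 0 for an empty input."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


if __name__ == "__main__":
    example = [">contig1", "ACGTAC", ">contig2", "GG"]
    print(n50(fasta_lengths(example)))
```

For a real assembly, the lines would come from an opened (and, if necessary, decompressed) FASTA file rather than from an in-memory list.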
Data Files
The data for an assembly will be stored in the following data file formats:
- FASTA: contigs, scaffolds, superscaffolds/chromosomes (compressed)
- FASTQ: raw read files (compressed)
- BAM: alignments and corresponding index files
- AGP: coordinate system translation
- GFF3: feature format file (features, liftovers)
- YAML: data registry metadata file (see the data-registry doc)
- TAB: patches
- Common indices (bwa, bowtie2, blast, samtools)
- Patch files, patch scripts
- BSgenome packages (see a fine forge-script)
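The format list above could be checked mechanically when files are added to an assembly directory. The following Python sketch maps file names to the formats named in the list. The suffix table is an assumption made for illustration (the document does not prescribe file extensions), and `classify` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed suffix-to-format mapping; the extensions are illustrative only.
FORMAT_BY_SUFFIX = {
    ".fa": "FASTA",
    ".fasta": "FASTA",
    ".fq": "FASTQ",
    ".fastq": "FASTQ",
    ".bam": "BAM",
    ".bai": "BAM index",
    ".agp": "AGP",
    ".gff3": "GFF3",
    ".yaml": "YAML",
    ".yml": "YAML",
    ".tab": "TAB",
}

# Compression suffixes that are stripped before looking up the format.
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def classify(filename):
    """Return the format name for a file name, or None if it is not recognised."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    return FORMAT_BY_SUFFIX.get(suffixes[-1])


if __name__ == "__main__":
    for name in ["contigs.fa.gz", "reads_R1.fastq.gz", "aln.bam.bai", "README"]:
        print(name, classify(name))
```

A lookup table keeps the accepted formats in one place, so adding or removing a format does not require touching the classification logic.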
The minimum requirement for data files is: