ASM file structure and definitions
The following describes the file structure of the sequence dataset for the SickKids clinical genomes challenge. Further information on the Complete Genomics file structure is available in the Complete Genomics documentation.
CGI files and Structure
ASM directory:
The files in the ASM directory describe and annotate the sample’s genome assembly with respect to the reference genome.
CAGI note: The total space would typically be ~35 GB per genome, but we did not transfer the REF and EVIDENCE folders (which are very large), so the files for each genome should total ~2 GB.
The ASM directory should contain these files:
- Variations: The primary results of the assembly, describing the variant and non-variant alleles found. Called variants in this file are cross-referenced with entries in dbSNP and the Catalogue of Somatic Mutations in Cancer (COSMIC).
- Master Variations: Results of the assembly describing variant and non-variant alleles found, with annotation information in a one-line-per-locus format. This file is derived from the Variations file and is a simple, integrated report of the variant calls and annotation information.
- vcfBeta: The small variant, copy number variation (CNV), structural variation (SV), and mobile element insertion (MEI) calls made by the Complete Genomics Assembly Pipeline for a single genome, with scores and annotations. The file conforms to the VCF 4.1 specification.
- Genes: Annotated variants within known protein coding genes.
- ncRNAs: Annotated variants within non-coding RNAs.
- Gene Variation Summary: Count of variants in known genes.
- DB SNP: Variations in known dbSNP loci.
- Variations and Annotations Summary: Statistics of sequence data to assess genome quality.
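Because the vcfBeta file is described as conforming to VCF 4.1, its records can be read with a few lines of generic code. The following is a minimal sketch, not a Complete Genomics tool: it relies only on the standard VCF 4.1 layout (tab-separated columns, header lines starting with `#`, symbolic alleles in angle brackets, breakends written with square brackets). The function names are hypothetical, and the sketch does not interpret any Complete Genomics-specific annotations.

```python
from collections import Counter

# The eight fixed columns defined by the VCF 4.1 specification.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict of its fixed columns."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # ALT may list several alleles
    return record


def alt_kind(alt):
    """Classify an ALT allele by its VCF 4.1 notation only."""
    if alt == ".":
        return "no_alt"
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"   # e.g. <CNV>, <DEL>
    if "[" in alt or "]" in alt:
        return "breakend"   # structural variation breakend notation
    return "sequence"       # explicit bases (SNPs, small indels)


def summarize_vcf(lines):
    """Count ALT alleles by notation, skipping header and blank lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        for alt in parse_vcf_line(line)["ALT"]:
            counts[alt_kind(alt)] += 1
    return counts
```

To use it on a real file, pass an open text handle, for example `summarize_vcf(open(path))`, where `path` points at the (decompressed) vcfBeta file for one genome. Counting by notation is only a rough way to separate small variants from CNV, SV, and MEI records; the record-level annotations in the file are the authoritative source.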
In addition to the variations file, the ASM directory includes annotations of the assembled sequence with respect to the SNP database (dbSNP), RefSeq transcripts, and protein sequences. The ASM directory also includes the following subdirectories, which are small and take up negligible space:
- CNV (~15 MB) - Files containing the segmentation of the reference genome into regions of distinct ploidy. For normal genomes, each segment is reported with its ploidy, coverage, confidence scores, and annotations.
- SV (~25 MB) - Files containing detected junctions, the DNB mappings that support them, and associated annotations, including breakpoint coordinates, putative structural variation size, confidence scores, and overlap with genomic elements.
- MEI (~2 MB) - Files containing detected mobile element insertions and associated annotations, including event type, count of DNBs supporting the event versus the reference, confidence score for the called event type, and overlap with genomic elements.
- REPORTS (~10 MB) - Files containing information that can be used to assess the quality and characteristics of the sequenced genome, including the distribution of coverage, coverage by GC content, and the sizes of called indels and substitutions, both genome-wide and in coding regions.
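After downloading a genome, it can be useful to confirm that the subdirectories listed above are present and roughly the expected size. The sketch below is one way to do that, assuming the subdirectories sit directly under the ASM directory with exactly the names used in this list; the function names are hypothetical and not part of any Complete Genomics software.

```python
from pathlib import Path

# Subdirectory names taken from the list above; REF and EVIDENCE are
# deliberately omitted because they were not transferred.
EXPECTED_SUBDIRS = ("CNV", "SV", "MEI", "REPORTS")


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def check_asm_layout(asm_dir):
    """Map each expected subdirectory to its size in bytes, or None if missing."""
    asm = Path(asm_dir)
    report = {}
    for name in EXPECTED_SUBDIRS:
        sub = asm / name
        report[name] = dir_size_bytes(sub) if sub.is_dir() else None
    return report
```

A call such as `check_asm_layout("path/to/ASM")` returns a dictionary like `{"CNV": ..., "SV": ..., "MEI": ..., "REPORTS": ...}`. A value of `None` flags a missing subdirectory, and sizes far from the approximate figures quoted above may indicate an incomplete transfer.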