Predict molecular events underlying disease from a patient’s genome and transcriptome

Challenge: SickKids clinical genomes and transcriptomes

Genotype data: encrypted, registered users only

Last updated: 13 August 2021

This challenge is open. This challenge will close on 15 October 2021.

How to participate in CAGI6?

Download data & submit predictions on Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy


This challenge involves data from 79 children who were referred to The Hospital for Sick Children’s (SickKids) Genome Clinic for genome sequencing because of suspected but undiagnosed genetic disorders. Research subjects are consented for sharing of their sequence data and phenotype information with researchers working to understand the molecular causes of rare disease. When a candidate disease variant believed to be related to the phenotype is identified, the variant is adjudicated and confirmed in a clinical setting. In this challenge, transcriptomic and phenotype data from a subset of the “solved” (diagnosed) and “unsolved” SickKids patients will be provided, along with corresponding genomic sequence data. The challenge is to use a transcriptome-driven approach to identify the gene(s) and molecular mechanisms underlying the phenotypic descriptions in each case. For the unsolved cases, prioritized variants from the participating teams will be examined to see if additional diagnoses can be made based on the predictions from the CAGI community.


One major obstacle facing rare disease patients is simply obtaining a genetic diagnosis (Rehm, 2017). The average “diagnostic odyssey” for rare disease families lasts more than five years, and over 50% of rare disease patients still lack a genetic diagnosis. A well-recognized obstacle to diagnosis is the technical limitation of variant analysis, which determines whether the disease-causing variants can be detected. Here we present transcriptomic data, which provide information on gene expression profiles and splicing variation, along with whole genome sequencing data, which provide coverage of coding and noncoding regions. This challenge focuses on the analytical obstacle to diagnosis and seeks to evaluate the ability of participants to recognize pathogenic variation within a sea of benign variation. Current standards for variant interpretation have been defined by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) and refined by ClinGen (Richards et al., 2015; Abou Tayoun et al., 2018). Cases caused by structural variation are not included in this challenge.

Previous SickKids challenges were included in CAGI4 and CAGI5, in which predictors had to identify pathogenic variants in genome sequence data for 25 and 24 children with undiagnosed disease, respectively. Results from the CAGI4 challenge showed that CAGI predictors were able to identify causal variants that were overlooked in the initial clinical pipeline analysis. Some of the variants indicated by CAGI predictors were subsequently validated with the clinician.

In this challenge, patient blood transcriptome data are included along with the genome sequence data in an attempt to expand the potential for diagnosis by exploring transcriptional alterations caused by genetic variants.

The SickKids Genome Clinic 

To pilot the implementation of genomic medicine in pediatrics, Dr. Meyn and colleagues developed the SickKids Genome Clinic, a multidisciplinary test bed that supports a wide range of research into the clinical uses of WGS, from development of new bioinformatics pipelines and counseling models to health economics and bioethics studies. Later, this research model was expanded to include a cohort of children, with unexplained medical complexity and prior negative conventional genetic testing, who are followed by a structured Complex Care Program. For the purposes of this challenge, we will consider the latter cohort as a branch of the SickKids Genome Clinic project. 

This project treats an individual’s genome as a source of genetic information to be repeatedly queried over time in order to manage overall health, rather than a one-time test performed solely for the purposes of diagnosis. We recruit children who are undergoing diagnostic molecular testing for a suspected genetic disorder, with DNA and RNA collected from the child (and, for the Complex Care Program participants, samples from available biological parents). Participating families undergo WGS and RNA sequencing after parents are counseled about WGS and affirm their preferences for learning secondary information such as their child's pharmacogenetic variants, adult-onset secondary (incidental) medically-actionable variants and carrier status variants.

Prediction challenge

The SickKids Genome Clinic is providing clinical phenotypic information in the form of summary data from PhenoTips, a database based on Human Phenotype Ontology (HPO) nomenclature (Robinson & Mundlos, 2010), together with transcriptomic data and whole genome sequence data for 79 children. Participants in this challenge are asked to provide a molecular diagnosis for as many probands as possible; that is, to prioritize genes, molecular events, and variants that are responsible for the phenotype of each proband.

The set of patients in the challenge consists of both “solved” and “unsolved” cases as determined by the SickKids research team, with the solved cases having been confirmed by the local clinician when possible. The solved cases will be used to evaluate the performance of each challenge participant and include a combination of those solved by RNA or DNA. The cases that remain unsolved after initial screening for plausible coding, splicing, and structural variants have been included in the challenge so that the CAGI community can help identify new potentially causal molecular events and variants. Top candidates are likely to undergo further experimental and clinical evaluation and may potentially lead to results that will be returned to the subjects. The CAGI organizers are not disclosing the numbers of solved and unsolved cases, so that participants perform the task in a manner that accurately reflects real-world clinical case analysis.

The genomic data were obtained by sequencing DNA purified from blood. Sequencing was performed by Complete Genomics (Stavropoulos et al. 2016) and by SickKids on an Illumina sequencer to greater than 36x depth (Lionel et al. 2018). Raw sequence reads were reassembled against a reference genome (GRCh37) and variant calling was completed with GATK. Sequence results consist of variant calls in the form of single nucleotide variants (SNVs) and small insertions/deletions (indels) within a joint variant call format (VCF) file. Structural variants are not included in the dataset, although they may be added at a later date during the challenge.
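As a rough illustration of working with a joint VCF, the following Python sketch lists the records at which one sample carries a non-reference allele. It relies only on the standard VCF column layout (eight fixed columns, then FORMAT, then one column per sample). The function name `iter_sample_variants` and the sample IDs `S1` and `S2` are hypothetical; the actual sample identifiers and any file-specific conventions should be taken from the challenge data.

```python
def iter_sample_variants(lines, sample):
    """Yield (chrom, pos, ref, alt, genotype) for records where `sample`
    carries at least one non-reference allele.

    `lines` is any iterable of VCF text lines; `sample` is a column name
    from the #CHROM header line (hypothetical IDs are used below).
    """
    col = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            col = fields.index(sample)  # ValueError if the sample is absent
            continue
        if col is None:
            raise ValueError("no #CHROM header line before the first record")
        chrom, pos, _id, ref, alt = fields[:5]
        # FORMAT (column 9) names the colon-separated per-sample subfields.
        keys = fields[8].split(":")
        values = fields[col].split(":")
        gt = dict(zip(keys, values)).get("GT", ".")
        alleles = gt.replace("|", "/").split("/")
        # "0" is the reference allele and "." is a missing call.
        if any(a not in ("0", ".") for a in alleles):
            yield chrom, int(pos), ref, alt, gt


# Tiny in-memory example with made-up records.
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30\t0/0:28\n",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0/0:30\t1/1:25\n",
    "2\t300\t.\tG\tGA\t50\tPASS\t.\tGT\t./.\t0|1\n",
]
s1_variants = list(iter_sample_variants(vcf_lines, "S1"))
s2_variants = list(iter_sample_variants(vcf_lines, "S2"))
```

For a compressed file, the same function can be fed from `gzip.open(path, "rt")`. A dedicated VCF library would be a more robust choice for real analyses; the sketch is only meant to show the column layout.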

RNA was extracted with an automated QIAsymphony PAXGene blood RNA kit (Qiagen). RNA-seq libraries were prepared using 250 ng of total RNA spiked with SIRV Set 3 (Lexogen) with the automated NEBNext Ultra II Directional RNA library prep kit for Illumina with polyA isolation module on the Agilent NGS workstation. Library size assessment, quantification, and sequencing were performed at The Centre for Applied Genomics (Toronto, Canada) using Agilent Bioanalyzer, KAPA qPCR, and Illumina NovaSeq6000. Libraries were sequenced, on average, to a depth of 100 million paired-end reads with 150 bp read lengths. Raw sequencing reads were aligned to a hybrid genome of hg19 (1000 Genomes reference genome, hs37d5) and the spike-in sequences (SIRVome, SIRV Set 3) using STAR (v2.6.1c). BAM files contain both mapped and unmapped reads. Duplicated reads were marked by Picard (v2.18.0) MarkDuplicates.

Training data 

Participants will be provided with training data from previous SickKids publications. 

Prediction submission format 

The prediction submission is a tab-delimited text file. The CAGI organizers will provide an example submission file. In addition, a validation script will be provided, and predictors must check the correctness of the format before submitting their predictions.

For each family ID, list the impacted gene(s) for each proband. Up to 100 lines can be listed as proposed solutions, each with the estimated strength of association, uncertainty estimate, and molecular event for the proposed solution. Each line should be reported in the following format, shown here as the column names followed by an example line:

HGNC symbol;    P;    SD;    Expression;    Splicing;    ASE;    Other;    RNA impact;    Refseq transcript version;    DNA variant;    Additional comments

COL6A2;    0.812;    1.36;    yes;    yes;    no;    no;    decreased expression, skipping of exon 17;    NM_001849.4;    optional;    optional
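The following Python sketch shows one way to assemble a single tab-delimited prediction line from the columns listed above. It is an illustration only, not the official validation script, which remains the authority on the format. The column order is taken from the example; the function name `format_prediction`, the restriction of the four event columns to "yes"/"no", and the requirement that SD be non-negative are assumptions inferred from the example line. How the family ID is encoded is not shown in this description, so it is left out here.

```python
# Column names copied from the example format above.
FIELDS = [
    "HGNC symbol", "P", "SD", "Expression", "Splicing", "ASE", "Other",
    "RNA impact", "Refseq transcript version", "DNA variant",
    "Additional comments",
]
# Assumption: these four columns take "yes" or "no", as in the example.
FLAG_FIELDS = ("Expression", "Splicing", "ASE", "Other")


def format_prediction(record):
    """Return one tab-delimited line for a dict keyed by FIELDS."""
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name in FLAG_FIELDS:
        if record[name] not in ("yes", "no"):
            raise ValueError(f"{name} must be 'yes' or 'no'")
    # Assumption: P and SD are numeric and SD is non-negative.
    float(record["P"])
    if float(record["SD"]) < 0:
        raise ValueError("SD must be non-negative")
    values = [str(record[name]) for name in FIELDS]
    if any("\t" in v or "\n" in v for v in values):
        raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(values)


# The example line from above, expressed as a record.
example = {
    "HGNC symbol": "COL6A2",
    "P": 0.812,
    "SD": 1.36,
    "Expression": "yes",
    "Splicing": "yes",
    "ASE": "no",
    "Other": "no",
    "RNA impact": "decreased expression, skipping of exon 17",
    "Refseq transcript version": "NM_001849.4",
    "DNA variant": "optional",
    "Additional comments": "optional",
}
line = format_prediction(example)
```

Any file built this way should still be checked with the validation script provided by the CAGI organizers before submission.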

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information will be submitted as a separate file. You can also optionally upload your code.

File naming

CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; the last submission before the deadline will be evaluated for each model.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

To include a description for your method(s), use the following filename: <teamname>_desc.*

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
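The naming rules above can be checked before upload with a short Python sketch such as the one below. The regular expressions are a direct reading of the two filename patterns; the function name `classify_filename` is hypothetical, and which characters are permitted in a team name is not specified here, so the sketch accepts any non-empty name.

```python
import re

# <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
MODEL_RE = re.compile(r"(?P<team>.+)_model_(?P<model>[1-6])\.(?:tsv|txt)")
# <teamname>_desc.* (any extension)
DESC_RE = re.compile(r"(?P<team>.+)_desc\..+")


def classify_filename(filename):
    """Return ("model", team, number), ("description", team, None),
    or None if the name matches neither pattern."""
    match = MODEL_RE.fullmatch(filename)
    if match:
        return ("model", match.group("team"), int(match.group("model")))
    match = DESC_RE.fullmatch(filename)
    if match:
        return ("description", match.group("team"), None)
    return None


result = classify_filename("bestincagi_model_3.txt")
```

Here `result` identifies the file as model 3 for team "bestincagi"; a name such as "bestincagi_model_7.txt" is rejected because only models 1 through 6 are allowed.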


Predictions will be assessed by independent assessors. Note that causal variants in the solved cases may not be known with certainty; that is, two different clinical groups could potentially arrive at different diagnoses. This situation will be accounted for in the assessment for both fully computational predictions and other clinical laboratories participating in the challenge.

For the top-performing teams, genomic analysts on the SickKids team will re-review the molecular events and variants for undiagnosed cases to see if they are diagnostic or merit further evaluation.

Ethical considerations

The data in this challenge are derived from patients with rare, in some cases currently undiagnosed, diseases and their parents. Identification of putative pathogenic variants (i.e., causal with respect to the clinical phenotype under investigation) may, if confirmed, be important for tailoring clinical interventions and obtaining social services.

While families participating in the SickKids Genome Clinic project have been explicitly consented with respect to the receipt of genetic results unrelated to the rare condition under investigation, predictors are reminded that identification of pathogenic variants related to adult-onset conditions (such as might be identified as secondary variants in this challenge) in minors remains ethically controversial, though some commentators see value in the identification of such variants in parents.

Data provided by

Stephen Meyn, Christian Marshall, Gregory Costain, Michael Wilson, Lianna G. Kyriakopoulou, Kyoko Yuki and Huayun Hou on behalf of the SickKids Genome Clinic, a project of the Centre for Genetic Medicine of the Hospital for Sick Children, Toronto.


Abou Tayoun AN, et al. Recommendations for interpreting the loss of function PVS1 ACMG/AMP variant criterion. Hum Mutat (2018) 39(11):1517-1524.

Lionel AC, et al. Improved diagnostic yield compared with targeted gene sequencing panels suggests a role for whole-genome sequencing as a first-tier genetic test. Genet Med (2018) 20(4):435-443.

Rehm HL. Evolving health care through personal genomics. Nat Rev Genet (2017) 18(4):259-267.

Richards S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med (2015) 17(5):405-424.

Robinson PN, Mundlos S. The human phenotype ontology. Clin Genet (2010) 77(6):525-534. PMID: 20412080

Stavropoulos DJ, et al. Whole genome sequencing expands diagnostic utility and improves clinical management in pediatric medicine. NPJ Genom Med (2016) 1:15012.

Revision history 

4 August 2021: challenge released

13 August 2021: submission deadline extended to October 15