
Predict patients’ clinical descriptions and pathogenic variants from their genome sequences

Challenge: SickKids clinical genomes

Dataset availability: encrypted, registered users only

Last updated: 24 April 2018

This challenge is closed.

Make sure you understand our Data Use Agreement and Anonymity Policy


Please be aware that CAGI "C" should actually be a 13-year-old boy with visual problems and CAGI "D" is a 19-year-old, not a 9-year-old. Please download the revised phenotypes for patients C and D. The submission validation script is available only to registered users.


This challenge involves 24 children who were referred for clinical genome sequencing with suspected genetic disorders in three classes: 6 eye disorders, 7 neurogenetic diseases, and 11 connective tissue disorders (including Ehlers-Danlos). Predictors are given the 24 genome sequences. Predictors are also separately provided with the phenotypic descriptions as shared with the diagnostic laboratory. The challenge is to predict what class of disease is associated with each genome, and which genome corresponds to which specific clinical description. Predictors may additionally identify the diagnostic variant(s) underlying the predictions, and identify predictive secondary variants conferring high risk of other diseases whose phenotypes are not reported in the clinical descriptions.


Our current ability to translate genomic data into clinical knowledge remains limited, due to our inability to accurately interpret novel sequence variants as well as the widespread misclassification of known variants. Consequently, clinical assessments of sequence data currently yield more variants of uncertain significance (VUS) than causal variants. The difficulties are particularly acute when trying to identify medically significant predictive (secondary) variants, where the prior probability of clinical significance is lower.

Whole exome sequencing (WES) has been widely adopted for testing of patients clinically suspected of a genomic disorder, yet the majority of these patients remain undiagnosed. As sequencing costs have dropped, whole genome sequencing (WGS) has become an attractive alternative to WES, as WGS offers the possibility of increased diagnostic yield, due to a much more comprehensive assessment of potentially causal variants. However, existing computational methods and databases effectively limit clinical analysis of WGS data to the equivalent of a virtual WES and whole genome microarray.

A similar challenge was included in CAGI4 in which predictors had to predict patients’ clinical descriptions and pathogenic variants for 25 children with undiagnosed disease. Results from this challenge showed that CAGI predictors were able to identify causal variants that were overlooked in the initial clinical pipeline analysis. Some of the variants indicated by CAGI predictors were validated with the clinician.

The SickKids Genome Clinic 

To pilot the implementation of genomic medicine in pediatrics, Dr. Meyn and colleagues developed the SickKids Genome Clinic, a multidisciplinary test bed that supports a wide range of research into the clinical uses of WGS, from development of new bioinformatics pipelines and counselling models to health economics and bioethics studies.

This project treats an individual’s genome as a source of genetic information to be repeatedly queried over time in order to manage overall health, rather than a one-time test performed to diagnose a pre-existing disorder. We recruit children who are undergoing diagnostic molecular testing for a suspected genetic disorder, with DNA and RNA collected from the child and both parents. Participating children undergo WGS after their parents are counselled about WGS and have declared their preferences for learning their child's pharmacogenetic variants, adult-onset secondary (incidental) medically actionable variants, and carrier status variants.

Prediction challenge

The challenge presented here is to use computational methods to predict what class of disease is associated with each genome, and also which genome corresponds to which specific clinical description. Predictors may additionally identify the diagnostic variant(s) underlying the predictions, and identify predictive secondary variants, that is, variants conferring high risk of other diseases whose phenotypes are not reported in the clinical descriptions. The SickKids Genome Clinic is providing clinical phenotypic information and whole genome sequence data for 24 cases that remain unsolved after initial screening for plausible coding, splicing, and structural variants. In addition, some patients have one or more pathogenic predictive secondary variants in known disease genes that are being returned to the patients because of their potential clinical significance.

To model the clinical testing environment, the clinical phenotypic data are those that accompanied the initial sequencing request and do not necessarily capture the full range of phenotypic abnormalities that might be obtained upon a comprehensive, genotype-driven reassessment of the patient. Phenotypic data were provided by the referring clinical geneticist and entered into Phenotips, a Human Phenotype Ontology-based database. The diversity of phenotypes in the dataset represents the range of clinical presentations routinely seen in children referred for clinical microarray analysis at SickKids.

The genomic data were obtained by sequencing DNA purified from blood. Sequencing was performed by Illumina HiSeq. Raw sequence reads were reassembled against a reference genome (GRCh37).

Participants are asked to submit predictions of the broad class of disease and the more specific clinical description that matches each genome. In addition, they may report the genomic variant(s) underlying each prediction (i.e., diagnostic variants) and/or secondary variants predicted to confer high risk of other disorders not present in the clinical phenotypic description (i.e., predictive secondary variants).

The submitted prediction should be the probability (0-1; 0 = no match, 1 = match) that a genome sequence matches each broad and specific clinical phenotypic description, the diagnostic variant(s), and the predictive secondary variant(s). Each predicted probability of a match must include a standard deviation indicating confidence in the prediction: low SD indicates high confidence, and high SD indicates low confidence. The predictions of genotype-phenotype matches will be assessed against the information provided by the referring clinical geneticists. The diagnostic and predictive secondary variant(s) will be compared to interpretations from the clinical laboratory, but these are not known with certainty, so the predictions of variants cannot be rigorously assessed.
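As an illustration of the probability-plus-SD requirement, the following minimal Python sketch shows one way a predictor might derive both numbers for a single genome-description pair. It assumes the predictor has several replicate match probabilities for that pair (for example from resampling or an ensemble of models). This replicate strategy and the function name `summarize_match` are assumptions made for illustration. The challenge does not prescribe how the SD is obtained.

```python
from statistics import mean, pstdev


def summarize_match(replicate_probs):
    """Collapse replicate match probabilities into (probability, SD).

    replicate_probs: probabilities in [0, 1] for ONE genome-description
    pair, e.g. from repeated runs of a hypothetical scoring pipeline.
    """
    if not replicate_probs:
        raise ValueError("need at least one replicate probability")
    for p in replicate_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    # The mean is reported as the match probability; the spread across
    # replicates is reported as the SD (low SD = high confidence).
    return mean(replicate_probs), pstdev(replicate_probs)


if __name__ == "__main__":
    prob, sd = summarize_match([0.82, 0.78, 0.90, 0.85])
    print(f"P = {prob:.3f}, SD = {sd:.3f}")
```

Replicates that agree closely give a small SD, which matches the convention above that a low SD signals high confidence.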

Prediction submission format 

The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

In the template file, the cells to be filled in are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must enter a probability for each pathogenicity class for every mutant. Optionally, enter a brief comment on the basis of the prediction; otherwise, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.

Note that every genome sequence matches exactly one clinical phenotypic description. Predictions of predictive secondary variants (PSV) and comments (C) are optional.
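The formatting rules above can be checked locally before running the official validation script, which remains the authoritative check. The Python sketch below is not that script. It assumes a simplified, hypothetical layout with a header row and columns named `Genome`, `P`, `SD`, `PSV` and `C`, whereas the real template defines the actual columns. It flags empty cells, "*" placeholders left in required columns, probabilities outside 0-1, and negative standard deviations.

```python
def check_submission(text, optional=("PSV", "C")):
    """Return a list of problems found in a tab-delimited submission.

    Column names (P, SD, PSV, C) are hypothetical stand-ins for the
    columns defined in the organizers' template.
    """
    problems = []
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]
    header = lines[0].split("\t")
    for line_no, line in enumerate(lines[1:], start=2):
        cells = line.split("\t")
        if len(cells) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} cells, found {len(cells)}"
            )
            continue
        for name, cell in zip(header, cells):
            value = cell.strip()
            if value == "":
                problems.append(f"line {line_no}: empty cell in column {name}")
            elif value == "*":
                # The placeholder may stay only in optional columns.
                if name not in optional:
                    problems.append(
                        f"line {line_no}: placeholder left in column {name}"
                    )
            elif name in ("P", "SD"):
                try:
                    number = float(value)
                except ValueError:
                    problems.append(f"line {line_no}: {name} is not a number")
                    continue
                if name == "P" and not 0.0 <= number <= 1.0:
                    problems.append(f"line {line_no}: P outside the 0-1 range")
                if name == "SD" and number < 0.0:
                    problems.append(f"line {line_no}: SD is negative")
    return problems


if __name__ == "__main__":
    example = "Genome\tP\tSD\tPSV\tC\nA\t0.9\t0.05\t*\t*\n"
    print(check_submission(example) or "no problems found")
```

A file that passes this sketch can still fail the official validator, since the sketch knows nothing about the real column set or the required number of rows.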

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group.  For more details, please read the FAQ page.

The file structure of the sequence dataset is described here

Data provided by

Stephen Meyn, Christian Marshall, D. James Stavropoulos, Sarah Bowdin and Nasim Monfarad, on behalf of the SickKids Genome Clinic, a project of the Centre for Genetic Medicine of the Hospital for Sick Children, Toronto.

Revision history 

22 December 2017: Initial release 

9 January 2018: Link to VCF files updated 

1 February 2018: Fixed validation script and template 

23 March 2018: New template file and Illumina HiSeq data info 

16 April 2018: Revised phenotypes and challenge date

26 April 2018: Challenge closed

24 September 2018: Dataset availability added