
Predict patients’ clinical descriptions and pathogenic variants from their genome sequences

Challenge: SickKids clinical genomes

Dataset description: public

Genome sequence data: registered users only, limited by CAGI Data Use Agreement

Last updated: 12 April 2016

This challenge will tentatively close at 9:00 PM PST (Pacific Standard Time) on 1 February 2016.


The answers to the matching portion of this challenge have been inadvertently published. Supplementary Table 4 of Stavropoulos, D.J., et al. (2016) describes the phenotypes of the 25 individuals in the challenge, using the same IDs as in the genome sequence files provided below. On that basis, the genomes can be matched to the cases. Thus, the major part of the challenge is spoiled.

We will continue with the “soft” challenge of identifying diagnostic variants and predictive secondary variants. This will be “soft” because there are no agreed answers, so these cannot be effectively assessed. We suggest that participants make use of the newly published information to better focus their analysis on likely genes of interest in each case. To facilitate this change in the challenge, we provide the answers to the matching challenge. In your prediction submission, please use the answer key provided to enter the correct matching information into the template with probability = 1 and standard deviation = 0 for the correct matches and leave “*” in all fields for the incorrect matches.
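As an illustration of the instruction above, the following minimal Python sketch builds one genome's row with the known match entered as probability 1 and standard deviation 0, and "*" left in every field of the incorrect matches. It assumes that each clinical description occupies five consecutive columns in the order P, SD, DV, PSV, C; the function name fill_row and the identifiers G1, D1, and D2 are hypothetical. The official submission template and validation script remain authoritative.

```python
# Assumed column order for each clinical description (hypothetical layout;
# check it against the official submission template).
FIELDS = ("P", "SD", "DV", "PSV", "C")
PLACEHOLDER = "*"


def fill_row(genome_id, description_ids, matched_description,
             dv=PLACEHOLDER, psv=PLACEHOLDER, comment=PLACEHOLDER):
    """Build one tab-delimited row for a genome.

    The matched description gets P = 1 and SD = 0 plus any variants and
    comment supplied; every other description keeps "*" in all fields.
    """
    cells = [genome_id]
    for description in description_ids:
        if description == matched_description:
            cells.extend(["1", "0", dv, psv, comment])
        else:
            cells.extend([PLACEHOLDER] * len(FIELDS))
    return "\t".join(cells)


# Hypothetical identifiers, for illustration only.
row = fill_row("G1", ["D1", "D2"], "D2", dv="1:100:A:G")
print(row)
```

In this sketch the diagnostic variant, secondary variant, and comment default to "*", so a row can be generated from the answer key first and the variant fields filled in later.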

While it is a pity the matching portion of the challenge will not be available, the main scientific and clinical relevance of the challenge very much remains. We hope that the results will lead to a productive exchange on what can be done about these difficult, unsolved cases. The data providers may also provide feedback on proposed variants. Of particular interest are views on the relationship between the results and what constitutes sufficient evidence for a variant to be deemed pathogenic in the clinical sense, a major issue in the diagnostic lab. The secondary predictive variants portion of the challenge will be similarly useful.

Download answer key, predictions, and assessment: registered users only, limited by CAGI Data Use Agreement. The answer key, predictions, and assessment files are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the file.

Presentations from the CAGI 4 conference: registered users only, limited by CAGI Data Use Agreement. Presentations are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the file.

Summary

Realizing the promise of precision medicine will require developing methods for interpreting genome sequence data to infer individuals’ phenotypic traits and predispositions to disease. This challenge involves 25 children with suspected genetic disorders who were referred for clinical genome sequencing. Predictors are given the children's genome sequences and their clinical phenotypic descriptions, as provided to the diagnostic laboratory, and asked to predict which genome corresponds to which clinical description. Predictors are also asked to identify the diagnostic variants underlying the predictions and, optionally, to identify predictive secondary variants conferring high risk of other diseases whose phenotypes are not reported in the clinical descriptions.


Genomic medicine is expected to transform health care by using individualized genomic information to predict, diagnose and manage disease. The success of precision genomic medicine is dependent on rapid, comprehensive, and accurate assessment of the clinical significance of an individual’s genomic variants. However, our current ability to translate genomic data into clinical knowledge is quite limited, in large part due to our inability to accurately interpret novel sequence variants as well as the widespread misclassification of known variants. Consequently, clinical assessments of sequence data currently yield more variants of unknown significance (VUS) than causal variants. The difficulties are particularly acute when trying to identify medically significant predictive (secondary) variants, where the prior probability of clinical significance is lower.

Whole exome sequencing (WES) has been widely adopted for hypothesis-free testing of patients clinically suspected of a genomic disorder, yet the majority of these patients remain undiagnosed. As sequencing costs have dropped, whole genome sequencing (WGS) has become an attractive alternative to WES, as WGS offers the possibility of increased diagnostic yield, due to a much more comprehensive assessment of potentially causal variants. However, existing computational methods and databases effectively limit clinical analysis of WGS data to the equivalent of a virtual WES and whole genome microarray.

The SickKids Genome Clinic 

To pilot the implementation of genomic medicine in pediatrics, we have developed the SickKids Genome Clinic, a multidisciplinary test bed that supports a wide range of research into the clinical uses of WGS, from development of new bioinformatics pipelines and counseling models to health economics and bioethics studies.

This project treats an individual’s genome as a source of genetic information to be repeatedly queried over time in order to manage overall health, rather than a one-time test performed to diagnose a pre-existing disorder. We recruit children who are undergoing diagnostic molecular testing for a suspected genetic disorder, with DNA and RNA collected from the child and both parents. Participating children undergo WGS after their parents are counseled about WGS and have declared their preferences for learning their child's pharmacogenetic variants, adult-onset secondary (incidental) medically actionable variants, and carrier status variants.

Prediction challenge

The challenge presented here is to use computational methods to match each genome to the corresponding clinical phenotypic description, and to identify those genomic variants which contribute to the known phenotype (diagnostic variants), as well as other genomic variants which are likely to confer a high risk of other disease (predictive secondary variants). The SickKids Genome Clinic is providing clinical phenotypic information and whole genome sequence data for 25 cases that remain unsolved after initial screening for plausible coding, splicing, and structural variants. In addition, some patients have one or more pathogenic secondary predictive variants in known disease genes that are being returned to the patients because of their potential clinical significance.

To model the clinical testing environment, the clinical phenotypic data are those that accompanied the initial sequencing request and do not necessarily capture the full range of phenotypic abnormalities that might be obtained upon a comprehensive, genotype-driven reassessment of the patient. Phenotypic data were provided by the referring clinical geneticist and entered into Phenotips, a Human Phenotype Ontology-based database. The diversity of phenotypes in the dataset represents the range of clinical presentations routinely seen in children referred for clinical microarray analysis at SickKids.

The genomic data were obtained by sequencing DNA purified from blood. Sequencing was performed by Complete Genomics, Inc. (Mountain View, CA). Raw sequence reads were reassembled against a reference genome (GRCh37) and variant calling was completed using Complete Genomics assembly pipeline 2.5. Sequence results consist of variant calls in the form of (i) single nucleotide variants (SNV) and small insertions/deletions (InDels), (ii) structural variants (based on abnormal junction and discordant mate pair clusters, with size typically 50-75,000 bp), and (iii) copy number variants (based on normalized sequencing coverage, with size typically greater than 2,000 bp). A description of the files in the genome sequence dataset will be provided soon.

Participants are asked to submit predictions of the clinical phenotypic description that matches each genome, and to report the genomic variant(s) underlying each prediction (i.e., diagnostic variants). In addition, participants should report secondary variants predicted to confer high risk of other disorders not present in the clinical phenotypic description (i.e., predictive secondary variants). The submitted prediction should comprise the probability (0-1; 0 = no match, 1 = match) that a genome sequence matches a clinical phenotypic description, the diagnostic variant(s), and the predictive secondary variant(s). Each predicted probability of a match must include a standard deviation indicating confidence in the prediction: low SD indicates high confidence, and high SD indicates low confidence.

The predictions of genotype-phenotype matches will be assessed against the information provided by the referring clinical geneticists. The diagnostic and predictive secondary variant(s) will be compared to interpretations from the clinical laboratory, but these are not known with certainty, so the predictions of variants cannot be rigorously assessed.

Download genome sequence dataset: This dataset file is available only to registered users. Please log in to access the file. The file structure of the sequence dataset is described HERE

Download clinical phenotypic descriptions: This dataset file is available only to registered users. Please log in to access the file.

Download submission template: This submission template file is available only to registered users. Please log in to access the file.

Download submission validation script: This submission validation script is available only to registered users. Please log in to access the file.

Prediction submission format 

The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

For each genome sequence, give the probability of a match to each clinical phenotypic description and list the predicted diagnostic variant(s) supporting the match to that description. Optionally, also list secondary variant(s) predicted to confer high risk of other disorders not reported in the clinical phenotypic description.

In the tab-delimited submission template provided, the first column lists the genome sequence identifiers, and the first row lists the clinical phenotypic description identifiers followed by P (for probability), SD (for standard deviation), DV (for diagnostic variant(s)), PSV (for predictive secondary variant(s)), or C (for comment). Each "*" should be replaced by a probability P of a match between the genome sequence and the clinical phenotypic description (real number, 0 to 1; 0 = no match, 1 = match), a standard deviation SD indicating the confidence of the prediction, one or more predicted diagnostic variants DV from the genome sequence dataset, optionally one or more predictive secondary variants PSV conferring high risk of other disorders, and an optional comment C on the basis of the predictions (otherwise leave the "*" in the optional fields).

Variants should be reported within a single DV (diagnostic variant) or PSV (predictive secondary variant) field in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated.

Genomes (rows) and clinical phenotypic descriptions (columns) for which no predictions are made should be left all "*". These will be treated as probability zero assignments in the assessment. Note that every genome sequence matches exactly one clinical phenotypic description. Predictions of predictive secondary variants PSV and comments C are optional.
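The field rules above can be sketched in Python as a quick pre-submission check. This is only an illustration and not the official validation script: the function names format_variants and check_prediction are hypothetical, and the pattern for CHROM:POS:REF:ALT assumes chromosome names 1-22, X, Y, M, or MT (optionally prefixed with "chr") and alleles written with the letters A, C, G, T, or N.

```python
import re

PLACEHOLDER = "*"

# Assumed shape of one CHROM:POS:REF:ALT entry; the official validation
# script may accept other chromosome names or allele notations.
VARIANT_RE = re.compile(
    r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT):[1-9][0-9]*:[ACGTN]+:[ACGTN]+$"
)


def format_variants(variants):
    """Join (chrom, pos, ref, alt) tuples into one comma-separated field."""
    if not variants:
        return PLACEHOLDER
    return ",".join(f"{c}:{p}:{r}:{a}" for c, p, r, a in variants)


def check_prediction(p, sd, dv, psv=PLACEHOLDER):
    """Return a list of problems found in one genome/description prediction."""
    problems = []
    if all(field == PLACEHOLDER for field in (p, sd, dv, psv)):
        return problems  # no prediction made for this pair
    try:
        if not 0.0 <= float(p) <= 1.0:
            problems.append("P must be between 0 and 1")
    except ValueError:
        problems.append("P is not a number")
    try:
        if float(sd) < 0.0:
            problems.append("SD must be non-negative")
    except ValueError:
        problems.append("SD is not a number")
    for name, field in (("DV", dv), ("PSV", psv)):
        if field == PLACEHOLDER:
            continue  # left empty; treated as acceptable in this sketch
        for variant in field.split(","):
            if not VARIANT_RE.match(variant):
                problems.append(f"{name} entry {variant!r} is not CHROM:POS:REF:ALT")
    return problems


dv = format_variants([("1", 12345, "A", "G"), ("X", 200, "AT", "A")])
print(dv, check_prediction("0.8", "0.1", dv))
```

The check treats a cell group that is entirely "*" as "no prediction made", mirroring the rule that such entries count as probability zero in the assessment.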

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Data provided by

Stephen Meyn, Christian Marshall, D. James Stavropoulos, Sarah Bowdin and Nasim Monfarad, on behalf of the SickKids Genome Clinic, a project of the Centre for Genetic Medicine of the Hospital for Sick Children, Toronto.


22 Dec 2015 (v01): initial release 

23 Dec 2015 (v02): genome sequence dataset provided 

4 Jan 2016 (v03): description of the dataset files provided

8 Jan 2016 (v04): submission template and validation script provided

21 Jan 2016 (v05): matching portion spoiled, answer key provided

18 Mar 2016 (v06): predictions provided 

12 Apr 2016 (v07): presentations provided