Predict patients’ clinical descriptions and pathogenic variants from their genome sequences

Challenge: SickKids clinical genomes
Dataset description: public
Genome sequence data: registered users only, limited by the CAGI Data use agreement
Last updated: 17 Jan 2017.
This challenge tentatively closes at 9:00 PM PST (Pacific Standard Time) on TBD.

Summary
This challenge involves 24 children who were referred for clinical genome sequencing with suspected genetic disorders in three classes: 6 eye disorders, 7 neurogenetic diseases, and 11 connective tissue disorders (including Ehlers-Danlos). Predictors are given the 24 genome sequences. Predictors are also separately provided with the phenotypic descriptions as shared with the diagnostic laboratory. The challenge is to predict what class of disease is associated with each genome, and which genome corresponds to which specific clinical description. Predictors may additionally identify the diagnostic variant(s) underlying the predictions, and identify predictive secondary variants conferring high risk of other diseases whose phenotypes are not reported in the clinical descriptions.

Background
Our current ability to translate genomic data into clinical knowledge remains limited, due to our inability to accurately interpret novel sequence variants as well as the widespread misclassification of known variants. Consequently, clinical assessments of sequence data currently yield more variants of uncertain significance (VUS) than causal variants. The difficulties are particularly acute when trying to identify medically significant predictive (secondary) variants, where the prior probability of clinical significance is lower.

Whole exome sequencing (WES) has been widely adopted for testing of patients clinically suspected of a genomic disorder, yet the majority of these patients remain undiagnosed. As sequencing costs have dropped, whole genome sequencing (WGS) has become an attractive alternative to WES, as WGS offers the possibility of increased diagnostic yield, due to a much more comprehensive assessment of potentially causal variants. However, existing computational methods and databases effectively limit clinical analysis of WGS data to the equivalent of a virtual WES and whole genome microarray.

A similar challenge was included in CAGI4 (https://genomeinterpretation.org/content/4-SickKids_clinical_genomes) in which predictors had to predict patients’ clinical descriptions and pathogenic variants for 25 children with undiagnosed disease. Results from this challenge showed that CAGI predictors were able to identify causal variants that were overlooked in the initial clinical pipeline analysis. Some of the variants indicated by CAGI predictors were validated with the clinician.

The SickKids Genome Clinic
To pilot the implementation of genomic medicine in pediatrics, Dr. Meyn and colleagues developed the SickKids Genome Clinic, a multidisciplinary test bed that supports a wide range of research into the clinical uses of WGS, from development of new bioinformatics pipelines and counselling models to health economics and bioethics studies.

This project treats an individual’s genome as a source of genetic information to be repeatedly queried over time in order to manage overall health, rather than a one-time test performed to diagnose a pre-existing disorder. We recruit children who are undergoing diagnostic molecular testing for a suspected genetic disorder, with DNA and RNA collected from the child and both parents. Participating children undergo WGS after their parents are counselled about WGS and have declared their preferences for learning their child's pharmacogenetic variants, adult-onset secondary (incidental) medically-actionable variants and carrier status variants.

Prediction challenge
The challenge presented here is to use computational methods to predict what class of disease is associated with each genome, and also which genome corresponds to which specific clinical description. Predictors may additionally identify the diagnostic variant(s) underlying the predictions, and identify predictive secondary variants, that is, variants conferring high risk of other diseases whose phenotypes are not reported in the clinical descriptions. The SickKids Genome Clinic is providing clinical phenotypic information and whole genome sequence data for 24 cases that remain unsolved after initial screening for plausible coding, splicing, and structural variants. In addition, some patients have one or more pathogenic predictive secondary variants in known disease genes that are being returned to the patients because of their potential clinical significance.

To model the clinical testing environment, the clinical phenotypic data are those that accompanied the initial sequencing request and do not necessarily capture the full range of phenotypic abnormalities that might be obtained upon a comprehensive, genotype-driven reassessment of the patient. Phenotypic data were provided by the referring clinical geneticist and entered into Phenotips, a Human Phenotype Ontology-based database. The diversity of phenotypes in the dataset represents the range of clinical presentations routinely seen in children referred for clinical microarray analysis at SickKids.

The genomic data were obtained by sequencing DNA purified from blood. Sequencing was performed by Complete Genomics, Inc. (Mountain View, CA). Raw sequence reads were reassembled against a reference genome (GRCh37) and variant calling was completed using Complete Genomics assembly pipeline 2.5. Sequence results consist of variant calls in the form of (i) single nucleotide variants (SNVs) and small insertions/deletions (InDels), (ii) structural variants (based on abnormal junction and discordant mate pair clusters, with size typically 50-75,000 bp), and (iii) copy number variants (based on normalized sequencing coverage, with size typically greater than 2,000 bp). A description of the files in the genome sequence dataset will be provided soon.

Participants are asked to submit predictions of the broad class of disease and the more specific clinical description that matches each genome. In addition, they may report the genomic variant(s) underlying each prediction (i.e., diagnostic variants) and/or secondary variants predicted to confer high risk of other disorders not present in the clinical phenotypic description (i.e., predictive secondary variants).

The submitted prediction should be the probability (0-1; 0 = no match, 1 = match) that a genome sequence matches each broad and specific clinical phenotypic description, the diagnostic variant(s), and the predictive secondary variant(s). Each predicted probability of a match must include a standard deviation indicating confidence in the prediction: low SD indicates high confidence, and high SD indicates low confidence. The predictions of genotype-phenotype matches will be assessed against the information provided by the referring clinical geneticists. The diagnostic and predictive secondary variant(s) will be compared to interpretations from the clinical laboratory, but these are not known with certainty, so the predictions of variants cannot be rigorously assessed.
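The challenge does not prescribe how the standard deviation should be derived. As a minimal sketch, assuming a predictor has several independent probability estimates for the same genome-class pair (for example from bootstrap resampling or an ensemble of models), the reported probability and SD could be computed as follows. The function name `summarize` and the example values are hypothetical.

```python
import statistics

def summarize(estimates):
    """Return (probability, sd) from repeated probability estimates.

    Assumption: the mean of the estimates is reported as the probability
    and their population standard deviation as the confidence value.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    # Keep every estimate inside the allowed 0-1 range.
    clipped = [min(1.0, max(0.0, e)) for e in estimates]
    mean = statistics.fmean(clipped)
    sd = statistics.pstdev(clipped)  # 0.0 when only one estimate is given
    return mean, sd

# Hypothetical estimates that one genome belongs to the eye disorders class.
eye_p, eye_sd = summarize([0.82, 0.75, 0.90, 0.79])
print(f"{eye_p:.3f}\t{eye_sd:.3f}")
```

Closely agreeing estimates give a low SD (high confidence), while widely scattered estimates give a high SD (low confidence), matching the convention described above.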

Related Challenges
Sick Kids4

 

Prediction submission format
The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

 

  • ID - genome sequence identifier
  • Eye-P - Disease class: Probability (0-1; 0 = no match, 1 = match) that a genome sequence matches the eye disorders disease class
  • Eye-SD - standard deviation of the previous column indicating the confidence of the prediction
  • Neuro-P - Disease class: Probability (0-1; 0 = no match, 1 = match) that a genome sequence matches the neurogenetic diseases class
  • Neuro-SD - standard deviation of the previous column indicating the confidence of the prediction
  • Connective-P - Disease class: Probability (0-1; 0 = no match, 1 = match) that a genome sequence matches the connective tissue diseases class
  • Connective-SD - standard deviation of the previous column indicating the confidence of the prediction
  • Phenotypes-P - Clinical class: Indicate the probability that the sample matches each of the clinical phenotypic descriptions. Provide a list of 24 values separated by commas to reflect probability of match to the corresponding description. The 24 values must sum to 1.0.
  • Phenotypes-SD - Standard deviation of each value in the previous column, indicating the confidence of the prediction. Provide a list of 24 values separated by commas to reflect confidence of the match to the corresponding description.
  • DV - indicate diagnostic variant(s). Variants should be reported in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated.
  • PSV - indicate predictive secondary variant(s). Variants should be reported in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated.
  • Comment - optional comment on the basis of the predictions as well as comments on DV and PSV columns.
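The organizers' validation script is the authoritative format check. The sketch below is not that script; it only illustrates the constraints listed above on a single row. It assumes that the columns appear in the order listed, that an unfilled optional cell keeps the template's "*", and that alleles in CHROM:POS:REF:ALT are written as plain nucleotide strings. The regular expression and the function name `check_row` are hypothetical.

```python
import math
import re

N_DESCRIPTIONS = 24
# Assumed column order, taken from the list above.
COLUMNS = ["ID", "Eye-P", "Eye-SD", "Neuro-P", "Neuro-SD",
           "Connective-P", "Connective-SD", "Phenotypes-P",
           "Phenotypes-SD", "DV", "PSV", "Comment"]
# Assumed variant pattern: chromosome, position, REF allele, ALT allele.
VARIANT_RE = re.compile(
    r"^(chr)?([0-9]{1,2}|X|Y|M|MT):[0-9]+:[ACGTN]+:[ACGTN]+$")

def check_row(line):
    """Return a list of problems found in one tab-delimited row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        return [f"expected {len(COLUMNS)} fields, found {len(fields)}"]
    row = dict(zip(COLUMNS, fields))
    problems = [f"{name}: empty cell" for name, v in row.items() if v == ""]

    # Disease class columns: single probability or standard deviation.
    for name in COLUMNS[1:7]:
        try:
            x = float(row[name])
        except ValueError:
            problems.append(f"{name}: not a number")
            continue
        if name.endswith("-P") and not 0.0 <= x <= 1.0:
            problems.append(f"{name}: outside 0-1")
        if name.endswith("-SD") and not x >= 0.0:
            problems.append(f"{name}: negative or invalid SD")

    # Clinical description columns: 24 comma-separated values each.
    for name in ("Phenotypes-P", "Phenotypes-SD"):
        try:
            values = [float(v) for v in row[name].split(",")]
        except ValueError:
            problems.append(f"{name}: not a list of numbers")
            continue
        if len(values) != N_DESCRIPTIONS:
            problems.append(f"{name}: expected {N_DESCRIPTIONS} values")
        if name == "Phenotypes-P":
            if any(not 0.0 <= v <= 1.0 for v in values):
                problems.append(f"{name}: value outside 0-1")
            if not math.isclose(sum(values), 1.0, abs_tol=1e-6):
                problems.append(f"{name}: values do not sum to 1.0")

    # Variant columns: comma-separated CHROM:POS:REF:ALT, or "*" if unused.
    for name in ("DV", "PSV"):
        if row[name] == "*":
            continue
        for variant in row[name].split(","):
            if not VARIANT_RE.match(variant.strip()):
                problems.append(f"{name}: bad variant '{variant}'")
    return problems
```

A row that passes returns an empty list; each violated constraint adds one message, so several problems in the same row are reported together.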

 

In the template file, the cells in each column are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must enter a probability for each disease class for every genome. Optionally, enter a brief comment on the basis of the prediction; otherwise, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.

Note that every genome sequence matches exactly one clinical phenotypic description. Predictions of predictive secondary variants (PSV) and comments (Comment) are optional.
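The one-to-one constraint can be used when converting raw match scores into the Phenotypes-P values. One possible approach, which is not part of the challenge instructions, is iterative proportional fitting (Sinkhorn normalization): columns and rows of a genome-by-description score matrix are normalized alternately, so that each description's total weight across genomes approaches 1 while each genome's row sums to 1.0 as required. The function name `balance`, the iteration count, and the toy scores are hypothetical.

```python
def balance(scores, iterations=50):
    """Alternately normalize columns and rows of a square score matrix.

    scores[i][j] is a non-negative raw score that genome i matches
    description j. The final step normalizes rows, so every returned
    row sums to 1.0.
    """
    n = len(scores)
    # A small floor keeps every entry positive so no sum is zero.
    m = [[max(float(v), 1e-12) for v in row] for row in scores]
    for _ in range(iterations):
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        row_sums = [sum(row) for row in m]
        m = [[v / s for v in row] for row, s in zip(m, row_sums)]
    return m

# Toy 2 x 2 example; the real matrix would be 24 x 24.
raw = [[0.9, 0.6],
       [0.8, 0.1]]
for row in balance(raw):
    print(",".join(f"{v:.3f}" for v in row))
```

In the toy example both genomes initially favour description 1. After balancing, genome 1 shifts toward description 2, because genome 2 has almost no support for description 2 and description 1 can be matched only once.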

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Download genome sequence dataset
This dataset file is available only to registered users. Please log in to access the file.
The file structure of the sequence dataset is described HERE
Download clinical phenotypic descriptions
This dataset file is available only to registered users. Please log in to access the file.

Download submission template
This submission template file is available only to registered users. Please log in to access the file.

Download submission validation script
This submission validation script is available only to registered users. Please log in to access the file.

Data provided by
Stephen Meyn, Christian Marshall, D. James Stavropoulos, Sarah Bowdin and Nasim Monfared, on behalf of the SickKids Genome Clinic, a project of the Centre for Genetic Medicine of the Hospital for Sick Children, Toronto.

Revision history
22 Dec 2017 (v01): initial release
9 Jan 2018 (v02): Link to VCF files updated