Predict patients’ clinical descriptions and pathogenic variants from gene panel sequences

Challenge: ID Panel

Dataset description: public

Dataset availability: encrypted

Category: clinical sequence

Last updated: 24 April 2018

This challenge closes at 8:00 PM PST (Pacific Standard Time) on 30 April 2018.

This challenge is closed - encrypted data is not available.

Download answer key and predictions: The answer key and predictions are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the file.

Presentations from the CAGI 5 conference: Presentations are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the file.

Summary 

The challenge presented here is to use computational methods to predict a patient’s clinical phenotype and the causal variant(s) based on analysis of their gene panel sequence data. Sequence data for 74 genes associated with intellectual disability (ID) and/or autism spectrum disorders (ASD) from a cohort of 150 patients with a range of neurodevelopmental presentations (ID, autism, epilepsy, etc.) have been made available for this challenge. For each patient, predictors must report the causative variants and which of seven phenotypes are present.

Background 

When a patient’s gene is sequenced and known disease-causing variants consistent with the clinical presentation are identified, analysis is straightforward. However, in many cases a patient will have one or more novel variants that have never been detected before. The classification of novel DNA variants in a gene is a difficult and incompletely solved problem. In the most simplified paradigm, variants would be classified as Pathogenic (disease causing), Uncertain (may or may not contribute to disease), or Not Pathogenic (unlikely to contribute to disease). Care must be taken in the interpretation of variants, as these results are reported back to the physician and can alter a patient’s medical care. As the number of genes sequenced per patient increases, the number of novel, unclassified variants also increases. Clinical molecular geneticists must determine which variants, if any, are likely to contribute to the patient’s clinical presentation. The current gold standards for assessing a variant’s pathogenicity are segregation of the variant with the clinical phenotype in multiple pedigrees and functional assays demonstrating a detrimental effect of that specific nucleotide change. In most instances, when a novel genetic variant is identified, there is no rapid and reliable method to assess its pathogenicity. Developing computational methods that increase the confidence and accuracy of variant classification is a high priority in clinical genetics.

Padua NDD Diagnostic Laboratory

Neurodevelopmental disorders (NDDs) are common, clinically diverse conditions that are genetically heterogeneous. They are often characterized by significant comorbidities such as intellectual disability (ID), autism spectrum disorders (ASD), and epilepsy. Recent studies reported a co-occurrence of ASD with ID (~70%) and epilepsy (~25%), suggesting shared etiologies among these conditions.

The Padua gene panel for neurodevelopmental disorders was developed with the aim of studying the comorbidities between intellectual disability (ID) and autism spectrum disorders (ASD). The ID-ASD gene panel is recommended as a genetic test within the Molecular Genetics Laboratory for Neurodevelopmental Disorders of the Paediatric Clinic within the University Hospital of Padua. Dr. Emanuela Leonardi oversees the research project. The lab director, Prof. Alessandra Murgia, is an expert on the genetics of fragile X syndrome.

A detailed phenotypic description is available for every patient. Prior to undergoing the gene panel, all patients had negative results on both a CGH array and a fragile X test.

The Padua Diagnostic Laboratory has provided VCF files containing sequence data from 146 patients with a range of clinical presentations (approximately 50% and 35% of the patients have been diagnosed with ASD and epilepsy, respectively) for which physicians ordered genetic testing. A custom Ion AmpliSeq (Ion Torrent, Carlsbad, CA) panel was used to generate target amplicon libraries covering exons and exon-adjacent sequences for 74 genes. Sequence data were processed with the Ion Torrent Suite v5.0, and called variants were annotated with wANNOVAR.

Further information on sequence data processing is available in the headers of the VCF files. The genetic disorders associated with these 74 genes have been grouped into 7 ‘phenotypic traits’: intellectual disability, autism spectrum disorder, epilepsy, microcephaly, macrocephaly or ataxic gait. The ‘phenotypic traits’ for each patient are based directly on information provided by the patient’s physician. Note that each patient can have more than one ‘phenotypic trait’. Causative genes and variants, when identified, are based on the interpretation by the Padua Diagnostic Laboratory.
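Since processing details are recorded in the VCF headers, a quick way to inspect them is to collect the meta-information lines before the first data record. The following is a minimal Python sketch that relies only on the general VCF convention that meta lines start with "##" and the column line starts with "#CHROM". The function name and the example header content are invented for illustration and do not reflect the actual challenge files.

```python
import io

def read_vcf_header(handle):
    """Collect the '##' meta-information lines and the '#CHROM' column names."""
    meta, columns = [], None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])            # drop the leading '##'
        elif line.startswith("#"):
            columns = line[1:].split("\t")   # column header line
            break                            # data records follow
        else:
            break                            # no column line found
    return meta, columns

# Tiny in-memory example; these header lines are hypothetical.
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "##source=hypothetical_caller\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
)
meta, columns = read_vcf_header(example)
for entry in meta:
    print(entry)
```

For a real file, pass an open text handle (for example from `open(path)`, or `gzip.open(path, "rt")` for a compressed file) instead of the in-memory example.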

Prediction challenge

In this challenge predictors are asked to analyze the sequence data for the 74 genes and to predict the ‘phenotypic traits’ of the patient. Additionally, predictors are asked to report the predicted causal variant(s) in each case responsible for each trait thought to be present. (The predicted causal gene(s) are implied by the predicted causal variants.) The submitted prediction should be the probability (0-1; 0 = no disease, 1 = disease) that the patient has a referring phenotypic trait among the 7 ‘phenotypic traits’ in the provided list, and the predicted causal variant(s) from the gene panel sequence dataset. Each predicted probability of phenotypic trait must include a standard deviation indicating confidence in the prediction: low SD indicates high confidence, and high SD indicates low confidence. The predictions of phenotypic trait will be assessed against the information provided by each patient’s physician. The predicted causal variant(s) will be compared to interpretations from the clinical laboratory, but as these are not independently verified, the predictions of causal variants cannot be rigorously assessed.

Prediction submission format 

The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions.

For each patient, give the probability of each phenotypic trait and the predicted causal variant(s) responsible for the patient’s clinical condition. Each row of the submitted file must contain the columns laid out in the template file.

In the template file, cells in columns 2-29 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. Optionally, enter a brief comment on the basis of the prediction; otherwise, leave the "*" in these cells. Please follow the submission guidelines strictly. Patients (rows) and phenotypic traits (columns) for which no predictions are made should be left as "*" throughout. These will be treated as probability zero assignments in the assessment.
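The official validation script remains the authoritative check, but the rules above can be illustrated with a rough pre-check in Python. This sketch assumes 29 tab-separated cells per row with column 1 holding the patient identifier. Which cells hold probabilities is not stated here, so the `prob_columns` argument is a hypothetical parameter that should be taken from the template.

```python
N_COLUMNS = 29  # column 1 plus columns 2-29 described for the template

def check_row(line, prob_columns):
    """Return a list of problems found in one tab-delimited submission row.

    prob_columns: 0-based indices of cells expected to hold probabilities.
    This layout is an assumption; read the real one from the template file.
    """
    cells = line.rstrip("\n").split("\t")
    problems = []
    if len(cells) != N_COLUMNS:
        problems.append(f"expected {N_COLUMNS} cells, found {len(cells)}")
    for i, cell in enumerate(cells):
        if cell.strip() == "":
            problems.append(f"column {i + 1} is empty")
    for i in prob_columns:
        if i >= len(cells) or cells[i].strip() in ("", "*"):
            continue  # '*' means no prediction (assessed as probability zero)
        try:
            value = float(cells[i])
        except ValueError:
            problems.append(f"column {i + 1} is not a number")
            continue
        if not 0.0 <= value <= 1.0:
            problems.append(f"column {i + 1} is outside the 0-1 range")
    return problems

# Hypothetical rows: a patient ID, one probability, one SD, the rest left as '*'.
good_row = "P001\t0.9\t0.1" + "\t*" * 26
bad_row = "P002\t1.5\t\t" + "\t".join(["*"] * 26)

good_problems = check_row(good_row, prob_columns=[1])
bad_problems = check_row(bad_row, prob_columns=[1])
print(good_problems)
print(bad_problems)
```

Returning a list of problems rather than stopping at the first one makes it easier to fix a whole row in a single pass before running the provided validation script.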

Note that each patient may have more than one phenotypic trait, and all have at least one.

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Additional information

The gene capture regions used in sequencing are provided as bed files.
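One use of the bed files is to check whether a candidate variant lies inside a captured region. The Python sketch below assumes standard BED conventions (0-based start, exclusive end) and a 1-based VCF position. The function names, chromosome label, and interval are invented for illustration.

```python
import io

def load_bed(handle):
    """Read BED intervals as (chrom, start, end); start is 0-based, end is exclusive."""
    regions = []
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and optional header lines
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions

def in_capture_region(regions, chrom, pos):
    """True if a 1-based VCF position falls inside any BED interval."""
    zero_based = pos - 1  # convert VCF coordinate to BED coordinate
    return any(c == chrom and s <= zero_based < e for c, s, e in regions)

# Hypothetical single-interval bed file held in memory.
bed = io.StringIO("chrX\t100\t200\texample_amplicon\n")
regions = load_bed(bed)

print(in_capture_region(regions, "chrX", 101))  # first captured base
print(in_capture_region(regions, "chrX", 201))  # one past the interval
```

A linear scan is adequate for a panel of this size; an interval index would only be worth adding for much larger region sets.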

Related Challenges

Hopkins challenge

Sick Kids CAGI4 challenge

Sick Kids CAGI5 challenge

Download dataset 

Download gene panel sequence dataset: This dataset file is available only to registered users. Please log in to access the file.

Download disorder class descriptions: This descriptions file is available only to registered users. Please log in to access the file.

Download genes: This gene list file is available only to registered users. Please log in to access the file.

Download genes bed file: This bed file is available only to registered users. Please log in to access the file.

Download submission template: This submission template file is available only to registered users. Please log in to access the file.

Download validation script: This submission validation script is available only to registered users. Please log in to access the file.

Dataset provided by 

Dr. Emanuela Leonardi, Prof. Alessandra Murgia, Neurodevelopmental Molecular Genetics Laboratory, Department of Women’s and Children’s Health, University of Padua – Hospital of Padua.

Revision history 

22 Dec 2017 (v01): initial release 

28 Dec 2017 (v02): fixed download 

24 Sep 2018 (v03): Dataset availability added