Predict patients’ clinical descriptions and pathogenic variants from gene panel sequences

Challenge: ID Panel
Dataset description: public
Category: Clinical sequence
Last updated: 24 Apr 2018
This challenge will tentatively close at 8:00 PM PST (Pacific Standard Time) on 30 Apr 2018.

Although the challenge has closed, late submissions have occasionally occurred in CAGI. Our policy is that, out of fairness, these cannot be included in the primary assessment by the assessor. However, the assessor will have access to late submissions and may, at their discretion, choose to consider them in parallel with the on-time primary submissions. If the assessor chooses to consider them, the results for late submissions will always be labeled as 'late' and kept distinct, but might be mentioned in presentations and in the publication.

Summary
The challenge presented here is to use computational methods to predict a patient’s clinical phenotype and the causal variant(s) based on analysis of their gene panel sequence data. Sequence data for 74 genes associated with intellectual disability (ID) and/or autism spectrum disorders (ASD) from a cohort of 150 patients with a range of neurodevelopmental presentations (ID, autism, epilepsy, etc.) have been made available for this challenge. For each patient, predictors must report the causative variants and which of seven phenotypes are present.

Background
When a patient’s gene is sequenced and known disease-causing variants consistent with the presentation are identified, analysis is straightforward. However, in many cases a patient will have one or more novel variants that have never been detected before. The classification of novel DNA variants in a gene is a difficult and incompletely solved problem. In the most simplified paradigm, variants would be classified as one of: Pathogenic (disease causing), Uncertain (may or may not contribute to disease), or Not Pathogenic (unlikely to contribute to disease). Care must be taken in the interpretation of variants, as these results are reported back to the physician and can alter a patient’s medical care. As the number of genes sequenced per patient increases, the number of novel, unclassified variants also increases. Clinical molecular geneticists must determine which variants, if any, are likely to contribute to the patient’s clinical presentation. The current gold standards for assessing a variant’s pathogenicity are segregation of the variant with the clinical phenotype in multiple pedigrees and functional assays demonstrating a detrimental effect of that specific nucleotide change. In most instances, when a novel genetic variant is identified, there is no rapid and reliable method to assess its pathogenicity. Development of improved computational methods to increase the confidence and accuracy of variant classification tools is a high priority in clinical genetics.

Padua NDD Diagnostic Laboratory
Neurodevelopmental disorders (NDDs) are common, clinically diverse conditions that are genetically heterogeneous. They are often characterized by significant comorbidities such as intellectual disability (ID), autism spectrum disorders (ASD), and epilepsy. Recent studies reported a co-occurrence of ASD with ID (~70%) and with epilepsy (~25%), suggesting shared etiologies among these conditions.

The Padua gene panel for neurodevelopmental disorders was developed with the aim of studying the comorbidities between intellectual disability (ID) and autism spectrum disorders (ASD). The ID-ASD gene panel is recommended as a genetic test within the Molecular Genetics Laboratory for Neurodevelopmental Disorders of the Paediatric Clinic within the University Hospital of Padua. Dr. Emanuela Leonardi oversees the research project. The lab director, Prof. Alessandra Murgia, is an expert on the genetics of fragile X syndrome.

A detailed phenotypic description is available for every patient. Prior to undergoing the gene panel, all patients had tested negative on both a CGH array and a fragile X test.

The Padua Diagnostic Laboratory has provided VCF files containing sequence data from 146 patients with a range of clinical presentations (approximately 50% and 35% of the patients have been diagnosed with ASD and epilepsy, respectively) for which physicians ordered genetic testing. A custom Ion AmpliSeq (Ion Torrent, Carlsbad, CA) panel was used to generate target amplicon libraries covering exons and exon-adjacent sequences for 74 genes. Sequence data were processed with the Ion Torrent Suite v5.0, and called variants were annotated with wANNOVAR.

Further information on sequence data processing is available in the headers of the VCF files. The genetic disorders associated with these 74 genes have been grouped into 7 ‘phenotypic traits’: intellectual disability, autism spectrum disorder, epilepsy, microcephaly, macrocephaly, hypotonia, and ataxic gait. The ‘phenotypic traits’ for each patient are based directly on information provided by the patient’s physician. Note that each patient can have more than one ‘phenotypic trait’. Causative genes and variants, when identified, are based on the interpretation by the Padua Diagnostic Laboratory.
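
As a rough illustration of how the VCF headers mentioned above could be inspected, the following Python sketch collects the "##" meta lines and the "#CHROM" column line from an open text handle. The example content (caller name, sample name, coordinates) is hypothetical and is not taken from the challenge files.

```python
import io

def read_vcf_header(handle):
    """Collect '##' meta lines and the '#CHROM' column names from a VCF text handle."""
    meta, columns = [], []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])            # drop the leading '##'
        elif line.startswith("#CHROM"):
            columns = line.lstrip("#").split("\t")
            break                            # data records start after this line
        else:
            break                            # no column line found before data
    return meta, columns

# Hypothetical VCF content, for illustration only.
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "##source=hypothetical_caller\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tpatient_001\n"
    "chrX\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n"
)
meta, columns = read_vcf_header(example)
for entry in meta:
    print(entry)
print(columns)
```

For real files, the same function can be called on a handle returned by open() (or gzip.open() in text mode for compressed files).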

Prediction challenge
In this challenge, predictors are asked to analyze the sequence data for the 74 genes and to predict the ‘phenotypic traits’ of the patient. Additionally, predictors are asked to report the predicted causal variant(s) in each case responsible for each trait thought to be present. (The predicted causal gene(s) are implied by the predicted causal variants.) The submitted prediction should be, for each of the 7 ‘phenotypic traits’ in the provided list, the probability (0-1; 0 = no disease, 1 = disease) that the patient has that trait, together with the predicted causal variant(s) from the gene panel sequence dataset. Each predicted probability of a phenotypic trait must include a standard deviation indicating confidence in the prediction: low SD indicates high confidence, and high SD indicates low confidence. The predictions of phenotypic traits will be assessed against the information provided by each patient’s physician. The predicted causal variant(s) will be compared to interpretations from the clinical laboratory, but as these are not independently verified, the predictions of causal variants cannot be rigorously assessed.
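
The challenge does not prescribe how the standard deviation should be derived. One possible approach, shown here only as a sketch under that assumption, is to run several models (or resampled versions of one model) for a given patient and trait, then report the mean score as the probability and the spread of the scores as the SD. The function name and the example scores are hypothetical.

```python
from statistics import mean, pstdev

def summarize_scores(scores):
    """Return (probability, sd) from several scores for one patient and one trait.

    Assumption: each score is an estimate of the trait probability; values
    outside [0, 1] are clipped before averaging.
    """
    if not scores:
        raise ValueError("need at least one score")
    clipped = [min(1.0, max(0.0, s)) for s in scores]
    return mean(clipped), pstdev(clipped)

# Three hypothetical model outputs for the same patient and trait.
p, sd = summarize_scores([0.8, 0.7, 0.9])
print(f"probability={p:.3f} sd={sd:.3f}")
```

Scores that agree closely give a small SD (high confidence), while scores that disagree give a large SD (low confidence), matching the convention described above.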

Prediction submission format
The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions.

For each patient, give the probability of each phenotypic trait and the predicted causal variant(s) responsible for the patient’s clinical condition. In the submitted file, each row includes the following columns:

  1. ID: patient ID
  2. ID-P: Disease class: Probability (0-1; 0 = no match, 1 = match) that the genome sequence is from a patient with intellectual disability.
  3. ID-SD: standard deviation of the previous column indicating the confidence of the prediction
  4. ID-V: indicate diagnostic variant(s). Variants should be reported in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated
  5. ID-C: optional brief comment on the basis of the prediction
  6. ASD-P: Disease class: Probability (0-1; 0 = no match, 1 = match) that the genome sequence is from a patient with autism spectrum disorder.
  7. ASD-SD: standard deviation of the previous column indicating the confidence of the prediction
  8. ASD-V: indicate diagnostic variant(s). Variants should be reported in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated
  9. ASD-C: optional brief comment on the basis of the prediction
  10. Epilepsy-P: Disease class: Probability (0-1; 0 = no match, 1 = match) that the genome sequence is from a patient with epilepsy.
  11. Epilepsy-SD: standard deviation of the previous column indicating the confidence of the prediction
  12. Epilepsy-V: indicate diagnostic variant(s). Variants should be reported in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated
  13. Epilepsy-C: optional brief comment on the basis of the prediction
  14. Microcephaly-P: Disease class: Probability (0-1; 0 = no match, 1 = match) that the genome sequence is from a patient with microcephaly.
  15. Microcephaly-SD: standard deviation of the previous column indicating the confidence of the prediction
  16. Microcephaly-V: indicate diagnostic variant(s). Variants should be reported in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated
  17. Microcephaly-C: optional brief comment on the basis of the prediction
  18. Macrocephaly-P: Disease class: Probability (0-1; 0 = no match, 1 = match) that the genome sequence is from a patient with macrocephaly.
  19. Macrocephaly-SD: standard deviation of the previous column indicating the confidence of the prediction
  20. Macrocephaly-V: indicate diagnostic variant(s). Variants should be reported in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated
  21. Macrocephaly-C: optional brief comment on the basis of the prediction
  22. Hypotonia-P: Disease class: Probability (0-1; 0 = no match, 1 = match) that the genome sequence is from a patient with hypotonia.
  23. Hypotonia-SD: standard deviation of the previous column indicating the confidence of the prediction
  24. Hypotonia-V: indicate diagnostic variant(s). Variants should be reported in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated
  25. Hypotonia-C: optional brief comment on the basis of the prediction
  26. Ataxic-P: Disease class: Probability (0-1; 0 = no match, 1 = match) that the genome sequence is from a patient with ataxic gait.
  27. Ataxia-SD: standard deviation of the previous column indicating the confidence of the prediction
  28. Ataxia-V: indicate diagnostic variant(s). Variants should be reported in the format CHROM:POS:REF:ALT. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated
  29. Ataxia-C: optional brief comment on the basis of the prediction
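
The CHROM:POS:REF:ALT format requested for the variant columns can be built directly from the first five tab-separated fields of a VCF data line. The sketch below assumes chromosome names are reported exactly as they appear in the VCF, and that a multi-allelic site (several comma-separated ALT alleles) is split into one entry per allele; the function name and coordinates are hypothetical.

```python
def format_variants(vcf_lines):
    """Turn VCF data lines into a comma-separated list of CHROM:POS:REF:ALT strings."""
    out = []
    for line in vcf_lines:
        # VCF data lines start with CHROM, POS, ID, REF, ALT.
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):      # one entry per ALT allele
            out.append(f"{chrom}:{pos}:{ref}:{allele}")
    return ",".join(out)

# Hypothetical records, for illustration only.
cell = format_variants([
    "chrX\t100\t.\tA\tG,T\t50",
    "chr2\t200\t.\tC\tCT\t60",
])
print(cell)
```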

In the template file, cells in columns 2-29 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. The comment columns are optional: either enter a brief comment on the basis of the prediction or leave the "*" in these cells. Please make sure you follow the submission guidelines strictly. Patients (rows) and phenotypic traits (columns) for which no predictions are made should be left as all "*". These will be treated as probability zero assignments in the assessment.
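
The row layout described above (one ID column followed by four columns for each of the seven traits, with "*" wherever no value is given) can be sketched as follows. This is not the organizers' validation script, which remains the authoritative check; the function names, the dictionary layout, and the patient ID are assumptions made for illustration.

```python
TRAITS = ["ID", "ASD", "Epilepsy", "Microcephaly", "Macrocephaly", "Hypotonia", "Ataxia"]
FIELDS = ["P", "SD", "V", "C"]   # probability, standard deviation, variants, comment
PLACEHOLDER = "*"

def build_row(patient_id, predictions):
    """Build one tab-delimited row; predictions maps trait -> {'P':..., 'SD':..., 'V':..., 'C':...}."""
    cells = [patient_id]
    for trait in TRAITS:
        pred = predictions.get(trait, {})
        for field in FIELDS:
            text = str(pred.get(field, PLACEHOLDER)).strip()
            cells.append(text if text else PLACEHOLDER)   # never emit an empty cell
    return "\t".join(cells)

def check_row(row):
    """Basic sanity checks: 29 non-empty cells and probabilities within 0-1."""
    cells = row.split("\t")
    if len(cells) != 1 + len(TRAITS) * len(FIELDS):
        raise ValueError(f"expected 29 columns, found {len(cells)}")
    if any(not c.strip() for c in cells):
        raise ValueError("empty cells are not allowed")
    for i in range(1, len(cells), len(FIELDS)):           # probability columns
        if cells[i] != PLACEHOLDER and not 0.0 <= float(cells[i]) <= 1.0:
            raise ValueError(f"probability out of range in column {i + 1}")
    return True

row = build_row("patient_001", {"ID": {"P": 0.8, "SD": 0.1, "V": "chrX:100:A:G"}})
check_row(row)
print(row)
```

Traits without a prediction are left as "*" in all four columns, which the assessment treats as probability zero.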

Note that each patient may have more than one phenotypic trait, and all have at least one.

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Additional information
The gene capture regions used in sequencing are provided as BED files.

Related Challenges
Hopkins challenge
Sick Kids CAGI4 challenge
Sick Kids CAGI5 challenge
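
Regarding the capture regions provided as BED files: BED intervals use 0-based, half-open coordinates, whereas VCF positions are 1-based, so a direct comparison needs an offset. The sketch below shows one way to test whether a VCF position falls inside a capture region; the function names and the example interval are hypothetical.

```python
def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} using the first three columns."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue                                   # skip blanks and header lines
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions

def in_capture(regions, chrom, pos):
    """True if the 1-based VCF position lies in any 0-based, half-open BED interval."""
    return any(start < pos <= end for start, end in regions.get(chrom, []))

# Hypothetical interval covering 1-based positions 100 through 200 on chrX.
regions = load_bed(["chrX\t99\t200\tamplicon_1\n"])
print(in_capture(regions, "chrX", 100))
print(in_capture(regions, "chrX", 201))
```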

Downloads
The following files are available only to registered users. Please log in to access them.

  1. Gene panel sequence dataset
  2. Disorder class descriptions
  3. Genes
  4. Genes bed file
  5. Submission template
  6. Validation script

Dataset provided by

Dr. Emanuela Leonardi, Prof. Alessandra Murgia, Neurodevelopmental Molecular Genetics Laboratory, Department of Women’s and Children’s Health, University of Padua – Hospital of Padua

Revision history
22 Dec 2017 (v01): initial release
28 Dec 2017 (v01): fixed download