Predict patients’ clinical descriptions and pathogenic variants from gene panel sequences

Challenge: ID Panel

Dataset availability: encrypted, for registered users only

Last updated: 08 June 2021

This challenge is open. This challenge will close on 31 August 2021.

How to participate in CAGI6?

Download data & submit predictions on Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy


The objective in this challenge is to predict a patient’s clinical phenotype and the causal variant(s) based on their gene panel sequences. Sequence data for 74 genes from a cohort of 500 patients with a range of neurodevelopmental presentations (intellectual disability, autistic spectrum disorder, epilepsy, microcephaly, macrocephaly, hypotonia, ataxia) has been made available for this challenge. Additional data from 150 patients from the same clinical study is available for training and validation.


When a patient’s gene is sequenced and known disease-causing variants consistent with the phenotype are identified, analysis is straightforward. However, in many cases a patient will have one or more novel variants that have never been detected before. The classification of novel variants is a difficult and incompletely solved problem. In the simplest paradigm, variants would be classified as pathogenic (disease-causing), uncertain (may or may not contribute to disease), or not pathogenic (unlikely to contribute to disease). Care must be taken in the interpretation of variants, as these results are reported back to the physician and can alter a patient’s medical care. As the number of genes sequenced per patient increases, the number of novel unclassified variants also increases. Clinical molecular geneticists must determine which variants, if any, are likely to contribute to the patient’s clinical presentation. The current gold standard for assessing a variant’s pathogenicity is co-segregation of the variant with the clinical phenotype in multiple pedigrees, together with functional assays demonstrating a detrimental effect of that specific nucleotide change. In most instances, when a novel genetic variant is identified there is no rapid and reliable method to assess its pathogenicity. Development of improved computational methods to increase the confidence and accuracy of variant classification tools is a high priority in clinical genetics.

Neurodevelopmental disorders are common, clinically diverse conditions that are genetically heterogeneous. They are often characterized by significant comorbidities such as intellectual disability (ID), autism spectrum disorder (ASD), and epilepsy. Recent studies reported co-occurrence of ID with ASD (10-40%) and with epilepsy (~25%), suggesting shared etiologies among these conditions (Matson et al., 2009; Robertson et al., 2015).

The ID gene panel is recommended as a genetic test within the Molecular Genetics Laboratory for Neurodevelopmental Disorders of the Department of Woman and Child Health, University of Padova. The Lab, located at the Pediatric Research Institute, follows a quality management system that complies with the European standard ISO 9001:2015. The Lab director, Alessandra Murgia, Medical Director and Professor of Pediatrics and Child Neuropsychiatry, is an expert on the genetics of Fragile X syndrome and related neurodevelopmental disorders. Dr. Emanuela Leonardi, Research Scientist, specializes in the genetics and pathophysiology of neurodevelopmental disorders. A detailed phenotypic description has been recorded for every patient. Prior to undergoing the gene panel, all patients had negative CGH array and Fragile X test results.

The Padova Diagnostic Laboratory has provided VCF files containing sequence data from 500 patients with a range of clinical presentations for which physicians ordered genetic testing. A custom Ion Ampliseq (Ion Torrent, Carlsbad, CA) gene panel was used to generate target amplicon libraries covering exons and exon-adjacent sequences for 74 genes. Sequence data were processed with the Ion Torrent Suite v5.0, and called variants were annotated with wANNOVAR. Further information on sequence data processing is available in the headers of the VCF files (e.g., Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype). Note that the variants have not been filtered, so the VCF files may contain sequencing errors, which should be excluded by evaluating the sequencing or genotype quality parameters.
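Because the variants are unfiltered, predictors will likely want a quality filter before any downstream analysis. The following is a minimal sketch, not the laboratory's procedure: it assumes single-sample VCF data lines whose FORMAT column includes a GQ key, and the thresholds `min_qual` and `min_gq` are illustrative values chosen here, not recommendations from the organizers. Records with a missing QUAL or GQ value are kept, which is also an assumption.

```python
def passes_quality(vcf_line, min_qual=30.0, min_gq=20.0):
    """Return True if one VCF data line meets both (assumed) thresholds."""
    fields = vcf_line.rstrip("\r\n").split("\t")
    qual = fields[5]  # QUAL is the sixth fixed VCF column
    if qual != "." and float(qual) < min_qual:
        return False
    if len(fields) < 10:
        return True  # no genotype columns to inspect
    # Pair FORMAT keys (e.g. GT:GQ) with the first sample's values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    gq = sample.get("GQ", ".")
    if gq != "." and float(gq) < min_gq:
        return False
    return True


def filter_vcf_lines(lines, min_qual=30.0, min_gq=20.0):
    """Keep header lines and the data lines that pass the quality check."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header and column lines are always kept
        elif line.strip() and passes_quality(line, min_qual, min_gq):
            kept.append(line)
    return kept
```

Suitable cut-offs depend on the sequencing platform and coverage, so the defaults above should be tuned against the training data rather than taken as given.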

The genetic disorders associated with these 74 genes have been grouped into 7 phenotypic traits (intellectual disability, autism spectrum disorder, epilepsy, microcephaly, macrocephaly, hypotonia, ataxia). The phenotypic traits for each patient are based directly on information provided by the patient’s physician. Each patient can have one or more phenotypic traits. Causative genes and variants, when identified, are based on the interpretation by the Padova Diagnostic Laboratory.

Prediction challenge

In this challenge, predictors are asked to analyze the sequence data for the 74 genes and to predict the phenotypic traits of each patient. Predictors are additionally asked to identify the causal variant(s) responsible for each trait thought to be present in each case, with the predicted causal genes implied by the predicted causal variants. The submission should give, for each phenotypic trait in the provided list, a probability in the [0, 1] range (0 = no disease, 1 = disease) that the patient has that trait, together with the predicted causal variant(s) from the gene panel sequence dataset. Each prediction score for a phenotypic trait must include a standard deviation (SD) indicating confidence in the prediction. Low SD indicates high confidence, and high SD indicates low confidence. The predictions of phenotypic traits will be assessed against the information provided by each patient’s physician. The predicted causal variant(s) will be compared to interpretations from the clinical laboratory, but as these are not independently verified, the predictions of causal variants cannot be rigorously assessed.
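The challenge does not prescribe how the SD should be derived. One possible approach, sketched below under the assumption that a method yields several replicate probabilities per patient and trait (for example from bootstrap resampling or an ensemble), is to report their mean as the probability and their spread as the SD. The function name `summarize_trait` is hypothetical.

```python
from statistics import mean, pstdev


def summarize_trait(replicate_probs):
    """Collapse replicate probabilities for one trait into (probability, SD)."""
    if not replicate_probs:
        raise ValueError("at least one replicate probability is required")
    for p in replicate_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of [0, 1] range: {p}")
    # Population SD is used so that a single replicate gives SD = 0.
    return mean(replicate_probs), pstdev(replicate_probs)
```

Replicates that agree closely give a small SD (high confidence), while replicates that disagree give a large SD (low confidence), which matches the convention described above.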

Prediction submission format 

The prediction submission is a tab-delimited text file. The organizers provide a template file, which must be used for submission. A validation script is also provided, and predictors must use it to check that the format is correct before submitting their predictions.

For each patient, the predictors are asked to give the probability of each phenotypic trait and the predicted causal variant(s) responsible for the patient’s clinical condition. In the submitted file, each row includes the following columns:

In the template file, cells in columns 2-29 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. Optionally, enter a brief comment on the basis of the prediction; otherwise, leave the "*" in these cells. Please follow the submission guidelines strictly. For patients (rows) and phenotypic traits (columns) for which no prediction is made, leave every cell as "*". These cells will be treated as probability zero assignments in the assessment.
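The rules above (tab-delimited rows, no empty cells, "*" read as probability zero) can be checked locally before running the official validation script. The sketch below is only a quick pre-check and does not replace that script; it makes no assumption about which column holds which value, and the function names are hypothetical.

```python
def probability_from_cell(cell):
    """Interpret one probability cell; an untouched '*' counts as 0.0."""
    text = cell.strip()
    if text == "*":
        return 0.0
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"probability out of [0, 1] range: {value}")
    return value


def find_empty_cells(lines):
    """Return 1-based (row, column) pairs for every empty cell."""
    problems = []
    for row_number, line in enumerate(lines, start=1):
        cells = line.rstrip("\r\n").split("\t")
        for column_number, cell in enumerate(cells, start=1):
            if cell.strip() == "":
                problems.append((row_number, column_number))
    return problems
```

For example, `find_empty_cells(open(path).readlines())` returns an empty list for a file that has no blank cells.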

Note that each patient may have more than one phenotypic trait, and all have at least one. Also note that all variants should be given relative to the reference genome GRCh37.

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

File naming

CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; for each model, the last submission before the deadline will be evaluated.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

To include a description of your method, use the following filename: <teamname>_desc.*

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
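The naming rules above translate directly into regular expressions. The sketch below is an illustration rather than an official checker; in particular, it assumes a team name may be any non-empty string, since the allowed characters are not specified.

```python
import re

# <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
PREDICTION_NAME = re.compile(r"(?P<team>.+)_model_(?P<model>[1-6])\.(?:tsv|txt)")
# <teamname>_desc.* (any extension)
DESCRIPTION_NAME = re.compile(r"(?P<team>.+)_desc\..+")


def classify_filename(name):
    """Return (kind, team, model) for a valid name, or None otherwise."""
    match = PREDICTION_NAME.fullmatch(name)
    if match:
        return ("prediction", match.group("team"), int(match.group("model")))
    match = DESCRIPTION_NAME.fullmatch(name)
    if match:
        return ("description", match.group("team"), None)
    return None
```

With this sketch, `classify_filename("bestincagi_model_3.txt")` identifies a prediction file for model 3, while a name such as `bestincagi_model_7.txt` is rejected because only models 1 to 6 are allowed.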

Training data

Data from the CAGI5 ID panel challenge is available to predictors for training. This dataset has been published by Aspromonte MC et al. (2020).

Additional information

The gene capture regions used in sequencing are provided as BED files. Each file contains the following fields:

    Field         Type              Description
    chrom         String            The name of the chromosome
    chromStart    Unsigned int64    The starting position of the feature (zero-based)
    chromEnd      Unsigned int64    The ending position of the feature (not inclusive)
    AmpliconID    String            Amplicon ID
    ID            String            A customer-specified ID. If missing, set to '.'. This field is not used.
    GeneSymbol    String            The gene name
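Because chromStart is zero-based and chromEnd is not inclusive, while VCF positions are one-based, matching variants to capture regions needs an off-by-one adjustment. The sketch below illustrates this; it assumes the BED lines are tab-separated with the six fields in the order listed above and no header lines, and the names `Amplicon`, `parse_bed_line`, and `covers` are hypothetical.

```python
from typing import NamedTuple


class Amplicon(NamedTuple):
    chrom: str
    start: int        # chromStart: zero-based, inclusive
    end: int          # chromEnd: zero-based, not inclusive
    amplicon_id: str
    custom_id: str    # unused field, '.' when missing
    gene: str


def parse_bed_line(line):
    """Parse one six-field, tab-separated BED line into an Amplicon."""
    chrom, start, end, amplicon_id, custom_id, gene = (
        line.rstrip("\r\n").split("\t")[:6]
    )
    return Amplicon(chrom, int(start), int(end), amplicon_id, custom_id, gene)


def covers(amplicon, chrom, vcf_pos):
    """True if a one-based VCF position lies inside the amplicon."""
    # One-based position p corresponds to zero-based p - 1, so
    # start <= p - 1 < end is equivalent to start < p <= end.
    return amplicon.chrom == chrom and amplicon.start < vcf_pos <= amplicon.end
```

For a region with chromStart 100 and chromEnd 200, the covered VCF positions are 101 through 200.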

Ethical considerations

The data in this challenge are derived from patients with a range of neurodevelopmental disorders whose underlying etiologies remain poorly understood. As with other forms of disability, individuals identified as neuroatypical may be subject to stigmatization and/or discrimination. However, because these patients already have clinical diagnoses, any risks posed by confirmation of the (known) clinical phenotypes are likely minimal. Predictors are nevertheless encouraged to use care in maintaining the privacy and confidentiality of the data given the sensitive nature of such diagnoses. The implications for patients (or family members) of predicting previously unrecognized clinical phenotypes are less clear. In either case, identification of putative causal variant(s) may provide information about disease biology with potential relevance to clinical care or other therapeutic interventions.

Dataset provided by

Alessandra Murgia, Emanuela Leonardi, Maria Cristina Aspromonte, Molecular Genetics of Neurodevelopmental Disorders, Department of Woman and Child Health, University of Padova, Italy


References

Aspromonte MC, et al. Characterization of intellectual disability and autism comorbidity through gene panel sequencing. Hum Mutat (2020) 40(9):1346-1363.

Carraro M, et al. Assessment of patient clinical descriptions and pathogenic variants from gene panel sequences in the CAGI-5 intellectual disability challenge. Hum Mutat (2019) 40(9):1330-1345.

Matson JL, Shoemaker M. Intellectual disability and its relationship to autism spectrum disorders. Res Dev Disabil (2009) 30(6):1107-1114.

Robertson J, et al. Prevalence of epilepsy among people with intellectual disabilities: a systematic review. Seizure (2015) 29:46-62.

Revision history 

03 May 2021: initial release 

24 May 2021: ethical considerations added

08 June 2021: challenge opens