
Predict patients’ clinical descriptions and pathogenic variants from gene panel sequences

Challenge: Hopkins clinical panel

Dataset description: public

Variant data: registered users only, limited by CAGI Data Use Agreement

Last updated: 7 June 2016

This challenge closed at 9:00 PM PST (Pacific Standard Time) on 18 January 2016.

Download answer key, predictions, and assessment: The answer key, predictions, and assessment files are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the files.

Presentations from the CAGI 4 conference: Presentations are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the files.


DNA sequencing tests are increasingly used in medical practice to confirm clinical diagnoses. However, the interpretation and classification of novel sequence variants identified in a patient remains difficult, even for well-studied disorders like cystic fibrosis. Improved computational methods may aid in the interpretation of sequence variants and, when used in conjunction with clinical data, could increase the confidence of a diagnosis. Until recently, genetic testing was limited to genes associated with a specific clinical phenotype. However, recent technological advances have made it feasible to sequence large gene panels, whole exomes, and genomes. This shift towards analyzing large data sets has led to a need for high-throughput methods to assess and classify variants. The challenge presented here is to use computational methods to predict a patient’s clinical phenotype and the causal variant(s) based solely on analysis of their gene panel sequence data. Sequencing data for 83 genes from a cohort of 106 patients with a range of clinical presentations have been made available for this challenge. 


When a patient’s gene is sequenced and known disease-causing variants are identified, analysis is straightforward. However, in many cases a patient will have one or more novel variants that have never been detected before. The classification of novel DNA variants in a gene is a difficult and incompletely solved problem. In the most simplified paradigm, variants would be classified as one of: Pathogenic (disease-causing), Uncertain (may or may not contribute to disease), or Not Pathogenic (unlikely to contribute to disease). Care must be taken in the interpretation of variants, as these results are reported back to the physician and can alter a patient’s medical care.

As the number of genes sequenced per patient increases, the number of novel, unclassified variants also increases. Clinical molecular geneticists must determine which variants, if any, are likely to contribute to the patient’s clinical presentation. The current gold standards for assessing a variant’s pathogenicity are: assortment of the variant with the clinical phenotype in multiple pedigrees, and functional assays demonstrating a detrimental effect of that specific nucleotide change. In most instances, when a novel genetic variant is identified there is no rapid and reliable method to assess its pathogenicity. Predictive software tools have been developed, but none provide strong evidence to classify a novel variant’s pathogenicity. Development of improved computational methods to increase the confidence and accuracy of variant classification tools is a high priority in clinical genetics.

Hopkins DNA Diagnostic Laboratory

The Johns Hopkins DNA Diagnostic Laboratory was established in 1979. The lab is a CLIA- and CAP-certified clinical genetics testing laboratory, licensed in Maryland, New York, and Pennsylvania, that specializes in rare, inherited disorders. The lab directors are ABMGG board-certified in Clinical Molecular Genetics and Clinical Biochemical Genetics. Dr. Garry R. Cutting, Medical Director and Professor of Pediatrics and Medical Genetics, is an expert in the genetic and non-genetic causes of disease variation in cystic fibrosis. Dr. Steven Jeffrey Steinberg, Technical Director and Assistant Professor of Neurology and Genetics, specializes in the genetics and biochemistry of peroxisomal disorders. The lab offers testing for approximately 50 phenotypes and disorders, totaling 3,500 tests annually. Techniques used by the lab for mutation detection include Next Generation Sequencing (NGS), Sanger sequencing, mass spectrometry genotyping, methylation analysis, PCR/fragment analysis, and multiplex ligation-dependent probe amplification (MLPA).

The Johns Hopkins DNA Diagnostic Laboratory has provided vcf files containing sequence data from 106 patients with a range of clinical presentations for which physicians ordered genetic testing. The panel used to test these patients assesses exons and exon-adjacent sequences for 83 genes. Gene sequences were captured using one of two custom probe sets (Agilent SureSelectXT Target Enrichment Kit) and analyzed by next generation sequencing (Illumina MiSeq, 2x100 nt reads); bed files describing the capture regions are provided below. Sequence variants were called individually for each patient to produce two vcf files, one for single nucleotide variants (SNVs; GATK UnifiedGenotyper, v2.7-4) and one for insertion-deletion variants (InDels; GATK HaplotypeCaller, v2.7-4). For convenience, the vcf files from the 106 patients were combined into two files, one each for SNVs and InDels. Further information on sequence data processing is available in the headers of the vcf files. The genetic disorders associated with these 83 genes have been grouped into 14 ‘disease classes’, which include lung disorders, peroxisomal disorders, aneurysm disorders, and craniofacial disorders. More detailed information on these disease classes can be found in the pdf file provided below.
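As a starting point for working with the combined vcf files, the following minimal sketch lists, for each patient column, the non-reference alleles that patient carries. It assumes the standard VCF 4.x column layout (eight fixed columns, then FORMAT, then one column per sample) and that a GT entry is present in FORMAT; the function name `variants_by_patient` and the example records are hypothetical and are not taken from the challenge data.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def variants_by_patient(vcf_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each sample name to the CHROM:POS:REF:ALT keys of the ALT alleles it carries.

    Assumes standard VCF columns; records without a GT entry are skipped.
    """
    samples: List[str] = []
    carried: Dict[str, List[str]] = defaultdict(list)
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample (patient) names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = cols[:5]
        alts = alt.split(",")  # multi-allelic sites list several ALT alleles
        fmt = cols[8].split(":")
        if "GT" not in fmt:
            continue
        gt_index = fmt.index("GT")
        for sample, cell in zip(samples, cols[9:]):
            genotype = cell.split(":")[gt_index]
            # Alleles are separated by "/" (unphased) or "|" (phased); "." is missing.
            for allele in sorted(set(re.split(r"[/|]", genotype))):
                if allele.isdigit() and int(allele) > 0:
                    carried[sample].append(f"{chrom}:{pos}:{ref}:{alts[int(allele) - 1]}")
    return dict(carried)


# Hypothetical records for illustration only (not real challenge data).
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP1\tP2",
    "7\t100\t.\tG\tA,T\t50\tPASS\t.\tGT:DP\t0/1:30\t1/2:25",
    "12\t200\t.\tCT\tC\t50\tPASS\t.\tGT\t0/0\t./.",
]
result = variants_by_patient(example)
```

The keys are built in the same CHROM:POS:REF:ALT form that the submission format asks for, so they can be reused when reporting predicted causal variants.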

The panel of 83 genes was sequenced in all 106 patients, but only those genes related to each patient’s specific disorder were analyzed by the clinical laboratory (n=1-24 genes/patient). In the traditional clinical genetics approach, data about the patient’s symptoms and family history of disease are used to guide analysis of sequence data. In this challenge, however, you will be given sequence data for all 83 genes and asked to predict the disease class of the patient without any additional information. A further part of the challenge will be to predict the specific gene and variant(s) that is/are the underlying cause of disease. The disease class of each patient is based directly on information provided by the patient’s physician. Causative genes and variants, when identified, are based on the interpretation by the Johns Hopkins DNA Diagnostic Laboratory. 

Prediction challenge

Participants are asked to submit predictions of each patient’s disease class from their gene panel sequences. In addition, report the predicted causal variant(s) in each case. (The predicted causal gene(s) are implied by the predicted causal variants.) The submitted prediction should be the probability (0-1; 0 = no disease, 1 = disease) that the patient has a referring disease in each of the 14 disease classes in the provided list, and the predicted causal variant(s) from the gene panel sequence dataset. Each predicted probability of disease class must include a standard deviation indicating confidence in the prediction: low SD indicates high confidence, and high SD indicates low confidence. The predictions of disease class will be assessed against the information provided by each patient’s physician. The predicted causal variant(s) will be compared to interpretations from the clinical laboratory, but these are not known with certainty, so the predictions of causal variants cannot be rigorously assessed.
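The challenge does not prescribe how the probability and its standard deviation should be derived. One possible approach, sketched below purely as an illustration, is to collect several probability estimates for the same patient and disease class (for example from bootstrap replicates or from different models) and report their mean as P and their sample standard deviation as SD. The function name `summarize_estimates` and the example values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_estimates(estimates: Sequence[float]) -> Tuple[float, float]:
    """Return (P, SD) for one patient and one disease class.

    Assumption (not from the challenge text): P is the mean of several
    independent probability estimates and SD is their sample standard deviation.
    """
    if len(estimates) < 2:
        # A single estimate carries no spread, so an SD cannot be computed from it.
        raise ValueError("need at least two estimates to compute a standard deviation")
    if any(not 0.0 <= e <= 1.0 for e in estimates):
        raise ValueError("each estimate must be a probability between 0 and 1")
    return mean(estimates), stdev(estimates)


# Hypothetical estimates: close agreement gives a low SD (high confidence).
p, sd = summarize_estimates([0.2, 0.4])
```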

Download gene panel sequence dataset: This dataset file is available only to registered users. Please log in to access the file.

Download disorder class descriptions: This descriptions file is available only to registered users. Please log in to access the file.

Download submission template: This submission template file is available only to registered users. Please log in to access the file.

Download submission validation script: This validation script file is available only to registered users. Please log in to access the file.

Prediction submission format 

The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

For each patient, give the probability of each disease class and the predicted causal variant(s) for that disease class. In the tab-delimited submission template provided, the first column lists the patient and the first row lists each disease class identifier (1 through 14, see disorder class descriptions file) followed by P (for probability), SD (for standard deviation), V (for variant(s)), or C (for comment). Each "*" should be replaced by a probability P of a disease class (real number, 0 to 1; 0 = no disease, 1 = disease), a standard deviation SD indicating the confidence of the disease class prediction, one or more predicted causal variants from the provided vcf files, and an optional brief comment on the basis of the prediction (otherwise leave the “*” in the comment field). Variants for a patient and predicted disorder should be reported within a single V (variant) field in the format CHROM:POS:REF:ALT. If there are multiple variants for a patient and disorder class, they should be reported within the same field and comma-separated. Patients (rows) and disorder classes (columns) for which no predictions are made should be left all "*". These will be treated as probability zero assignments in the assessment. Note that every patient matches exactly one disease class.
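The cell formatting described above can be sketched as follows. This is an illustration under stated assumptions, not a replacement for the official template or validation script: the helper names are hypothetical, the P, SD, V, C cell order follows the description above, the requirement that SD be non-negative is an assumption, and the example coordinates are made up.

```python
from typing import Iterable, List, Optional, Tuple

Variant = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT), as in the vcf columns

PLACEHOLDER = "*"  # the template's marker for "no prediction"


def format_variant_field(variants: Iterable[Variant]) -> str:
    """Join variants as comma-separated CHROM:POS:REF:ALT, or "*" if there are none."""
    fields = [f"{chrom}:{pos}:{ref}:{alt}" for chrom, pos, ref, alt in variants]
    return ",".join(fields) if fields else PLACEHOLDER


def format_cells(p: float, sd: float,
                 variants: Iterable[Variant] = (),
                 comment: Optional[str] = None) -> List[str]:
    """Return the four cells (P, SD, V, C) for one patient and one disease class."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must be a real number between 0 and 1")
    if sd < 0:
        raise ValueError("SD must be non-negative")  # assumption, see lead-in
    return [f"{p:g}", f"{sd:g}", format_variant_field(variants), comment or PLACEHOLDER]


# Made-up coordinates for illustration only.
cells = format_cells(0.8, 0.1, [("7", 117199644, "ATCT", "A"), ("7", 117227832, "G", "T")])
row_fragment = "\t".join(cells)
```

Running the provided validation script on the finished file remains the authoritative format check.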

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Additional information

The gene capture regions used in sequencing are provided as bed files. Ten patients were sequenced using the “v2” capture regions: P12, P21, P22, P27, P30, P34, P39, P47, P50, P55. All other patients were sequenced using the “v1b” capture regions.
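When relating a variant to the capture regions, the correct bed file must be chosen for each patient, and the coordinate conventions of the two formats differ: bed intervals are 0-based and half-open, while vcf positions are 1-based. The sketch below encodes the patient list given above and a simple position check. The function names are hypothetical, and it assumes that patient identifiers and chromosome names are written the same way in the vcf and bed files.

```python
from typing import Iterable, Tuple

# Patients sequenced with the "v2" capture regions, as listed above.
V2_PATIENTS = frozenset(
    {"P12", "P21", "P22", "P27", "P30", "P34", "P39", "P47", "P50", "P55"}
)

BedInterval = Tuple[str, int, int]  # (chrom, start, end): 0-based start, exclusive end


def capture_version(patient_id: str) -> str:
    """Return which capture design ("v2" or "v1b") applies to a patient."""
    return "v2" if patient_id in V2_PATIENTS else "v1b"


def in_capture_region(chrom: str, pos: int, regions: Iterable[BedInterval]) -> bool:
    """Check whether a 1-based vcf position lies inside any bed interval.

    A 1-based position p corresponds to 0-based p - 1, so it is covered
    when start <= p - 1 < end, which is the same as start < p <= end.
    """
    return any(c == chrom and start < pos <= end for c, start, end in regions)


# Hypothetical interval for illustration only.
regions = [("7", 99, 200)]
covered = in_capture_region("7", 100, regions)
```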

This capture bed file is available only to registered users. Please log in to access the file.

Data provided by

Bethany Buckley, Molly Sheridan, and Garry R. Cutting, The Johns Hopkins University


18 Nov 2015 (v01): initial release

19 Nov 2015 (v02): error in SNV dataset corrected 

20 Nov 2015 (v03): technical details of sequencing and variant calling provided; validation script provided 

26 Feb 2016 (v04): answer key provided 

13 Apr 2016 (v05): predictions and conference presentations provided 

26 Apr 2016 (v06): assessment provided 

7 June 2016 (v07): predictor identities provided