Identify diagnostic variants in children with rare disease from the Rare Genomes Project (RGP)

Challenge: RGP research genomes

Genome data: encrypted, for registered users only

Last updated: 03 May 2021

This challenge will open on 01 June 2021 and will tentatively close on 31 August 2021.


Make sure you understand our Data Use Agreement and Anonymity Policy


The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing for rare disease diagnosis and gene discovery. The study is led by genomics experts and clinicians at the Broad Institute of MIT and Harvard. Research subjects are consented for genomic sequencing and the sharing of their sequence and phenotype information with researchers working to understand the molecular causes of rare disease. When a candidate disease variant believed to be related to the phenotype is identified, the variant is confirmed with Sanger sequencing in a clinical setting and returned to the participant via their local physician. In this challenge, whole genome sequence data and phenotype data from a subset of the solved and unsolved RGP families will be provided. Participants in the challenge will try to identify the causative variant(s) in each case. For the unsolved cases, prioritized variants from the participating teams will be examined to see if additional diagnoses can be made.


One major obstacle facing rare disease patients is simply obtaining a genetic diagnosis (Rehm, 2017). The average “diagnostic odyssey” for rare disease families lasts more than five years, and over 50% of rare disease patients still lack a genetic diagnosis. One well-recognized obstacle to diagnosis is the technical limitations of the testing method, which determine whether the disease-causing variants can be detected. Here, we present whole genome sequencing data, which provides coverage of coding and noncoding regions and has better variant calling for frameshifts and variants in high GC content regions. This challenge focuses on the analytical obstacle to diagnosis: it seeks to evaluate the ability of participants to recognize pathogenic variation within a sea of benign variation. Current standards for variant interpretation have been defined by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) and refined by ClinGen (Richards et al., 2015; Abou Tayoun et al., 2018). Cases caused by structural variation are not included in this challenge.

The Rare Genomes Project 

The Rare Genomes Project (RGP) was launched to study the impact of genome sequencing on rare disease diagnosis, improve access to genomic research for underserved communities, contribute to novel gene discovery, and assess the impact of a genetic diagnosis. The RGP team includes molecular diagnosticians, genetic counselors, clinicians, genomic analysts, computational biologists, software engineers, and project managers. Families in the United States with undiagnosed, but suspected, Mendelian/monogenic diseases apply directly online to the study. Applications are reviewed by the clinical team to confirm a reasonable suspicion for monogenic disease. Subjects are consented using remote video calls, submit blood samples through a local blood draw and mailed kit, and provide medical records. The study generates and analyzes whole genome sequencing data, and variants of clear or potential diagnostic relevance are clinically validated and returned to participants via their local physicians.

Prediction challenge

The prediction challenge involves approximately 30 families. Most families are trios (a proband and both biological parents) or quads (a trio plus an affected sibling). The data set also includes several duos and proband-only cases. The clinical phenotype for each proband is provided as Human Phenotype Ontology (HPO) terms (Robinson & Mundlos, 2010). Participants in this challenge are asked to provide a molecular diagnosis for as many probands as possible; that is, to identify one or more causal variants for each proband.

The set of families in the challenge consists of both “solved” and “unsolved” cases, as determined by the Broad Institute’s research team; solved cases have been confirmed by the local clinician when possible. The solved cases will be used to evaluate the performance of each challenge participant. The unsolved cases have been included so that the CAGI community can identify new, potentially causal variants. Top candidates are likely to undergo further experimental and clinical evaluation and may lead to results that will be returned to the subjects. The CAGI organizers are not disclosing the numbers of solved and unsolved cases, so that participants perform the task in a manner that accurately reflects real clinical case analysis.

The diversity of phenotypes in the dataset represents the range of clinical presentations routinely seen in patients referred for genetic testing. The genomic data were obtained by sequencing DNA purified from blood. Sequencing was performed by the Broad Institute Genomics Platform on an Illumina sequencer to 30x depth. Raw sequence reads were aligned to the reference genome (GRCh38), and variant calling was completed with GATK. Sequence results consist of variant calls in the form of single nucleotide variants (SNVs) and small insertions/deletions (indels) within a joint variant call format (VCF) file. Structural variants are not included in the dataset.
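As a rough illustration of working with a joint VCF for a trio, the sketch below parses one data line and pulls out per-sample genotypes. The sample names (PROBAND, MOTHER, FATHER), the example record, and the de novo check are all hypothetical: the check is one common first-pass filter for trios, not a method prescribed by the challenge, and the real file's sample identifiers and FORMAT fields may differ.

```python
def sample_names_from_header(header_line):
    """Return the sample names from a '#CHROM ...' VCF header line."""
    # Columns 0-8 are the fixed VCF columns; samples start at column 9.
    return header_line.rstrip("\n").split("\t")[9:]


def parse_vcf_line(line, sample_names):
    """Split one VCF data line into site fields and per-sample alleles."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    gt_index = fields[8].split(":").index("GT")
    genotypes = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        gt = sample_field.split(":")[gt_index]
        # Treat phased (|) and unphased (/) genotypes alike.
        genotypes[name] = gt.replace("|", "/").split("/")
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),
        "genotypes": genotypes,
    }


def is_candidate_de_novo(record, proband, mother, father):
    """True if the proband carries a non-reference allele and both parents are 0/0."""
    gts = record["genotypes"]
    proband_has_alt = any(a not in ("0", ".") for a in gts[proband])
    parents_hom_ref = all(
        all(a == "0" for a in gts[parent]) for parent in (mother, father)
    )
    return proband_has_alt and parents_hom_ref


# Hypothetical header and record, for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tPROBAND\tMOTHER\tFATHER"
line = "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30\t0/0:28\t0/0:31"

samples = sample_names_from_header(header)
record = parse_vcf_line(line, samples)
print(is_candidate_de_novo(record, "PROBAND", "MOTHER", "FATHER"))
```

In practice a dedicated VCF library would be a safer choice than hand-parsing, since it handles multi-allelic sites, missing fields, and compressed input.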

Prediction submission format 

The prediction submission is a tab-delimited text file. The CAGI organizers will provide a template file, which must be used for submission. In addition, a validation script will be provided, and predictors must check the correctness of the format before submitting their predictions.

For each family ID, list the proposed variant(s) for each proband. Up to 100 such variants can be listed, each with an estimated probability (P) and standard deviation (SD); for proposed recessive etiologies, include biallelic variation paired with the higher scoring variant. In the submission template provided, the first column lists the genome sequence identifiers, and the first row lists the clinical phenotypic description identifiers followed by CVs (for causal variants), P (for probability), SD (for standard deviation), or C (for comment). Secondary variants (i.e., variants identified as conferring high risk of other disorders) can be added in the C (comment) column as free text. Up to 100 predicted diagnostic groups of diagnostic variants (DVs) from the genome sequence dataset can be given.

Each prediction consists of a probability P that the variant explains the clinical phenotypic description (a real number from 0 to 1, indicating the degree of certainty); a standard deviation SD indicating the confidence of the prediction; optionally, one or more predictive secondary variants (PSV) conferring high risk of other disorders; and an optional comment C on the basis of the predictions. Leave the "*" in any optional field that is not used. Variants should be reported within a single CV field in the format CHROM:POS:REF:ALT:P:SD, separated by semicolons. Positions must be given in GRCh38 coordinates. Multiple variants for a genome-phenotypic description match should be reported within the same field and comma-separated. Genomes (rows) and clinical phenotypic descriptions (columns) for which no predictions are made should be left as "*" throughout. These will be treated as probability zero assignments in the assessment.
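The field rules above can be checked mechanically before submission. The following is a minimal sketch of such a check, not the organizers' validation script: the function name `parse_cv_field` is hypothetical, the allele pattern (A, C, G, T, N only) is an assumption, and because the description mentions both semicolon and comma separators, the sketch accepts either.

```python
import re

ALLELE = re.compile(r"^[ACGTN]+$")  # assumed allele alphabet
MAX_VARIANTS = 100                  # limit stated in the challenge text


def parse_cv_field(field):
    """Parse one CV field into a list of variant dicts.

    Returns [] for the "*" placeholder and raises ValueError on malformed input.
    """
    field = field.strip()
    if field == "*":
        return []
    entries = [e.strip() for e in re.split(r"[;,]", field) if e.strip()]
    if len(entries) > MAX_VARIANTS:
        raise ValueError(f"more than {MAX_VARIANTS} variants listed")
    variants = []
    for entry in entries:
        parts = entry.split(":")
        if len(parts) != 6:
            raise ValueError(f"expected CHROM:POS:REF:ALT:P:SD, got {entry!r}")
        chrom, pos, ref, alt, p, sd = parts
        if not chrom:
            raise ValueError(f"empty chromosome in {entry!r}")
        if not pos.isdigit() or int(pos) < 1:
            raise ValueError(f"bad position in {entry!r}")
        if not ALLELE.match(ref) or not ALLELE.match(alt):
            raise ValueError(f"bad allele in {entry!r}")
        p_val, sd_val = float(p), float(sd)  # float() raises ValueError itself
        if not 0.0 <= p_val <= 1.0:
            raise ValueError(f"P outside [0, 1] in {entry!r}")
        if sd_val < 0:
            raise ValueError(f"negative SD in {entry!r}")
        variants.append(
            {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
             "p": p_val, "sd": sd_val}
        )
    return variants


# Hypothetical field contents, for illustration only.
example = "chr1:12345:A:G:0.9:0.05;chr2:500:CT:C:0.1:0.02"
for variant in parse_cv_field(example):
    print(variant)
```

The validation script supplied by the organizers remains the authority on what is accepted; a local check like this only catches obvious formatting slips early.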

Training data 

Whole-genome sequence data from twenty-five solved families are provided for training purposes (joint VCF and metadata CSV file).

Assessment


Predictions will be assessed by independent assessors. Note that causal variants in the solved cases may not be known with certainty; two different clinical groups could arrive at different diagnoses. This situation will be accounted for in the assessment for both fully computational predictions and other clinical laboratories participating in the challenge.

For previously identified causal variants, assessors will review how often these were the top variant(s) returned (e.g., in the top 5, 10, 20, 50, or 100 variants). For the top-performing teams, genomic analysts on the RGP team will re-review the variants for undiagnosed cases to see if they are diagnostic or merit further evaluation.
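One plausible reading of the top-k review described above is a per-family hit rate: the fraction of solved families whose known causal variant appears among a team's k highest-ranked predictions. The sketch below illustrates that reading only. The function name, the variant key format, and the rule that families with no prediction count as misses are assumptions; the assessors' actual procedure is not specified here.

```python
def top_k_hit_rates(ranked_predictions, known_causal, ks=(5, 10, 20, 50, 100)):
    """Fraction of solved families with a known causal variant in the top k.

    ranked_predictions: dict of family ID -> list of variant keys, best first
                        (for example, sorted by descending probability P).
    known_causal:       dict of family ID -> set of known causal variant keys.
    """
    rates = {}
    for k in ks:
        hits = 0
        for family_id, truth in known_causal.items():
            # Families without predictions contribute an empty list (a miss).
            top = ranked_predictions.get(family_id, [])[:k]
            if any(variant in truth for variant in top):
                hits += 1
        rates[k] = hits / len(known_causal) if known_causal else 0.0
    return rates


# Hypothetical families and variants, for illustration only.
ranked = {
    "FAM1": ["chr1:100:A:G", "chr2:200:C:T"],
    "FAM2": ["chr3:1:A:T", "chr3:2:A:T", "chr3:3:A:T",
             "chr3:4:A:T", "chr3:5:A:T", "chr7:700:G:A"],
}
truth = {
    "FAM1": {"chr1:100:A:G"},
    "FAM2": {"chr7:700:G:A"},
}
print(top_k_hit_rates(ranked, truth, ks=(5, 10)))
```

Because diagnoses in solved cases may themselves be uncertain, a hit rate like this is best read as a comparative measure across teams rather than an absolute accuracy.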

Related challenges

Data provided by the RGP team, including:

Heidi Rehm, PhD, FACMG, RGP Principal Investigator; Anne O’Donnell-Luria, MD, PhD, RGP Medical Director; Melanie O’Leary, CGC, RGP Project Lead

References


Abou Tayoun AN, et al. Recommendations for interpreting the loss of function PVS1 ACMG/AMP variant criterion. Hum Mutat (2018) 39(11):1517-1524.

Rehm HL. Evolving health care through personal genomics. Nat Rev Genet (2017) 18(4):259-267.

Richards S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med (2015) 17(5):405-424.

Robinson PN, Mundlos S. The human phenotype ontology. Clin Genet (2010) 77(6):525-534.

Revision history 

03 May 2021: initial release