Identify diagnostic variants in children with rare disease from the Rare Genomes Project (RGP)

Challenge: RGP research genomes

Genome data: encrypted, for registered users only

Last updated: 12 October 2021

This challenge is closed.

How to participate in CAGI6?

Download data & submit predictions on Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy


The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing for rare disease diagnosis and gene discovery. The study is led by genomics experts and clinicians at the Broad Institute of MIT and Harvard. Research subjects consent to genome sequencing and to the sharing of their sequence and phenotype information with researchers working to understand the molecular causes of rare disease. When a candidate disease variant believed to be related to the phenotype is identified, it is confirmed by Sanger sequencing in a clinical setting and returned to the participant via their local physician. In this challenge, whole genome sequence data and phenotype data from a subset of the solved and unsolved RGP families will be provided. Participants in the challenge will try to identify the causative variant(s) in each case. For the unsolved cases, prioritized variants from the participating teams will be examined to see whether additional diagnoses can be made.


One major obstacle facing rare disease patients is simply obtaining a genetic diagnosis (Rehm, 2017). The average “diagnostic odyssey” for rare disease families lasts more than five years, and over 50% of rare disease patients still lack a genetic diagnosis. One well-recognized obstacle to diagnosis is the technical limitation of the testing method, which determines whether the disease-causing variants can be detected at all. Here, we present whole genome sequencing data, which cover both coding and noncoding regions and give better variant calling for frameshifts and for variants in regions of high GC content. This challenge addresses the analytical obstacle to diagnosis: it seeks to evaluate how well participants can recognize pathogenic variation within a sea of benign variation. Current standards for variant interpretation have been defined by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) and refined by ClinGen (Richards et al., 2015; Abou Tayoun et al., 2018). Cases caused by structural variation are not included in this challenge.

The Rare Genomes Project 

The Rare Genomes Project (RGP) was launched to study the impact of genome sequencing on rare disease diagnosis, improve access to genomic research for underserved communities, contribute to novel gene discovery, and assess the impact of a genetic diagnosis. The RGP team includes molecular diagnosticians, genetic counselors, clinicians, genomic analysts, computational biologists, software engineers, and project managers. Families in the United States with undiagnosed but suspected Mendelian/monogenic diseases apply to the study directly online. Applications are reviewed by the clinical team to confirm a reasonable suspicion of monogenic disease. Subjects give consent over remote video calls, submit blood samples through a local blood draw and mailed kit, and provide medical records. The study generates and analyzes whole genome sequencing data, and variants of clear or potential diagnostic relevance are clinically validated and returned to participants via their local physicians.

Prediction challenge

The prediction challenge involves approximately 30 families. Most families are trios (a proband and both biological parents) or quads (a trio plus an affected sibling). The data set also includes several duos and proband-only cases. The clinical phenotype of each proband is provided as Human Phenotype Ontology (HPO) terms (Robinson & Mundlos, 2010). Participants in this challenge are asked to provide a molecular diagnosis for as many probands as possible; that is, to identify one or more causal variants for each proband.

The set of families in the challenge consists of both “solved” and “unsolved” cases, as determined by the Broad Institute’s research team; solved cases have been confirmed by the local clinician when possible. The solved cases will be used to evaluate the performance of each challenge participant. The unsolved cases have been included so that the CAGI community can identify new, potentially causal variants. Top candidates are likely to undergo further experimental and clinical evaluation and may lead to results that will be returned to the subjects. The CAGI organizers are not disclosing the numbers of solved and unsolved cases, so that participants perform the task in a manner that accurately reflects real clinical case analysis.

The diversity of phenotypes in the dataset represents the range of clinical presentations routinely seen in patients referred for genetic testing. The genomic data were obtained by sequencing DNA purified from blood. Sequencing was performed by the Broad Institute Genomics Platform on an Illumina sequencer to 30x depth. Raw sequence reads were aligned to a reference genome (GRCh38), and variant calling was completed with GATK. Sequence results consist of single nucleotide variants (SNVs) and small insertions/deletions (indels) in a joint Variant Call Format (VCF) file. Structural variants are not included in the dataset.

Prediction submission format 

The prediction submission is a tab-delimited text file. For each proband in the test set, list the proposed causative variant(s) responsible for the phenotype, one variant or pair of variants per line. Each line should use one of the following formats:

PROBANDID:CHROM:POS:REF:ALT:P:SD:C for single causative variants (CVs)

PROBANDID:CHROM:POS:REF:ALT;CHROM:POS:REF:ALT:P:SD:C for proposed biallelic recessive etiologies

Here PROBANDID is the proband identifier, which includes the family ID; P is the estimated probability of a causal relationship (a real number from 0 to 1 indicating the degree of certainty); SD is the standard deviation (a non-negative number reflecting uncertainty about the score P); and C is a comment. The comment is optional; if no comment is available, the field should contain the character *. In the case of recessive etiologies, for example, the comment field can be used to give separate scores for each of the two variants. Each variant CHROM:POS:REF:ALT must be given in GRCh38 coordinates. Up to 100 CV candidates (lines) can be listed for each proband in the test set. An example submission file and a validation script will be provided so that predictors can check the format before submitting predictions.
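As a rough illustration of the line format, the following Python sketch parses and checks a single prediction line. It is not the official validation script. It assumes the colon-separated layout exactly as written in the format strings (the text also calls the file tab-delimited, so the real field separator should be confirmed against the official example file), and the names Prediction, parse_prediction, and over_limit are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Variant = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


@dataclass
class Prediction:
    proband_id: str
    variants: List[Variant]  # one entry, or two for a biallelic pair
    p: float
    sd: float
    comment: str


def _variant(chrom: str, pos: str, ref: str, alt: str) -> Variant:
    if not (chrom and ref and alt):
        raise ValueError("empty CHROM, REF, or ALT")
    return (chrom, int(pos), ref, alt)  # int() rejects a non-numeric POS


def parse_prediction(line: str) -> Prediction:
    """Parse one line, assuming ':' separates fields and ';' joins a pair."""
    line = line.rstrip("\n")
    head = line.split(":", 5)
    if len(head) < 5:
        raise ValueError("too few fields")
    # In the biallelic form the fifth token looks like 'ALT;CHROM'.
    biallelic = ";" in head[4]
    n_fields = 11 if biallelic else 8
    # Limit the split so that a comment containing ':' stays in one piece.
    parts = line.split(":", n_fields - 1)
    if len(parts) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(parts)}")
    if biallelic:
        alt1, chrom2 = parts[4].split(";", 1)
        variants = [
            _variant(parts[1], parts[2], parts[3], alt1),
            _variant(chrom2, parts[5], parts[6], parts[7]),
        ]
        p_text, sd_text, comment = parts[8], parts[9], parts[10]
    else:
        variants = [_variant(parts[1], parts[2], parts[3], parts[4])]
        p_text, sd_text, comment = parts[5], parts[6], parts[7]
    p, sd = float(p_text), float(sd_text)
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must lie between 0 and 1")
    if sd < 0:
        raise ValueError("SD must be non-negative")
    if not comment:
        raise ValueError("use * when there is no comment")
    return Prediction(parts[0], variants, p, sd, comment)


def over_limit(preds: Iterable[Prediction], limit: int = 100) -> List[str]:
    """Return proband IDs that have more than `limit` candidate lines."""
    counts = Counter(pred.proband_id for pred in preds)
    return sorted(pid for pid, n in counts.items() if n > limit)
```

Under these assumptions, parse_prediction("P1:chr1:12345:A:G:0.9:0.05:*") yields one variant, and a line with a `;` between two variants yields a pair. The proband IDs and coordinates used here are made up for illustration.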

File naming

CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; for each model, the last submission before the deadline will be evaluated.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

To include a description for your method(s), use the following filename: <teamname>_desc.*

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
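The naming rule can be checked mechanically before upload. The sketch below is a minimal illustration in Python; the function name model_number is hypothetical, and the assumption that a team name may contain any characters other than path separators is ours, since the allowed characters are not specified here.

```python
import re
from typing import Optional

# <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
# Assumption: the team name is any non-empty string without '/' or '\'.
MODEL_RE = re.compile(r"(?P<team>[^/\\]+)_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)")


def model_number(filename: str) -> Optional[int]:
    """Return the model number (1 to 6) if the name fits the pattern, else None."""
    match = MODEL_RE.fullmatch(filename)
    return int(match.group("model")) if match else None
```

With this pattern, bestincagi_model_3.txt is accepted as model 3, while a model number outside 1 to 6 or an extension other than tsv or txt is rejected.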

Training data 

Whole-genome sequence data from over thirty solved families are provided for training purposes (a joint VCF file and a metadata CSV file).

Handling VCF files 

A number of tools and libraries exist for navigating and manipulating VCF files. 
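For orientation, here is a minimal pure-Python sketch of reading genotypes from a joint VCF and flagging candidate de novo variants in a trio (proband carries an alternate allele, both parents are homozygous reference). It is an illustration only: the sample names in the usage are hypothetical, the filter looks at genotypes alone (no depth, quality, or frequency filtering), and iter_records and candidate_de_novo are names invented for this sketch. For genome-scale files a dedicated tool or library such as bcftools, pysam, or cyvcf2 would usually be preferable.

```python
import re
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Record = Tuple[str, int, str, List[str], Dict[str, str]]


def iter_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (CHROM, POS, REF, [ALT, ...], {sample: GT}) for each data line."""
    samples: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        gt_index = cols[8].split(":").index("GT")
        genotypes = {
            sample: call.split(":")[gt_index]
            for sample, call in zip(samples, cols[9:])
        }
        yield cols[0], int(cols[1]), cols[3], cols[4].split(","), genotypes


def allele_set(gt: str) -> Set[str]:
    """Split a GT string such as '0/1' or '0|2' into its allele indices."""
    return set(re.split(r"[/|]", gt))


def candidate_de_novo(
    records: Iterable[Record], proband: str, mother: str, father: str
) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (CHROM, POS, REF, ALT) carried by the proband but by neither parent."""
    for chrom, pos, ref, alts, gts in records:
        parents_hom_ref = (
            allele_set(gts[mother]) == {"0"} and allele_set(gts[father]) == {"0"}
        )
        carried = allele_set(gts[proband]) - {"0", "."}
        if parents_hom_ref and carried:
            for allele in sorted(carried, key=int):
                yield chrom, pos, ref, alts[int(allele) - 1]
```

Because iter_records accepts any iterable of text lines, it can be fed an open file handle; for a gzip- or bgzip-compressed VCF, gzip.open(path, "rt") from the standard library provides one.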


Assessment

Predictions will be assessed by independent assessors. Note that the causal variants in the solved cases may not be known with certainty; two different clinical groups could arrive at different diagnoses. This will be accounted for in the assessment, both for fully computational predictions and for other clinical laboratories participating in the challenge.

For previously identified causal variants, assessors will review how often these were among the top-ranked variants returned (e.g., in the top 5, 10, 20, 50, or 100 variants). For the top-performing teams, genomic analysts on the RGP team will re-review the variants proposed for undiagnosed cases to see whether they are diagnostic or merit further evaluation.
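One plausible reading of the top-k review is a hit rate: the fraction of solved probands for which a known causal variant appears among a team's k highest-scoring candidates. The Python sketch below illustrates that reading only; the assessors' actual procedure is not specified here, and the function names, the ranking by P, and the tie handling are all assumptions.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def rank_by_probability(preds: Iterable[Tuple[str, float]]) -> List[str]:
    """Order (variant, P) pairs by descending P; ties keep their input order."""
    ordered = sorted(preds, key=lambda item: item[1], reverse=True)
    return [variant for variant, _ in ordered]


def top_k_hit_rate(
    ranked: Dict[str, List[str]],
    truth: Dict[str, Set[str]],
    ks: Sequence[int] = (5, 10, 20, 50, 100),
) -> Dict[int, float]:
    """Fraction of solved probands with a causal variant in the top k."""
    if not truth:
        raise ValueError("no solved cases to evaluate")
    rates = {}
    for k in ks:
        hits = sum(
            1
            for proband, causal in truth.items()
            if causal & set(ranked.get(proband, [])[:k])
        )
        rates[k] = hits / len(truth)
    return rates
```

Here `ranked` maps each proband ID to its candidate variants in rank order and `truth` maps each solved proband to its set of known causal variants, both keyed by whatever variant string (for example CHROM:POS:REF:ALT) is used consistently.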

Ethical considerations 

The data in this challenge are derived from patients with rare, potentially undiagnosed, diseases and their close biological relatives, and may include families who are medically underserved. Identification of putative pathogenic variants (i.e., causal with respect to the clinical phenotype under investigation) may, if confirmed, be important for tailoring clinical interventions and obtaining social services.

The RGP team has told study participants that it does not actively search for genetic results unrelated to the rare condition in the family, but that it can offer clinical confirmation of secondary findings if they are discovered. Predictors are reminded that identifying pathogenic variants related to adult-onset conditions in minors (such as might be identified as secondary variants in this challenge) remains ethically controversial, though some commentators see value in identifying such variants in parents.

Data were provided by the RGP team, including:

Heidi Rehm, PhD, FACMG, RGP Principal Investigator; Anne O’Donnell-Luria, MD, PhD, RGP Medical Director; Melanie O’Leary, CGC, RGP Project Lead


References

Abou Tayoun AN, et al. Recommendations for interpreting the loss of function PVS1 ACMG/AMP variant criterion. Hum Mutat (2018) 39(11):1517-1524.

Rehm HL. Evolving health care through personal genomics. Nat Rev Genet (2017) 18(4):259-267.

Richards S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med (2015) 17(5):405-424.

Robinson PN, Mundlos S. The human phenotype ontology. Clin Genet (2010) 77(6):525-534.

Revision history 

03 May 2021: initial release

26 May 2021: ethical considerations section added, updates on the prediction submission format

08 June 2021: challenge opens

09 June 2021: fixed the link to Synapse web site

13 August 2021: submission deadline extended to September 30

30 September 2021: submission deadline extended to October 11

11 October 2021: challenge closed