Call and identify diagnostic variants in children with rare disease from the Rare Genomes Project (RGP)

Challenge: RGP-cram

Genome data: encrypted, for registered users only

Special considerations: institutional signature required for participation. See below for instructions.

Last updated: 30 November 2025

This challenge will open soon. The challenge closes on February 28, 2026 (preliminary). 

Make sure you understand our Data Use Agreement and Anonymity Policy

Summary 

The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing for rare disease diagnosis and gene discovery led by genomics experts and clinicians at the Broad Institute of MIT and Harvard. In this challenge, mapped reads from short-read genome sequence data and phenotype data from a subset of the solved and unsolved RGP probands will be provided. Participants in the challenge (predictors) will try to identify the causative variant(s) in each proband. For the unsolved probands, prioritized variants from the participating teams will be examined to see if additional genetic diagnoses can be made.

Background 

One major obstacle facing rare disease patients is simply obtaining a genetic diagnosis (Rehm, 2017). The average “diagnostic odyssey” for rare disease families lasts more than five years, and over 50% of rare disease patients still lack a genetic diagnosis (Wojcik et al., 2024). One well-recognized obstacle to diagnosis is technical: the limitations of the testing method determine whether the disease-causing variants can be detected. Here, we present short-read genome sequencing data, which cover coding and noncoding regions and support better calling of frameshifts and of variants in high-GC-content regions than exome sequencing does. This challenge addresses the analytical obstacle to diagnosis: it evaluates the ability of participants to recognize pathogenic variation within a sea of benign variation. Current standards for variant interpretation have been defined by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP; Richards et al., 2015) and refined by ClinGen.

The Rare Genomes Project 

The Rare Genomes Project (RGP) is a direct-to-participant research study launched to discover the impact of genome sequencing on rare disease diagnosis, improve access to genomic research, contribute to novel gene discovery, and assess the impact of a genetic diagnosis. The RGP team consists of molecular diagnosticians, genetic counselors, clinicians, genomic analysts, computational biologists, software engineers, and project managers. Families in the United States with undiagnosed, but suspected, Mendelian/monogenic diseases apply directly online to the study. Applications are reviewed by the clinical team to confirm a reasonable suspicion for monogenic disease. Research subjects are consented for genomic sequencing and the sharing of their sequence and phenotype information with researchers working to understand the molecular causes of rare disease. When variants of clear or potential diagnostic relevance are identified, the candidate variants are clinically validated and returned to participants via their local physicians.

Prediction challenge

The prediction challenge involves 24 probands with rare diseases: an Example Set (4 probands) and a Test Set (20 probands). Descriptions of these sets are included below. The clinical phenotype for each proband is provided in the form of Human Phenotype Ontology (HPO) nomenclature (Robinson & Mundlos, 2010). A machine-readable compilation of HPO terms and basic patient information is provided in GA4GH phenopacket format (Jacobsen et al., 2022). The diversity of phenotypes in the dataset represents the range of clinical presentations routinely seen in patients referred for genetic testing, and most individuals who participate in RGP have had prior negative genetic testing. Participants in this CAGI challenge are asked to call variants and then provide a genetic diagnosis for as many probands as possible; that is, to identify one or more causal variants for each proband. Small variants, structural variants, mitochondrial variants, and tandem repeat expansions are included in the challenge. 
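As an illustration of how the phenopacket data might be consumed, the following minimal Python sketch extracts the observed HPO terms from a phenopacket-style JSON document. The field names (`phenotypicFeatures`, `type`, `excluded`) follow version 2 of the GA4GH Phenopacket schema; the inline record and the function name are hypothetical, and the files distributed with the challenge may differ in schema version or layout.

```python
import json

# Hypothetical phenopacket-like record for illustration only;
# it is not taken from the challenge data.
EXAMPLE_JSON = """
{
  "id": "example-proband",
  "subject": {"id": "example-proband", "sex": "FEMALE"},
  "phenotypicFeatures": [
    {"type": {"id": "HP:0001250", "label": "Seizure"}},
    {"type": {"id": "HP:0001263", "label": "Global developmental delay"}},
    {"type": {"id": "HP:0000252", "label": "Microcephaly"}, "excluded": true}
  ]
}
"""


def observed_hpo_terms(phenopacket):
    """Return (id, label) pairs for features that are not marked as excluded."""
    terms = []
    for feature in phenopacket.get("phenotypicFeatures", []):
        # In schema v2, "excluded": true means the feature was looked for and not found.
        if feature.get("excluded", False):
            continue
        hpo = feature.get("type", {})
        terms.append((hpo.get("id"), hpo.get("label")))
    return terms


packet = json.loads(EXAMPLE_JSON)
hpo_terms = observed_hpo_terms(packet)
print(hpo_terms)
```

Skipping excluded features keeps explicitly absent phenotypes out of any downstream phenotype-matching step; whether the challenge phenopackets use this field at all is an assumption.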

Unsolved probands have been included in the challenge with the objective of enabling the CAGI community to identify new potentially causal variants. Top variant candidates in the Test Set are likely to undergo further experimental and clinical evaluation, potentially leading to results that will be returned to the patients. This process led to genetic diagnoses for two “unsolved” cases in the RGP challenge of CAGI6 (Stenton et al., 2024).

The genomic data were obtained by sequencing DNA purified from blood. Sequencing was performed by the Broad Institute Genomics Platform on an Illumina sequencer to 30x depth. Raw sequence reads were aligned to a reference genome (GRCh38). The sequence data are provided as CRAM files.

Prediction submission format 

The prediction submission is a tab-delimited text file. For each proband, rank the proposed causative variant(s) (CV) responsible for the phenotype, one variant or pair of proposed compound heterozygous variants per line (i.e., with line 1 containing the top proposed CV). Each CV should be associated with an estimated probability of causal relationship (EPCR) value, a real number greater than 0 and at most 1 indicating the degree of certainty in the causality of the variant(s). For each proband, variants should be ranked in descending order of the provided EPCR values.

Each line should be in the following format: PROBANDID:CLASS:CHR:LOCUS:REF:ALT:GENOTYPE:VAF:GENES:EPCR for single causative variants (including homozygous variants) or PROBANDID:CLASS:CHR:LOCUS:REF:ALT:GENOTYPE:VAF:GENES:EPCR;CLASS:CHR:LOCUS:REF:ALT:GENOTYPE:VAF:GENES:EPCR for proposed compound heterozygous recessive etiologies, where PROBANDID = proband identifier that includes family ID. 

The variant CLASS indicates the type of variant, with a list of variant class codes provided below.

Code                  Description
SNV                   Single nucleotide variant
INDEL                 Small insertion or deletion
MSNV                  Mitochondrial single nucleotide variant
MINDEL                Mitochondrial small insertion or deletion
TRE                   Tandem repeat expansion
BND                   Translocation
CNV                   Copy number variant
CPX                   Complex SV
CTX                   Reciprocal chromosomal translocation
DEL                   Deletion
DUP                   Duplication
INS                   Insertion
INS_ME                Mobile element insertion of unspecified ME class
INS_ME_ALU            Alu element insertion
INS_ME_LINE1          LINE1 element insertion
INS_ME_SVA            SVA element insertion
INS_UNK               Sequence insertion of unspecified origin
INV                   Inversion
CPX_TYPE_INS_iDEL     Insertion with deletion at insertion site
CPX_TYPE_INVdel       Complex inversion with 3' flanking deletion
CPX_TYPE_INVdup       Complex inversion with 3' flanking duplication
CPX_TYPE_dDUP         Dispersed duplication
CPX_TYPE_dDUP_iDEL    Dispersed duplication with deletion at insertion site
CPX_TYPE_delINV       Complex inversion with 5' flanking deletion
CPX_TYPE_delINVdel    Complex inversion with 5' and 3' flanking deletions
CPX_TYPE_delINVdup    Complex inversion with 5' flanking deletion and 3' flanking duplication
CPX_TYPE_dupINV       Complex inversion with 5' flanking duplication
CPX_TYPE_dupINVdel    Complex inversion with 5' flanking duplication and 3' flanking deletion
CPX_TYPE_dupINVdup    Complex inversion with 5' and 3' flanking duplications
CPX_TYPE_piDUP_FR     Palindromic inverted tandem duplication, forward-reverse orientation
CPX_TYPE_piDUP_RF     Palindromic inverted tandem duplication, reverse-forward orientation

Genotype may be indicated as 0/1 (heterozygous or heteroplasmic) or 1/1 (homozygous or homoplasmic). The fraction of reads supporting the alternative allele (the variant allele fraction, VAF) should be provided as a value between 0 and 1. All genes expected to be affected by the variant should be listed using HGNC nomenclature, separated by commas.
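To make the line format concrete, here is a minimal Python sketch that assembles one submission line from its fields. It assumes the fields are joined with colons exactly as in the format string above, that two variants of a compound heterozygous pair are joined with a semicolon, and that VAF is computed as alternative reads over reference plus alternative reads. The proband identifier, gene names, coordinates, chromosome naming, and number formatting are hypothetical placeholders, not values from the challenge.

```python
def variant_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads supporting the alternative allele.

    Assumption: only reference and alternative reads are counted.
    """
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def format_variant(cls, chrom, locus, ref, alt, genotype, vaf, genes, epcr):
    """Join the nine per-variant fields with colons; genes are comma-separated."""
    return ":".join([
        cls, chrom, str(locus), ref, alt, genotype,
        f"{vaf:.2f}", ",".join(genes), f"{epcr:.3f}",
    ])


def format_line(proband_id, variants):
    """Build one line: a single variant, or two for a compound heterozygous pair."""
    if len(variants) not in (1, 2):
        raise ValueError("a line holds one variant or one pair of variants")
    return proband_id + ":" + ";".join(format_variant(**v) for v in variants)


# Hypothetical example values (placeholder proband, gene, and coordinates).
single = format_line("PROBAND_X", [{
    "cls": "SNV", "chrom": "chr1", "locus": 12345, "ref": "A", "alt": "G",
    "genotype": "0/1", "vaf": variant_allele_fraction(14, 16),
    "genes": ["GENE1"], "epcr": 0.8,
}])
print(single)  # PROBAND_X:SNV:chr1:12345:A:G:0/1:0.53:GENE1:0.800
```

Passing a list of two dictionaries to `format_line` produces the semicolon-joined form used for proposed compound heterozygous etiologies.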

The EPCR values provided are important for assessment (see section Assessment below). The EPCR scores should reflect, as closely as possible, the probability that the variant or variants are causal, and must lie in the (0,1] range, that is, greater than zero and less than or equal to one.

Each variant's CHR:LOCUS:REF:ALT must be given in coordinates of the human genome build GRCh38. Up to 100 CV candidates (lines) can be listed for each proband from the Test Set. An example submission file and a validation script will be provided so that predictors can check the correctness of the format before submitting predictions.
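Until the official validation script is available, the rules above can be approximated with a short self-check. The Python sketch below is not the organizers' validator: it assumes colon-separated fields with no colons inside any field, uses the EPCR of the first variant on a line for the ordering check, and all function names and example lines are hypothetical.

```python
CLASS_CODES = set("""
SNV INDEL MSNV MINDEL TRE BND CNV CPX CTX DEL DUP INS INS_ME INS_ME_ALU
INS_ME_LINE1 INS_ME_SVA INS_UNK INV CPX_TYPE_INS_iDEL CPX_TYPE_INVdel
CPX_TYPE_INVdup CPX_TYPE_dDUP CPX_TYPE_dDUP_iDEL CPX_TYPE_delINV
CPX_TYPE_delINVdel CPX_TYPE_delINVdup CPX_TYPE_dupINV CPX_TYPE_dupINVdel
CPX_TYPE_dupINVdup CPX_TYPE_piDUP_FR CPX_TYPE_piDUP_RF
""".split())
GENOTYPES = {"0/1", "1/1"}


def in_range(text, low, high, low_inclusive):
    """True if text parses as a number in [low, high] or (low, high]."""
    try:
        value = float(text)
    except ValueError:
        return False
    return (low <= value <= high) if low_inclusive else (low < value <= high)


def check_variant(fields):
    """Return the problems found in one nine-field variant record."""
    if len(fields) != 9:
        return [f"expected 9 variant fields, found {len(fields)}"]
    cls, _chrom, _locus, _ref, _alt, genotype, vaf, _genes, epcr = fields
    problems = []
    if cls not in CLASS_CODES:
        problems.append(f"unknown variant class {cls!r}")
    if genotype not in GENOTYPES:
        problems.append(f"unexpected genotype {genotype!r}")
    if not in_range(vaf, 0.0, 1.0, low_inclusive=True):
        problems.append(f"VAF {vaf!r} is not between 0 and 1")
    if not in_range(epcr, 0.0, 1.0, low_inclusive=False):
        problems.append(f"EPCR {epcr!r} is not in (0,1]")
    return problems


def check_line(line):
    """Check one line: PROBANDID, then one or two variant records."""
    proband, _, rest = line.strip().partition(":")
    records = rest.split(";")
    problems = [] if proband else ["missing proband identifier"]
    if len(records) > 2:
        problems.append("more than two variants on one line")
    for record in records:
        problems.extend(check_variant(record.split(":")))
    return problems


def check_file(lines, max_per_proband=100):
    """Check all lines, the per-proband line limit, and descending EPCR order."""
    problems, counts, previous = [], {}, {}
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        found = check_line(line)
        problems.extend(f"line {number}: {p}" for p in found)
        if found:
            continue
        proband = line.split(":", 1)[0]
        # Assumption: the first variant's EPCR determines the line's rank.
        epcr = float(line.split(";")[0].rsplit(":", 1)[1])
        counts[proband] = counts.get(proband, 0) + 1
        if counts[proband] == max_per_proband + 1:
            problems.append(f"line {number}: more than {max_per_proband} lines for {proband}")
        if proband in previous and epcr > previous[proband]:
            problems.append(f"line {number}: EPCR not in descending order for {proband}")
        previous[proband] = epcr
    return problems


# Hypothetical example lines (placeholder proband, genes, and coordinates).
example_lines = [
    "PROBAND_X:SNV:chr1:12345:A:G:0/1:0.53:GENE1:0.800",
    "PROBAND_X:SNV:chr2:5000:C:T:0/1:0.48:GENE2:0.100;INDEL:chr2:6000:CA:C:0/1:0.50:GENE2:0.100",
]
print(check_file(example_lines))
```

An empty list means no problems were detected by these checks; it does not guarantee that the official validation will pass.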

File naming

CAGI allows submission of up to six models per team, of which model number 1 is considered primary. You can upload predictions for each model multiple times during the submission window; the last submission before the deadline will be evaluated for each model.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
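A quick way to check a filename against this pattern is a regular expression. The sketch below is a minimal Python example; because the permitted characters in a team name are not specified here, it accepts any non-empty team name, which is an assumption, and the function name is hypothetical.

```python
import re

# <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
FILENAME_PATTERN = re.compile(r"(?P<team>.+)_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)")


def parse_submission_filename(filename):
    """Return (team, model number, extension), or None if the name does not match."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    return match.group("team"), int(match.group("model")), match.group("ext")


parsed = parse_submission_filename("bestincagi_model_3.txt")
print(parsed)  # ('bestincagi', 3, 'txt')
```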

Each model must include a detailed description of your method(s). Results for a model will not be assessed without adequate information on how the model works. For this, please use the method description form and rename it to <teamname>_description_model_(1|2|3|4|5|6).xlsx. When applicable, include any publicly available databases used (with version or access date) and selection thresholds (e.g., for variant quality, frequency, functional consequence, and so on). 

Additionally, upon submission, predictors will be asked to complete a brief survey about their method(s) and protocol(s). The draft survey is available here.

The results from this challenge must remain confidential until the challenge is completed. Please do not mark these cases as solved in any systems that share variant data, to avoid leaking information across teams and biasing the challenge assessment.

Example set 

Short-read genome sequence data with phenotype data and genetic diagnoses from 4 solved probands are provided for training purposes (CRAM files and a metadata CSV file).

Test set 

Short-read genome sequencing data with phenotype data are provided for a mix of 20 solved and unsolved RGP probands. The prediction submission file for these 20 probands will be used for assessment.

Handling CRAM files 

CRAM files are a compressed, reference-based format for storing aligned sequence reads, and they serve as input for a range of variant-calling tools. For single nucleotide variants (SNVs) and short insertions and deletions (indels), the widely adopted GATK HaplotypeCaller, DeepVariant, and DRAGEN are possible options. Mitochondrial variants (SNVs and indels) are often called with dedicated workflows using tools such as Mutect2 from the GATK suite, which includes a mode and parameters designed for the high copy number of mitochondrial DNA and the low allele fractions of heteroplasmic variants. Structural variants (SVs), including large deletions, duplications, inversions, and translocations, can be identified using tools such as Manta, which integrates paired-end and split-read evidence to discover and assemble breakpoints, or DELLY and LUMPY, which use read-pair and split-read signatures. Ensemble approaches such as GATK-SV combine multiple tools. Finally, variation in tandem repeats (TRs) often requires specialized tools such as GangSTR or ExpansionHunter, because TR regions are challenging for standard callers; these tools explicitly model the repetitive nature of the sequence to genotype the number of repeat units.
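As a starting point, the Python sketch below assembles command lines for two of the tools named above, GATK HaplotypeCaller and Mutect2 in mitochondria mode, without running them. The file names are hypothetical, the `chrM` contig name assumes UCSC-style naming in the GRCh38 reference, and the flags shown should be checked against the documentation of the GATK version in use.

```python
import shlex


def haplotypecaller_command(cram, reference, out_gvcf):
    """GATK4 HaplotypeCaller on a CRAM file, emitting a per-sample GVCF."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,   # reference FASTA; needed to decode CRAM
        "-I", cram,
        "-O", out_gvcf,
        "-ERC", "GVCF",
    ]


def mutect2_mito_command(cram, reference, out_vcf, mito_contig="chrM"):
    """GATK4 Mutect2 restricted to the mitochondrial contig."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", cram,
        "-L", mito_contig,        # assumed contig name for the mitochondrial genome
        "--mitochondria-mode",
        "-O", out_vcf,
    ]


# Hypothetical file names for illustration only.
commands = [
    haplotypecaller_command("proband.cram", "GRCh38.fasta", "proband.g.vcf.gz"),
    mutect2_mito_command("proband.cram", "GRCh38.fasta", "proband.chrM.vcf.gz"),
]
for command in commands:
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the tools and reference files are in place; building the argument list separately makes it easy to log exactly what was run for the method description.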

If you perform the challenge after remapping the reads to GRCh37, please use a liftover tool from UCSC or the Broad Institute to convert the variant coordinates back to GRCh38 for submission.

Assessment 

Evaluation of the Test Set submission file will resemble that of the CAGI6 Rare Genomes Project Challenge (Stenton et al., 2024), but additional strategies will also evaluate the calibration of the EPCR values in each submission.

Predictions will be assessed by independent assessors, blinded to the identity of the teams and the methods. It is important to mention that causal variants in the solved probands may not be known with absolute certainty. The answer key used in our assessment therefore reflects the best of our team’s abilities to identify causal variants by applying available evidence and following current clinical field standards. 

For previously identified CVs (true positives in the test set), assessors will review how often these were the top-ranked variant(s). A weighted score based on these rank positions will be used as an indicator of model performance. The EPCR values will also be used to assess how often a model identifies true positive CVs versus false positive CVs across a set of EPCR thresholds, to provide a measure of model sensitivity, specificity, and positive predictive value. Partial credit will be given for probands where only one of two causal variants in a gene is identified. Additional assessments will be made in an informed phase in order to appropriately weigh rankings with the information collected in the methods survey. The combined assessments will determine the top-performing teams. For the top-performing teams, genomic analysts on the RGP team will re-review the variants for undiagnosed probands in both the Test Set and, when provided, the Discovery Set with calibrated EPCR values ≥0.1 to see if they are diagnostic or merit further evaluation.
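The threshold-based part of this assessment can be illustrated with a small Python sketch. This is not the assessors' procedure: the rank weighting and the definition of negatives needed for specificity are not given here, so only sensitivity and positive predictive value are computed, and the predictions and counts below are hypothetical.

```python
def metrics_at_threshold(predictions, n_known, threshold):
    """Count calls with EPCR >= threshold.

    predictions: (epcr, is_true_causal) pairs pooled across probands.
    n_known: number of known causal variants in the answer key.
    """
    called = [is_true for epcr, is_true in predictions if epcr >= threshold]
    tp = sum(called)
    fp = len(called) - tp
    sensitivity = tp / n_known if n_known else 0.0
    ppv = tp / len(called) if called else None  # undefined when nothing is called
    return {"tp": tp, "fp": fp, "sensitivity": sensitivity, "ppv": ppv}


# Hypothetical predictions: (EPCR, whether the variant is in the answer key).
predictions = [(0.9, True), (0.6, False), (0.4, True), (0.2, False), (0.05, False)]
n_known = 3  # hypothetical; one known causal variant was never submitted

for threshold in (0.5, 0.1):
    print(threshold, metrics_at_threshold(predictions, n_known, threshold))
```

Sweeping the threshold in this way shows why well-calibrated EPCR values matter: inflated scores on non-causal candidates lower the positive predictive value at every threshold they exceed.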

The participating teams may want to ensure that at least one of their submissions normalizes the output scores over the listed variants (per proband) to sum to a value no greater than 1. Such predictions are likely to contain a smaller number of potentially causal variants and may be of interest for both the assessment and follow-up by the RGP team.
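One simple way to achieve this is proportional rescaling of each proband's scores. The Python sketch below is only one possible choice, not a prescribed method; the function name and example scores are hypothetical.

```python
def normalize_epcr(scores):
    """Scale one proband's EPCR values so that they sum to at most 1.

    Scores that already sum to 1 or less are returned unchanged.
    Dividing by the total preserves the ranking of the variants.
    """
    total = sum(scores)
    if total <= 1.0:
        return list(scores)
    return [score / total for score in scores]


# Hypothetical EPCR values for one proband, already in descending order.
raw_scores = [0.8, 0.6, 0.2]
normalized = normalize_epcr(raw_scores)
print(normalized)
```

If the normalized values are rounded when written to the submission file, it may be worth rounding down so that the written values still sum to no more than 1.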

Ethical considerations 

The data in this challenge are derived from individuals with rare, potentially undiagnosed, diseases and their close biological relatives, and may include families who are medically underserved. Participants sign up for RGP because they are interested in research to improve rare disease diagnosis, and activities like CAGI challenges are consistent with this goal. Identification of putative pathogenic variants (i.e., causal with respect to the clinical phenotype under investigation) may, if confirmed, be important for tailoring clinical interventions and obtaining services. Families have consented that data will be analyzed for findings related to the stated phenotype and that no active searching is conducted for variants unrelated to the rare condition in the family. Predictors are reminded to submit only causal variant candidates considered to be of relevance to the proband’s provided disease phenotype (i.e., do not submit secondary findings). We also remind all teams to ensure the ethical stewardship of challenge data by limiting the use of these data to CAGI challenge participants only. The data should be destroyed after the challenge, and no party shall make any effort to identify the source of the data.

Data Use Agreement: special considerations 

All teams participating in any CAGI challenge are required to sign and adhere to the CAGI Data Use Agreement. Additionally, in this challenge, an institutional signature must be provided in order to obtain individual-level de-identified data. All communication for providing institutional signatures must come from valid institutional email addresses, and the signed form must be emailed by a person who attests to having the authority to provide such a signature on behalf of the institution. To do so, please complete this form and email it to cagi@genomeinterpretation.org from a valid institutional email address (Gmail, Hotmail, etc. cannot be accepted). Note: a separate form must be signed for every person who will “touch” the data. Finally, CAGI must follow United States export control laws when providing data to entities outside of the United States of America or to those that are majority-owned by countries other than the United States. The same approval can be used for both the RGP-VCF and RGP-CRAM challenges.

Due to the listed special considerations, challenge data are provided in encrypted form to approved participants. We expect to provide decryption passwords to approved participating teams using encrypted email communication (e.g., PGP), via end-to-end encrypted apps (e.g., Signal), or verbally by phone. Please allow extended time to complete institutional signatures and obtain approvals from CAGI organizers. In previous CAGI challenges, this process has taken from a few days to several weeks, depending on the preparedness of an institution to handle such requests.

Related challenges

Data provided by the RGP team, including:

Heidi Rehm, PhD, FACMG, RGP Principal investigator; Anne O’Donnell-Luria, MD, PhD, RGP Medical Director; Melanie O’Leary, CGC, RGP Project Lead; Stephanie DiTroia, PhD, Principal Genomic Analyst 

References 

Jacobsen JOB, et al. The GA4GH Phenopacket schema defines a computable representation of clinical data. Nat Biotechnol (2022) 40(6):817-820.

Rehm HL. Evolving health care through personal genomics. Nat Rev Genet (2017) 18(4):259-267.

Richards S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med (2015) 17(5):405-424.

Robinson PN, Mundlos S. The human phenotype ontology. Clin Genet (2010) 77(6):525-534.

Stenton SL, et al. Critical assessment of variant prioritization methods for rare disease diagnosis within the rare genomes project. Hum Genomics (2024) 18(1):44.

Wojcik MH, et al. Genome sequencing for diagnosing rare diseases. N Engl J Med (2024) 390(21):1985-1997.

Revision history 

30 November 2025: challenge preview posted