Variants of BRCA1 and BRCA2: predict which variants are associated with increased risk of breast cancer by ENIGMA

Challenge: ENIGMA

Dataset description: public

Dataset availability: registered users only

Dataset availability: Germline cancer

Data: registered users only, limited by CAGI Data Use Agreement

Last updated: 27 April 2018

This challenge will tentatively close at 8:00 PM PST (Pacific Standard Time) on 1 May 2018.

Download answer key and predictions: registered users only, limited by the CAGI Data Use Agreement.

Presentations from the CAGI 5 conference: registered users only, limited by the CAGI Data Use Agreement.


Breast cancer is the most prevalent cancer among women worldwide. The association between germline mutations in the BRCA1 and BRCA2 genes and the development of cancer is well established. The most common high-risk mutations associated with breast cancer are those in the autosomal dominant breast cancer genes 1 and 2 (BRCA1 and BRCA2); mutations in these genes are found in 1-3% of breast cancer cases. The challenge is to predict which variants are associated with increased risk of breast cancer.


In normal cells, the BRCA1 and BRCA2 genes are involved in homologous recombination for double-strand break repair. Mutations in these genes have been linked to the development of breast and ovarian cancer (Rehm et al., 2015). The ENIGMA consortium is an international consortium focused on determining the clinical significance of sequence variants in BRCA1, BRCA2 and other known or suspected breast cancer related genes, providing expert input to global database and classification initiatives, and exploring optimal avenues for communicating such information at the provider and patient level.

Variants included in the dataset have been classified according to the IARC 5-tier classification scheme using multifactorial likelihood analysis. The procedure assesses clinically calibrated bioinformatics information and clinical information (pathology, segregation, co-occurrence, family history, case-control) for each variant to produce a likelihood of pathogenicity. Likelihood values were calibrated against the features of known high-risk cancer-causing variants in BRCA1/2 (Goldgar et al., 2008; Plon et al., 2008). Each variant is assigned to one of five classes depending on its pathogenicity likelihood, as shown in the table. A combination of public and unpublished information has been used to arrive at the final classifications, and all the classifications provided in the dataset for this challenge are either new or improved compared to what is in the public domain.

These data will be included in a publication being developed by the ENIGMA consortium.

Class                       Probability of pathogenicity
5: Pathogenic               >0.99
4: Likely pathogenic        0.95-0.99
3: Uncertain                0.05-0.949
2: Likely not pathogenic    0.001-0.049
1: Not pathogenic           <0.001
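
The class boundaries in the table above can be read as a simple threshold lookup. The following is a minimal sketch, not code provided by the organizers; the function name `iarc_class` is hypothetical. The table leaves small gaps between 0.949 and 0.95 and between 0.049 and 0.05, so the sketch assumes that values falling in those gaps belong to the lower class (3 and 2, respectively).

```python
def iarc_class(p: float) -> int:
    """Map a probability of pathogenicity to an IARC class (1-5).

    Assumption: values in the gaps the table leaves open
    (0.949-0.95 and 0.049-0.05) are assigned to the lower class.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {p}")
    if p > 0.99:
        return 5  # Pathogenic
    if p >= 0.95:
        return 4  # Likely pathogenic
    if p >= 0.05:
        return 3  # Uncertain
    if p >= 0.001:
        return 2  # Likely not pathogenic
    return 1  # Not pathogenic
```

For example, `iarc_class(0.5)` returns 3 (Uncertain) and `iarc_class(0.0005)` returns 1 (Not pathogenic).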

Prediction challenge

For each variant, participants are asked to submit a prediction of the probability that the variant is pathogenic according to the ENIGMA consortium classifications. Optionally, a comment on the basis of the prediction may be given.

Prediction submission format

The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. Five columns are designated for the probability of assignment to each class (Not pathogenic, Likely not pathogenic, Uncertain, Likely pathogenic, and Pathogenic). Each column must contain a probability between 0 and 1.0, and the five probabilities must sum to 1.0. Optionally, a comment on the basis of the prediction may be given in the last column. In addition, a validation script is provided, and predictors must use it to check the correctness of the format before submitting their predictions. In the submitted file, each row includes the columns given in the template file.

Note: we are asking participants to predict the probability (and confidence) of only the Pathogenic class, as indicated in the submission template file.

In the template file, cells in columns 4-6 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must enter a probability for each pathogenicity class for every variant. Optionally, enter a brief comment on the basis of the prediction; otherwise, leave the "*" in these cells. Please follow the submission guidelines strictly.
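
Before running the official validation script, the rules above (no empty cells, "*" replaced by a value, probabilities between 0 and 1.0) can be checked locally. The sketch below is not the organizers' validation script and does not replace it. The function name `check_row` is hypothetical, and the column layout is an assumption: it treats column 4 as the Pathogenic probability, so adjust `prob_cols` to match the actual template.

```python
def check_row(fields, prob_cols=(3,)):
    """Return a list of format problems for one tab-split submission row.

    fields: the cells of one row, e.g. line.rstrip("\n").split("\t").
    prob_cols: zero-based indices of cells that must hold a probability.
               Assumption: column 4 (index 3) is the Pathogenic probability.
    """
    problems = []
    # No empty cells are allowed anywhere in the row.
    for i, cell in enumerate(fields):
        if cell.strip() == "":
            problems.append(f"column {i + 1}: empty cell")
    for i in prob_cols:
        if i >= len(fields):
            problems.append(f"column {i + 1}: missing")
            continue
        cell = fields[i].strip()
        if cell == "":
            continue  # already reported as an empty cell
        if cell == "*":
            problems.append(f"column {i + 1}: placeholder '*' not replaced")
            continue
        try:
            value = float(cell)
        except ValueError:
            problems.append(f"column {i + 1}: not a number: {cell!r}")
            continue
        if not 0.0 <= value <= 1.0:
            problems.append(f"column {i + 1}: value outside [0, 1]: {cell}")
    return problems
```

A row passes when the returned list is empty. A "*" left in a comment cell is accepted, since the instructions allow it there.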

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.


Predictors will not be assessed directly against the numeric probability of the ENIGMA classification. Rather, the ENIGMA classification will be used as a weight to assess CAGI predictions. Based on the ENIGMA classification, variants will receive the following weights in the assessment:

Class                       Weight
5: Pathogenic               High weight
4: Likely pathogenic        Medium weight
3: Uncertain                No weight
2: Likely not pathogenic    Medium weight
1: Not pathogenic           High weight
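
To illustrate how class-based weights could enter a score, the sketch below computes a weighted mean squared error between the predicted probability of pathogenicity and a 0/1 label. It is purely illustrative and is not the assessors' method: the numeric weights (1.0, 0.5, 0.0), the choice of metric, the binary labelling (classes 4 and 5 as pathogenic, classes 1 and 2 as not pathogenic), and the name `weighted_brier` are all assumptions, since the source only states "High", "Medium" and "No" weight.

```python
# Hypothetical numeric weights; the challenge text gives only qualitative levels.
CLASS_WEIGHT = {5: 1.0, 4: 0.5, 3: 0.0, 2: 0.5, 1: 1.0}


def weighted_brier(predictions, classes):
    """Weighted mean squared error of predicted P(pathogenic).

    predictions: predicted probabilities of pathogenicity, one per variant.
    classes: ENIGMA classes (1-5) for the same variants.
    Class 3 (Uncertain) has zero weight and does not affect the score.
    """
    numerator = 0.0
    denominator = 0.0
    for p, c in zip(predictions, classes):
        weight = CLASS_WEIGHT[c]
        if weight == 0.0:
            continue
        # Assumed labelling: classes 4-5 count as pathogenic, 1-2 as not.
        label = 1.0 if c >= 4 else 0.0
        numerator += weight * (p - label) ** 2
        denominator += weight
    if denominator == 0.0:
        raise ValueError("no weighted variants to score")
    return numerator / denominator
```

Lower values are better under this sketch; predictions for class 3 variants never change the result.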

Download dataset: This dataset file is available only to registered users. Please log in to access the file.

Download submission template - missense variants only: This submission template file is available only to registered users. Please log in to access the file.

Download validation script: This submission validation script is available only to registered users. Please log in to access the file.


Goldgar DE, Easton DF, Byrnes GB, Spurdle AB, Iversen ES, Greenblatt MS, Group IUGVW. 2008. Genetic evidence and integration of various data sources for classifying uncertain variants into a single model. Hum Mutat 29:1265-1272. PMCID:PMC2936773. doi:10.1002/humu.20897

Plon SE, Eccles DM, Easton D, Foulkes WD, Genuardi M, Greenblatt MS, Hogervorst FB, Hoogerbrugge N, Spurdle AB, Tavtigian SV, Group IUGVW. 2008. Sequence variant classification and reporting: Recommendations for improving the interpretation of cancer susceptibility genetic test results. Hum Mutat 29:1282-1291. PMCID:PMC3075918. doi:10.1002/humu.20880

Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, Ledbetter DH, Maglott DR, Martin CL, Nussbaum RL, Plon SE, Ramos EM, Sherry ST, Watson MS, ClinGen. 2015. ClinGen - the Clinical Genome Resource. New Engl J Med 372:2235-2242. doi:10.1056/NEJMsr1406261

Data provided by

Amanda Spurdle, QIMR Berghofer Medical Research Institute (Australia), and the ENIGMA consortium.

Revision history

20 Dec 2017 (v01): initial release

26 Jan 2018 (v02): Note added

24 Sep 2018 (v03): Dataset availability added