Variants of BRCA1 and BRCA2: predict which variants are associated with increased risk of breast cancer by ENIGMA

Challenge: ENIGMA
Dataset description: public
Category: Germline cancer
Data: registered users only, limited by CAGI Data use agreement
Last updated: 12 Jan 2018
This challenge will tentatively close at 8:00 PM PST (Pacific Standard Time) on TBD.


Breast cancer is the most prevalent cancer among women worldwide. The association between germline mutations in the BRCA1 and BRCA2 genes and the development of cancer is well established: mutations in these autosomal dominant breast cancer genes are the most common high-risk mutations associated with breast cancer, and they are found in 1-3% of breast cancer cases. The challenge is to predict which variants are associated with increased risk of breast cancer.

In normal cells, the BRCA1 and BRCA2 genes are involved in homologous recombination for double-strand break repair. Mutations in these genes have been linked to the development of breast and ovarian cancer (Rehm et al., 2015). The ENIGMA consortium is an international consortium focused on determining the clinical significance of sequence variants in BRCA1, BRCA2 and other known or suspected breast cancer related genes, providing expert input to global database and classification initiatives, and exploring optimal avenues of communication of such information at the provider and patient level.

Variants included in the dataset have been classified according to the IARC 5-tier classification scheme using multifactorial likelihood analysis. The procedure assesses clinically-calibrated bioinformatics information and clinical information (pathology, segregation, co-occurrence, family history, case-control) for each variant to produce a likelihood of pathogenicity. Likelihood values were calibrated against the features of known high-risk cancer-causing variants in BRCA1/2 (Goldgar et al., 2008; Plon et al., 2008). Each variant is assigned to one of five classes depending on the pathogenicity likelihood, as shown in the table below. A combination of public and unpublished information has been used to arrive at the final classifications, and all the classifications provided in the dataset for this challenge are either new or improved compared to what is in the public domain.

These data will be included in a publication being developed by the ENIGMA consortium.

Class                     Probability of Pathogenicity
5: Pathogenic             >0.99
4: Likely pathogenic      0.95-0.99
3: Uncertain              0.05-0.949
2: Likely not pathogenic  0.001-0.049
1: Not pathogenic         <0.001
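
As an illustration of how the class table above reads, the following minimal sketch maps a probability of pathogenicity to an IARC class number. The function name `iarc_class` is hypothetical and is not part of the challenge materials. The table leaves small gaps between tiers (for example between 0.949 and 0.95) and lists 0.99 under class 4 only by implication; the sketch assumes that values in a gap fall into the lower class and that exactly 0.99 is class 4.

```python
def iarc_class(p):
    """Return the IARC class (1-5) for a probability of pathogenicity.

    Assumption (not stated in the table): values that fall in the small
    gaps between tiers are assigned to the lower class, and p == 0.99
    counts as class 4 because class 5 requires p > 0.99.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability must lie in [0, 1], got {p!r}")
    if p > 0.99:
        return 5  # Pathogenic
    if p >= 0.95:
        return 4  # Likely pathogenic
    if p >= 0.05:
        return 3  # Uncertain
    if p >= 0.001:
        return 2  # Likely not pathogenic
    return 1      # Not pathogenic
```

For example, `iarc_class(0.5)` falls in the Uncertain tier, while `iarc_class(0.0005)` falls in the Not pathogenic tier.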

Prediction challenge
For each variant, participants are asked to submit a prediction of the probability that the variant is pathogenic according to the ENIGMA consortium classifications. Optionally, a comment on the basis of the prediction may be given.

Prediction submission format
The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. Five columns are designated for the probability of assignment to each class (Not pathogenic, Likely not pathogenic, Uncertain, Likely pathogenic, and Pathogenic). Each column must contain a probability between zero and 1.0, and the five probabilities must sum to 1.0. Optionally, a comment on the basis of the prediction may be given in the last column. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions. In the submitted file, each row includes the following columns:

  1. Column 1: Gene - BRCA1 or BRCA2
  2. Column 2: DNA Variant - The DNA variant as listed in the prediction dataset file relative to the cDNA. A set of 429 variants (146 in BRCA1 (NM_007294.3); 178 in BRCA2 (NM_000059.3)).
  3. Column 3: Protein Variant - The amino acid substitution where appropriate. UniProt IDs: BRCA1 (P38398) and BRCA2 (P51587).
  4. Column 4: The probability (P) that the variant is pathogenic (0-1). 1 means pathogenic, while 0 means non-pathogenic.
  5. Column 5: The confidence of each prediction probability in the form of a standard deviation (SD). A high SD means low confidence, while a small SD means that the predictor is confident about the submitted prediction.
  6. Column 6: Optional comment

In the template file, cells in columns 4-6 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must enter a probability for each pathogenicity class for every variant. Optionally, enter a brief comment on the basis of the prediction; otherwise, leave the "*" in the comment cell. Please make sure you follow the submission guidelines strictly.
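
The format rules above can be sketched as a quick local check before running the organizers' validation script. This is not that script: the names `check_row` and `check_submission` are hypothetical, and the sketch assumes the six-column layout listed above, no header row, and that a leftover "*" is acceptable only in the comment column.

```python
import math

GENES = {"BRCA1", "BRCA2"}


def _number(cell):
    """Parse a cell as a finite float, or return None if that fails."""
    try:
        value = float(cell)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def check_row(fields):
    """Return a list of problems found in one tab-split row (empty if none)."""
    if len(fields) != 6:
        return [f"expected 6 columns, found {len(fields)}"]
    gene, _dna, _protein, p, sd, _comment = fields
    problems = []
    if any(cell.strip() == "" for cell in fields):
        problems.append("empty cell")
    if gene not in GENES:
        problems.append(f"unknown gene {gene!r}")
    p_value = _number(p)
    if p_value is None or not 0.0 <= p_value <= 1.0:
        problems.append(f"P must be a number in [0, 1], got {p!r}")
    sd_value = _number(sd)
    if sd_value is None or sd_value < 0.0:
        problems.append(f"SD must be a non-negative number, got {sd!r}")
    return problems


def check_submission(text):
    """Map 1-based row numbers to their problems; rows without problems are omitted."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        problems = check_row(line.split("\t"))
        if problems:
            report[number] = problems
    return report


# Hypothetical example rows; the variant names are placeholders, not dataset entries.
filled = "BRCA1\tc.1A>G\tp.Met1Val\t0.9\t0.05\t*"
unfilled = "BRCA1\tc.1A>G\tp.Met1Val\t*\t0.05\t*"
print(check_submission(filled + "\n" + unfilled + "\n"))
```

A row whose P cell still contains the template "*" is reported, while a "*" left in the comment column passes.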

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Predictors will not be assessed directly against the numeric probability of the ENIGMA classification. Rather, the ENIGMA classification will be used as a weight to assess CAGI predictions. Based on the ENIGMA classification, variants will receive the following weights in the assessment:

Class                     Weight
5: Pathogenic             High weight
4: Likely pathogenic      Medium weight
3: Uncertain              No weight
2: Likely not pathogenic  Medium weight
1: Not pathogenic         High weight
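
To make the effect of such weighting concrete, here is a minimal sketch of one way a class-weighted score could be computed. It is not the assessors' method: the numeric weights (1.0, 0.5, 0.0), the choice of a weighted squared error, the treatment of classes 4-5 as pathogenic and 1-2 as not pathogenic, and the name `weighted_squared_error` are all assumptions made for illustration.

```python
# Hypothetical numeric stand-ins for "High", "Medium" and "No" weight.
WEIGHTS = {5: 1.0, 4: 0.5, 3: 0.0, 2: 0.5, 1: 1.0}


def weighted_squared_error(predictions):
    """Weighted mean squared error over (enigma_class, predicted_p) pairs.

    Assumption: classes 4 and 5 have target 1.0 and classes 1 and 2 have
    target 0.0. Class 3 has weight 0.0, so its target never contributes.
    """
    numerator = 0.0
    denominator = 0.0
    for enigma_class, predicted_p in predictions:
        weight = WEIGHTS[enigma_class]
        target = 1.0 if enigma_class >= 4 else 0.0
        numerator += weight * (predicted_p - target) ** 2
        denominator += weight
    if denominator == 0.0:
        raise ValueError("no variant carries a non-zero weight")
    return numerator / denominator
```

Under these assumptions, a prediction for a class 3 (Uncertain) variant has no influence on the score, and an error on a class 5 or class 1 variant counts twice as much as the same error on a class 4 or class 2 variant.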

Download dataset
This dataset file is available only to registered users. Please log in to access the file.

Download submission template - missense variants only
This submission template file is available only to registered users. Please log in to access the file.

Download validation script
This submission validation script is available only to registered users. Please log in to access the file.


References
  1. Goldgar DE, Easton DF, Byrnes GB, Spurdle AB, Iversen ES, Greenblatt MS, IARC Unclassified Genetic Variants Working Group. 2008. Genetic evidence and integration of various data sources for classifying uncertain variants into a single model. Hum Mutat 29:1265-1272. PMCID:PMC2936773. doi:10.1002/humu.20897
  2. Plon SE, Eccles DM, Easton D, Foulkes WD, Genuardi M, Greenblatt MS, Hogervorst FB, Hoogerbrugge N, Spurdle AB, Tavtigian SV, IARC Unclassified Genetic Variants Working Group. 2008. Sequence variant classification and reporting: Recommendations for improving the interpretation of cancer susceptibility genetic test results. Hum Mutat 29:1282-1291. PMCID:PMC3075918. doi:10.1002/humu.20880
  3. Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, Ledbetter DH, Maglott DR, Martin CL, Nussbaum RL, Plon SE, Ramos EM, Sherry ST, Watson MS, ClinGen. 2015. ClinGen - the clinical genome resource. New Engl J Med 372:2235-2242. doi:10.1056/NEJMsr1406261

Dataset provided by

Amanda Spurdle, QIMR Berghofer Medical Research Institute (Australia), and the ENIGMA consortium

Revision history
20 Dec 2017 (v01): initial release