
Variants of BRCA1 and BRCA2: predict which variants are associated with increased risk of breast cancer

Dataset description: public

Exome sequence data: registered users only, limited by CAGI Data Use Agreement

This challenge closed on 31 October 2012.

BRCA Challenge answer key (43.5 KB, xls): public

Assessor summary (688 KB, zip): registered users only, limited by CAGI Data Use Agreement

Slides from the CAGI conference: registered users only, limited by CAGI Data Use Agreement

      Roland Dunbrack: Assessment (2.6 MB, remixable ppt)

Predictions (2.1 MB, zip): registered users only, limited by CAGI Data Use Agreement


In normal cells, the BRCA1 and BRCA2 genes are involved in homologous recombination for double-strand break repair and help ensure the stability of a cell's genetic material. Mutations in these genes have been linked to the development of breast and ovarian cancer (references below). Myriad Genetics created the BRACAnalysis test to assess a woman’s risk of developing hereditary breast or ovarian cancer by detecting mutations in the BRCA1 and BRCA2 genes. This test has become the standard of care for identifying individuals with hereditary breast and ovarian cancer (HBOC) syndrome. It is based on proprietary methods. For each variant in the dataset, Myriad Genetics made one of four classifications:

  • Deleterious
  • Benign
  • Genetic Variant, Favor Polymorphism (VFP)
  • Variant of Unknown Significance (VUS).

These designations are based on a database of patient testing, including the frequency of the variants in populations and the segregation of variants with disease in families. They are the gold standard in medical diagnosis: patients make life-changing decisions (e.g., to operate or not) based on these data. Precisely how Myriad Genetics assigns these designations, and their complete database of assignments, is proprietary. Nevertheless, by collecting BRACAnalysis patient test results from clinics where the tests were ordered, it was possible to determine these assignments for the variants observed in patients. These variants and their associated pathogenicity assessments were not found in the public domain.

Prediction challenge

For each variant, provide the probability that Myriad Genetics has classified it as deleterious (a probability between 0 and 1, plus a standard deviation).

Optional sub-challenge

Additionally, provide the probability that Myriad Genetics has assigned each of the benign, VFP, and VUS classifications (a probability between 0 and 1, plus a standard deviation, for each).

Predictions on variants designated VUS will not be included in evaluating the accuracy of the predictions. The assessor will, however, compare different predictions on the VUS variants to understand how different methods approach the question and the extent to which their predictions are consistent.
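The exclusion rule above can be sketched in a few lines. This is only an illustration, not the assessor's procedure: the function names, the dictionary layout, and the choice of a pairwise ROC AUC with "Deleterious" as the positive class are all assumptions made for the example.

```python
def exclude_vus(answer_key, predictions):
    """Drop variants that the answer key marks as VUS (or does not list)."""
    return {
        variant: p
        for variant, p in predictions.items()
        if answer_key.get(variant) not in (None, "VUS")
    }


def auc_deleterious(answer_key, predictions):
    """Pairwise ROC AUC for P(deleterious), computed on non-VUS variants only.

    Assumption for this sketch: Benign and VFP both count as negatives.
    """
    scored = exclude_vus(answer_key, predictions)
    pos = [p for v, p in scored.items() if answer_key[v] == "Deleterious"]
    neg = [p for v, p in scored.items() if answer_key[v] != "Deleterious"]
    if not pos or not neg:
        raise ValueError("need at least one deleterious and one non-deleterious variant")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

With a hypothetical key in which one variant is VUS, that variant's prediction has no effect on the score, however confident it is.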

Dataset information: A set of 100 variants (36 in BRCA1; 64 in BRCA2).

Download dataset: The dataset file is available to registered users only. Please log in to access the file.

Prediction submission format 

A flat file containing the list of variants will be provided. It has eight columns for predictions: a probability and a standard deviation for each of the four designations (deleterious, benign, VFP, and VUS). Please use the submission file template provided for your submission. A validation script is also provided, and predictors should use it to check that the format is correct before submitting their predictions.

The submission template file is available to registered users only. Please log in to access the file.

Download BRCA validation script (not available).

Prediction submission template: The prediction submission file is a tab-delimited text file. Each row beyond the header row should include the following columns:

  • Column 1: Gene - BRCA1 or BRCA2
  • Column 2: DNA Variant - The DNA variant as listed in the prediction dataset file
  • Column 3: Protein Variant - The amino acid substitution where appropriate
  • N.B. In some cases one of Columns 2 or 3 contains only "*". Columns 1 – 3 must be non-empty and in the order provided in the template.
  • Columns 4, 6, 8, and 10: The probability (P) that a variant has been given the designation named in the header row.
  • Columns 5, 7, 9, and 11: The confidence of each prediction probability in the form of a standard deviation (SD). A large SD indicates low confidence; a small SD indicates that the predictor is confident in the submitted prediction.

Here is a summary of the column designations.

  Gene     DNA Variant   Variant   Deleterious   Benign      VFP         VUS
  1        2             3         4      5      6     7     8     9     10    11
  BRCA#    ##A->B        X##Y      P      SD     P     SD    P     SD    P     SD
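As a rough illustration of the layout summarized above, the following sketch builds one tab-delimited data row. The function name `format_row`, the three-decimal formatting, and the example variant are assumptions made for this sketch; they are not part of the official template. Designations without a prediction are written as "*" so that no cell is left empty.

```python
# Order of the four designation column pairs (P, SD), as in the summary.
CLASSES = ("Deleterious", "Benign", "VFP", "VUS")


def format_row(gene, dna_variant, protein_variant, predictions):
    """Build one 11-column submission row.

    `predictions` maps a designation name to a (P, SD) tuple; any
    designation not present is written as "*" in both of its cells.
    """
    cells = [gene, dna_variant, protein_variant]
    for name in CLASSES:
        if name in predictions:
            p, sd = predictions[name]
            cells += [f"{p:.3f}", f"{sd:.3f}"]  # decimal precision is an assumption
        else:
            cells += ["*", "*"]
    return "\t".join(cells)


# Hypothetical variant, with a prediction for the main challenge only.
row = format_row("BRCA1", "123A->G", "*", {"Deleterious": (0.8, 0.1)})
```

Here `row` contains the three identifying columns, the deleterious P and SD, and six "*" placeholders for the optional sub-challenge.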

In the template file, cells in columns 4-11 are marked with "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for a variant, leave the symbol "*" in these cells. Please make sure you follow the submission guidelines strictly.

In addition, your submission should include a detailed description of the method used to make the predictions (similar in style to the Methods section of a scientific article). This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.
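The official validation script is not reproduced here. The sketch below is a hypothetical stand-in that checks only the rules stated on this page: eleven tab-delimited columns, no empty cells, BRCA1 or BRCA2 in column 1, and either "*" or a probability between 0 and 1 in the P columns. The requirement that SD be non-negative, along with the function and constant names, is an assumption of this sketch.

```python
N_COLUMNS = 11
GENES = ("BRCA1", "BRCA2")


def validate_submission(text):
    """Return a list of error messages; an empty list means all checks passed."""
    errors = []
    lines = text.splitlines()
    for lineno, line in enumerate(lines[1:], start=2):  # skip the header row
        cells = line.split("\t")
        if len(cells) != N_COLUMNS:
            errors.append(f"line {lineno}: expected {N_COLUMNS} columns, found {len(cells)}")
            continue
        if any(cell.strip() == "" for cell in cells):
            errors.append(f"line {lineno}: empty cell (use '*' instead)")
            continue
        if cells[0] not in GENES:
            errors.append(f"line {lineno}: gene must be BRCA1 or BRCA2")
        for col in range(3, N_COLUMNS):  # 0-based indices of columns 4-11
            value = cells[col]
            if value == "*":
                continue  # placeholder: no prediction submitted
            try:
                number = float(value)
            except ValueError:
                errors.append(f"line {lineno}, column {col + 1}: not a number")
                continue
            is_probability = col % 2 == 1  # columns 4, 6, 8, 10 hold P
            if is_probability and not 0.0 <= number <= 1.0:
                errors.append(f"line {lineno}, column {col + 1}: P outside [0, 1]")
            if not is_probability and not number >= 0.0:
                errors.append(f"line {lineno}, column {col + 1}: SD is negative")
    return errors
```

A predictor could run this over the file contents before uploading, for example `validate_submission(open(path).read())`, and fix any reported lines. It does not check that the variants match the template, which the real script may well do.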


Deadline: 31 October 2012. On 1 November 2012, these variants and their assignments will be released by ClinVar.

Data Provider

Robert Nussbaum, University of California, San Francisco


This challenge is being assessed by Robert Nussbaum, University of California, San Francisco.