CHEK2 variants in breast cancer patients and controls

Challenge: CHEK2
Dataset description: public
Category: Germline cancer
Last updated: 12 Jan 2018
This challenge will tentatively close at 8:00 PM PST (Pacific Standard Time) on TBD.


Summary
Cell-cycle-checkpoint kinase 2 (CHEK2 [OMIM 604373]) is a protein that plays an important role in the maintenance of genome integrity and in the regulation of the G2/M cell cycle checkpoint. CHEK2 has been shown to interact with other proteins involved in DNA repair processes, such as BRCA1 and TP53. These findings make CHEK2 an attractive candidate susceptibility gene for a variety of cancers. The challenge is to classify variants as occurring in breast cancer cases or controls.

Background
Variants in the CHEK2 gene are associated with breast cancer (Tavtigian & Chenevix-Trench, 2014; CHEK2 Breast Cancer Case-Control Consortium, 2004; Easton et al., 2015). The data available in this challenge come from targeted resequencing of CHEK2 in approximately 1000 Latina breast cancer cases and 1000 ancestry-matched controls. Each variant in the list was observed between 1 and 20 times.

Prediction challenge
For this experiment, predictors are asked to estimate the probability that a given variant occurs in a case (cancer) individual. The prediction should be a numeric value P(case) with a standard deviation. Note that in the 2010 experiment with CHEK2 variants, CAGI participants tended not to properly consider the likely distribution of neutral mutations. A probability of 0.5 indicates that the mutation is neutral (equally frequent in both populations), while a probability below 0.5 indicates a variant that is protective. The CHEK2 gene ID is NM_001005735.1 and the protein ID is NP_001005735.1.
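To make the meaning of P(case) concrete, the following minimal sketch estimates it from carrier counts in cases and controls and derives a standard deviation from a Beta posterior. The function name, the uniform Beta(1, 1) prior, and the counts are assumptions for illustration only and are not part of the challenge materials. The sketch also assumes equally sized case and control cohorts, as in this dataset, so that 0.5 corresponds to a neutral variant.

```python
import math

def p_case_estimate(n_case, n_control, prior_a=1.0, prior_b=1.0):
    """Posterior mean and SD of P(case) under a Beta(prior_a, prior_b) prior.

    Assumes equally sized case and control cohorts, so that a value of
    0.5 means the variant is equally frequent in both groups.
    """
    a = prior_a + n_case      # pseudo-count of carriers among cases
    b = prior_b + n_control   # pseudo-count of carriers among controls
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)

# Hypothetical counts: variant seen in 3 cases and 1 control.
mean, sd = p_case_estimate(3, 1)
print(f"P(case) = {mean:.3f}, SD = {sd:.3f}")
```

With so few observations per variant the posterior is wide, which is one reason a prediction for a rarely observed variant would normally carry a large standard deviation.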

Prediction submission format
The prediction submission is a tab-delimited text file. Organizers provide a file template, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions. In the submitted file, each row should include the following columns:

  1. Variant - The variant as listed in the prediction dataset file, in the order provided in the template file, described at the coding DNA level relative to the reference NM_001005735 (e.g., c.C1751T)
  2. P(case) - The probability that an individual with a given variant is in the case set (range: 0-1). A probability of 0.5 indicates that the mutation is neutral (equally frequent in both populations), while a probability below 0.5 indicates a variant that is actually protective.
  3. Standard deviation - This defines the confidence of the prediction in column 2. High SD means low confidence, while small SD means that the predictor is confident about the submitted prediction.
  4. Comments - Optional brief comments based on the predictions.
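The column rules above can be checked mechanically before submission. The sketch below is a hypothetical checker, not the organizers' validation script (which should still be used). It assumes one variant per tab-delimited row, accepts the template's "*" placeholder in columns 2-4, and treats a non-negative finite number as a valid standard deviation.

```python
import math

def check_row(line):
    """Return a list of problems found in one tab-delimited submission row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return [f"expected 4 columns, found {len(fields)}"]
    variant, p_case, sd, comment = fields
    problems = []
    if not variant.strip():
        problems.append("empty Variant cell")
    if not comment.strip():
        problems.append("empty Comments cell (use '*' if there is no comment)")
    # P(case) must lie in [0, 1]; SD only needs to be non-negative and finite.
    for name, value, upper in (("P(case)", p_case, 1.0),
                               ("Standard deviation", sd, math.inf)):
        if value.strip() == "*":
            continue  # placeholder: no prediction submitted for this cell
        try:
            x = float(value)
        except ValueError:
            problems.append(f"{name} is not numeric: {value!r}")
            continue
        if not math.isfinite(x) or x < 0 or x > upper:
            problems.append(f"{name} out of range: {value!r}")
    return problems

print(check_row("c.C1751T\t0.62\t0.10\texample comment"))
print(check_row("c.C1751T\t1.2\t0.10\t*"))
```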

In the template file, cells in columns 2-4 are marked with an "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for a variant, leave the sign "*" in these cells. Please make sure you follow the submission guidelines strictly.
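The placeholder replacement described above could be scripted along the following lines. This is a sketch under assumptions: the function name, the three-decimal formatting, and the variant `c.G100A` are hypothetical, and the real template layout (for example, a possible header row) should be checked against the file provided by the organizers. Rows without a prediction keep their "*" cells, and an empty comment is written as "*" so that no cell is left empty.

```python
def fill_template(template_lines, predictions):
    """Replace "*" placeholders with predictions, preserving template order.

    predictions maps a variant string to (p_case, sd, comment). Variants
    without an entry keep "*" in columns 2-4.
    """
    filled = []
    for line in template_lines:
        fields = line.rstrip("\n").split("\t")
        variant = fields[0]
        if variant in predictions:
            p_case, sd, comment = predictions[variant]
            # Empty comments become "*" because empty cells are not allowed.
            fields[1:4] = [f"{p_case:.3f}", f"{sd:.3f}", comment or "*"]
        filled.append("\t".join(fields))
    return filled

# Hypothetical two-row template; c.G100A is an invented placeholder variant.
template = ["c.C1751T\t*\t*\t*", "c.G100A\t*\t*\t*"]
for row in fill_template(template, {"c.C1751T": (0.62, 0.10, "")}):
    print(row)
```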

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Related Challenges
  1. RAD50 challenge 2012
  2. CHEK2 challenge 2010

Ethical consideration
The data in this challenge are derived from a cohort of individuals who are described as being “Latina”, a group determined by social relationships rather than by biological characteristics. The prevalence of many gene variants differs in populations of different ancestry, and these differences may sometimes result in differences in disease risk. As a result, the social category of “Latina/o” cannot be fully separated from the implications of the ancestral origins of this population, an issue that is particularly complex in Latino populations because they generally consist of individuals with varying degrees of European, indigenous North American, and African ancestry. Population genetic differences have implications for genomic research because a complete understanding of the genetic contributors to health can be achieved only with the inclusion of all human populations. But the goal of inclusion has not yet been achieved. Under-representation of individuals of non-European ancestry persists. The lack of diversity in population genomic datasets presents a challenge to the meaningful interpretation of genomic findings in diverse populations, as well as a challenge to the just translation of genomic results into clinical care.

This CAGI challenge seeks to inform a better understanding of genome variants that may contribute to breast cancer risk through study of a Latina population. However, it is critical for researchers to attend to the potential negative consequences of associating a particular disease variant with a specific population. Such associations must be drawn with care. Few genomic variants associated with disease are found exclusively in one population. Although it may be useful to study variation within a single under-represented population, in order to ensure broader understanding of genetic contributors to a particular disease, researchers should use care not to assume that differences in disease risk among populations are due to a genetic cause.


References
  1. CHEK2 Breast Cancer Case-Control Consortium. 2004. CHEK2*1100delC and susceptibility to breast cancer: A collaborative analysis involving 10,860 breast cancer cases and 9,065 controls from 10 studies. Am J Hum Genet 74:1175-1182. PMCID:PMC1182081. doi:10.1086/421251
  2. Easton DF, Pharoah PD, Antoniou AC, Tischkowitz M, Tavtigian SV, Nathanson KL, Devilee P, Meindl A, Couch FJ, Southey M, Goldgar DE, Evans DG, Chenevix-Trench G, Rahman N, Robson M, Domchek SM, Foulkes WD. 2015. Gene-panel sequencing and the prediction of breast-cancer risk. N Engl J Med 372:2243-2257. PMCID:PMC4610139. doi:10.1056/NEJMsr1501341
  3. Tavtigian SV, Chenevix-Trench G. 2014. Growing recognition of the role for rare missense substitutions in breast cancer susceptibility. Biomark Med 8:589-603. PMCID:PMC4336165. doi:10.2217/bmm.13.143

Dataset provided by

Elad Ziv and Susan Neuhausen, UCSF

Revision history
20 Dec 2017 (v01): initial release
27 Dec 2017 (v01): Protein ID added