
CHEK2 variants in breast cancer patients and controls

Challenge: CHEK2

Dataset description: public

Dataset availability: registered users only

Category: Germline cancer

Last updated: 24 April 2018

This challenge closes at 8:00 PM PST (Pacific Standard Time) on 24 April 2018.

Download answer key and predictions: registered users only, limited by CAGI Data Use Agreement.

Presentations from the CAGI 5 conference: registered users only, limited by CAGI Data Use Agreement.


Cell-cycle-checkpoint kinase 2 (CHEK2 [OMIM 604373]) is a protein that plays an important role in the maintenance of genome integrity and in the regulation of the G2/M cell cycle checkpoint. CHEK2 has been shown to interact with other proteins involved in DNA repair processes, such as BRCA1 and TP53. These findings render CHEK2 an attractive candidate susceptibility gene for a variety of cancers. The challenge is to classify variants as occurring in breast cancer cases or controls.


Variants in the CHEK2 gene are associated with breast cancer (Tavtigian & Chenevix-Trench, 2014; CHEK2 Breast Cancer Case-Control Consortium, 2004; Easton et al., 2015). The data available in this challenge come from targeted resequencing of CHEK2 in approximately 1000 Latina breast cancer cases and 1000 ancestry-matched controls. Each variant in the list was observed between 1 and 20 times.

Prediction challenge

For this experiment, predictors are asked to estimate the probability that a given mutation occurs in a case (cancer) individual. The prediction should be a numeric value P(case) with a standard deviation. Note that in the 2010 experiment with CHEK2 variants, CAGI participants tended not to properly consider the likely distribution of neutral mutations. A probability of 0.5 indicates that the mutation is neutral (equally frequent in both populations), while a probability of less than 0.5 indicates a variant that is protective. The CHEK2 transcript ID is NM_001005735.1 and the protein ID is NP_001005735.1.
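To make the meaning of P(case) and its standard deviation concrete, the following is a minimal sketch of how the two numbers relate to carrier counts when the case and control cohorts are the same size. It is an illustration only, not the organizers' scoring method or a recommended predictor: the function name `p_case_from_counts`, the uniform Beta(1, 1) prior, and the example counts are all assumptions made for this sketch.

```python
import math


def p_case_from_counts(case_carriers, control_carriers, prior_a=1.0, prior_b=1.0):
    """Posterior mean and SD of P(case | carrier) under a Beta prior.

    Assumes equally sized case and control cohorts, so a neutral variant
    is expected to have P(case) near 0.5. The Beta(1, 1) default prior is
    an assumption of this sketch, not part of the challenge.
    """
    a = prior_a + case_carriers
    b = prior_b + control_carriers
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Hypothetical carrier counts (not taken from the challenge dataset).
examples = [
    ("balanced", 5, 5),       # seen equally often in cases and controls
    ("case-enriched", 9, 1),  # seen mostly in cases
    ("singleton", 1, 0),      # seen once, in a case
]

for label, cases, controls in examples:
    mean, sd = p_case_from_counts(cases, controls)
    print(f"{label}: P(case)={mean:.3f} SD={sd:.3f}")
```

In this sketch a balanced variant yields P(case) = 0.5, and a variant seen only once carries a larger standard deviation than one seen ten times, which is one reason rare variants should not be assigned extreme probabilities with high confidence.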

Prediction submission format 

The prediction submission is a tab-delimited text file. Organizers provide a file template, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions. In the submitted file, each row should include the following columns:

In the template file, cells in columns 2-4 are marked with an "*". Submit your predictions by replacing each "*" with your value. No empty cells are allowed in the submission; if you cannot submit a prediction for a variant, leave the "*" in these cells. Please follow the submission guidelines strictly.
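The placeholder rule above can be sketched in a few lines of Python. This is a hedged illustration and not a replacement for the organizers' validation script: the function name `fill_template`, the header labels, the variant identifiers, and the meaning assigned to the three filled values are hypothetical, since the actual column definitions come from the template file.

```python
PLACEHOLDER = "*"


def fill_template(template_text, predictions):
    """Replace the "*" in columns 2-4 with predicted values.

    `predictions` maps the column-1 identifier to three values.
    Rows without a prediction keep their "*" placeholders, and any
    empty cell raises ValueError, since empty cells are not allowed.
    """
    filled = []
    for line in template_text.splitlines():
        if not line:
            continue
        cells = line.split("\t")
        values = predictions.get(cells[0])
        # Only overwrite cells that still hold the placeholder.
        if values is not None and cells[1:4] == [PLACEHOLDER] * 3:
            cells[1:4] = [str(v) for v in values]
        if any(cell.strip() == "" for cell in cells):
            raise ValueError(f"empty cell in row for {cells[0]!r}")
        filled.append("\t".join(cells))
    return "\n".join(filled) + "\n"


# Hypothetical template and values; real column names come from the template.
template = "variant\tcol2\tcol3\tcol4\nvarA\t*\t*\t*\nvarB\t*\t*\t*\n"
predictions = {"varA": (0.62, 0.15, "hypothetical comment")}

result = fill_template(template, predictions)
print(result, end="")
```

Here `varA` receives its three values while `varB`, which has no prediction, keeps the "*" in every cell. The output should still be checked with the provided validation script before submission.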

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Download dataset: This dataset file is available only to registered users.

Download submission template: This submission template file is available only to registered users.

Download validation script: This submission validation script is available only to registered users.

Related Challenges

RAD50 challenge 2012

CHEK2 challenge 2010

Ethical consideration 

The data in this challenge are derived from a cohort of individuals who are described as being “Latina”, a group determined by social relationships rather than by biological characteristics. The prevalence of many gene variants differs in populations of different ancestry, and these differences may sometimes result in differences in disease risk. As a result, the social category of “Latina/o” cannot be fully separated from the implications of the ancestral origins of this population, an issue that is particularly complex in Latino populations because they generally consist of individuals with varying degrees of European, indigenous North American, and African ancestry. Population genetic differences have implications for genomic research because only with the inclusion of all human populations can a complete understanding of the genetic contributors to health be achieved. But the goal of inclusion has not yet been achieved. Under-representation of individuals of non-European ancestry persists. The lack of diversity in population genomic datasets presents a challenge to the meaningful interpretation of genomic findings in diverse populations, as well as a challenge in the just translation of genomic results into clinical care.

This CAGI challenge seeks to inform a better understanding of genome variants that may contribute to breast cancer risk through study of a Latina population. However, it is critical for researchers to attend to the potential negative consequences of associating a particular disease variant with a specific population. Such associations must be drawn with care. Few genomic variants associated with disease are found exclusively in one population. Although it may be useful to study variation within a single under-represented population, in order to ensure broader understanding of genetic contributors to a particular disease, researchers should use care not to assume that differences in disease risk among populations are due to a genetic cause.


CHEK2 Breast Cancer Case-Control Consortium. 2004. CHEK2*1100delC and susceptibility to breast cancer: A collaborative analysis involving 10,860 breast cancer cases and 9,065 controls from 10 studies. Am J Hum Genet 74:1175-1182. PMCID:PMC1182081. doi:10.1086/421251

Easton DF, Pharoah PD, Antoniou AC, Tischkowitz M, Tavtigian SV, Nathanson KL, Devilee P, Meindl A, Couch FJ, Southey M, Goldgar DE, Evans DG, Chenevix-Trench G, Rahman N, Robson M, Domchek SM, Foulkes WD. 2015. Gene-panel sequencing and the prediction of breast-cancer risk. N Engl J Med 372:2243-2257. PMCID:PMC4610139. doi:10.1056/NEJMsr1501341

Tavtigian SV, Chenevix-Trench G. 2014. Growing recognition of the role for rare missense substitutions in breast cancer susceptibility. Biomark Med 8:589-603. PMCID:PMC4336165. doi:10.2217/bmm.13.143

Dataset provided by 

Elad Ziv and Susan Neuhausen, UCSF

Revision history 

20 Dec 2017 (v01): initial release 

27 Dec 2017 (v02): Protein ID added 

24 Sep 2018 (v03): Dataset availability added