Predict common disease phenotypes from individuals’ genotypes

Challenge: PRS

Genotype data: registered users only

Last updated: 13 August 2021

This challenge is open. This challenge will close on 30 September 2021.

How to participate in CAGI6?

Download data & submit predictions on Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy


Polygenic risk scores (PRS) have potential clinical utility for risk surveillance, prevention and personalized medicine. Participants will be provided with datasets of four real phenotypes (Type 2 Diabetes, Breast Cancer, Inflammatory Bowel Disease and Coronary Artery Disease) and of thirty simulated phenotypes representing a range of genetic architectures of common polygenic diseases. The challenge is to predict the disease outcomes of individuals in held-out validation cohorts.


Genetically complex traits, including common diseases, are highly polygenic, with numerous alleles of small effect contributing to genetic risk. A proportion of this risk can be predicted by statistical methods that rely on results of genome-wide association studies (GWAS). Having started in agriculture, polygenic prediction found its way to human genetics, with advocates arguing that polygenic risk prediction has the potential to inform therapeutic interventions. Currently, the most popular methods leverage GWAS summary statistics shared by many studies in the form of polygenic risk scores (PRS); see Torkamani et al. (2018). Some methods use smaller individual-level datasets to learn values of key parameters, and some incorporate functional data or fine-mapping techniques.

This challenge will assess the performance of PRS algorithms using data on four phenotypes (Type 2 Diabetes, Breast Cancer, Inflammatory Bowel Disease and Coronary Artery Disease) that represent disease areas that might benefit from PRS because screening or early intervention options are available. We will also assess algorithms on a range of simulated data to gain insight into how algorithms perform in various parameter regimes. The datasets were previously described in Chun et al. (2020).

Unfortunately, due to privacy issues, we are not able to openly share testing data with CAGI participants. Therefore, this challenge requests runnable code that can use the specified file formats. We will provide publicly available summary statistics to train the methods but will also accept submissions that use alternative public or proprietary training datasets. For methods that require parameter estimation using smaller individual-level data, we will provide this option using hold-out datasets (not to be openly shared) if participants submit the code for parameter training.

Arguably, the most serious concern with practical applications of PRS is transferability between human populations. Differences in demographic histories result in differences in allele frequencies and patterns of linkage disequilibrium that greatly complicate applications of PRS across populations of different ancestry. Although there are promising developments in this area, this challenge will not be able to address this critical issue due to an insufficient number of samples with non-European ancestry in the MGB Biobank (Karlson et al., 2016) used for the challenge. Due to this limitation, we restrict this challenge to European individuals. We regret this limitation and hope to address it in future CAGI challenges.

Prediction challenge

Participants will be expected to provide a fully trained prediction model that generates predictions for some or all of the four phenotypes and (optionally) for the released simulated phenotypes. Alternatively, participants may submit code that first trains on an individual-level dataset and can then be applied to predict phenotypes of individuals in the testing datasets. All code will be expected to run without internet access. Acceptable submissions are restricted to Python, R, or Matlab code and Docker/Singularity images.

Datasets are provided for the following four real phenotypes: Type 2 Diabetes (T2D), Breast Cancer (BC), Inflammatory Bowel Disease (IBD) and Coronary Artery Disease (CAD). We provide summary statistics from large published GWAS meta-analyses (Scott et al., 2017; Michailidou et al., 2017; Liu et al., 2015; Nelson et al., 2017). Individual-level training cohorts derived from UK Biobank (Bycroft et al., 2018) are available for parameter training if needed. We will assess the prediction accuracy of PRS in the Caucasian subset of MGB Biobank. We provide the marker information of the training and validation data but will not share any individual-level genotypes.

Simulated datasets were generated under spike-and-slab models representing a range of genetic architectures of common polygenic diseases. A total of 30 datasets are provided to evaluate the accuracy of prediction algorithms under various parameters: a causal fraction ranging from 0.1% to 1%, dependency of heritability on minor allele frequency, and enrichment of heritability in DNase I hypersensitive sites. Each dataset includes GWAS summary statistics (n = 100,000), a genotype-level case/control training cohort with phenotypes (n = 2,500/2,500) and genotypes of a validation cohort with held-out phenotypes (n = 50,000). The genotypes were simulated from the real LD structure of 5 million common SNPs from the 1000 Genomes Project. The population prevalence of disease was simulated to be 5% for all datasets. The challenge is to predict which individuals in the provided validation cohorts are diseased.
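To illustrate what a spike-and-slab architecture means in practice, the following is a minimal sketch in Python and not the organizers' simulation code. It assumes that each SNP is causal with probability equal to the causal fraction and that causal effects are drawn from a Gaussian whose variance splits a total heritability evenly across the expected number of causal SNPs. The function name `sample_spike_and_slab` and the heritability value of 0.5 are hypothetical; dependency on minor allele frequency and enrichment in DNase I hypersensitive sites are not modeled.

```python
import random


def sample_spike_and_slab(n_snps, causal_fraction, heritability, seed=0):
    """Draw per-SNP effect sizes under a simple spike-and-slab prior.

    With probability 1 - causal_fraction the effect is exactly zero (the spike);
    otherwise it is drawn from a zero-mean Gaussian (the slab).
    """
    rng = random.Random(seed)
    # Assumption: heritability is divided evenly over the expected causal SNPs.
    slab_sd = (heritability / (n_snps * causal_fraction)) ** 0.5
    effects = []
    for _ in range(n_snps):
        if rng.random() < causal_fraction:
            effects.append(rng.gauss(0.0, slab_sd))
        else:
            effects.append(0.0)
    return effects


# Hypothetical parameters: 10,000 SNPs, 1% causal, heritability 0.5.
effects = sample_spike_and_slab(n_snps=10_000, causal_fraction=0.01, heritability=0.5)
n_causal = sum(1 for beta in effects if beta != 0.0)
```

Lowering `causal_fraction` toward 0.001 concentrates the same heritability in fewer, larger effects, which is the kind of regime change the simulated datasets are meant to probe.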

Dataset description

For real phenotypes, we provide the following data: 

  • 1. Published GWAS summary statistics
  • 2. Summary-level SNP information of imputed training data
  • 3. Summary-level SNP information of imputed validation data 
  • 4. Genotype data are in .bgen file format for the training cohort (UK Biobank) and in the PLINK .bed file format for the validation cohort (MGB Biobank). Phenotype data are in a tab-delimited text file with columns of FID, IID, and Outcome. Genotype and phenotype data will not be shared with participants.

We provide a total of 30 simulated datasets. Each dataset contains the following components: 

  • 1. GWAS summary statistics from a cohort of n = 100,000. 
  • 2. Genotypes and phenotypes for case/control training cohort (n = 2,500/2,500) in PLINK .bed format 
  • 3. Unascertained validation cohort (n = 50,000) with held-out phenotypes in PLINK .bed format
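A PLINK .bed file is normally accompanied by a .bim file (marker information) and a .fam file (sample information), and the first two whitespace-separated columns of the .fam file are the family ID (FID) and the individual ID (IID). The sketch below shows one way to collect these identifiers in Python; the function name `read_fam_ids` and the example sample identifiers are hypothetical.

```python
def read_fam_ids(lines):
    """Return a list of (FID, IID) pairs from the lines of a PLINK .fam file."""
    ids = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # .fam columns: FID, IID, father ID, mother ID, sex, phenotype
        ids.append((fields[0], fields[1]))
    return ids


# Two made-up .fam records with the phenotype coded as missing (-9).
example_fam = [
    "FAM1 IND1 0 0 1 -9",
    "FAM2 IND2 0 0 2 -9",
]
sample_ids = read_fam_ids(example_fam)
```

In a real run, the same function can be given an open file handle instead of a list, since it only iterates over lines.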

In all datasets, genetic data are based on the GRCh37 genome assembly.

Submission format

For real phenotypes, the participants may choose (A) to directly submit a fully trained PRS in the form of adjusted per-allele effect sizes or (B) to submit the code that can generate a prediction using individual-level training and validation datasets. 

For the submission option (A), the adjusted per-allele effect sizes should be submitted in a tab-delimited format with the following information: 

  • 1. SNPID: SNP ID from the validation cohort marker information table
  • 2. EffectAllele: effect allele
  • 3. Beta: estimated effect size

For the submission option (B), participants should provide the following along with the code so that it can be run out of the box: 

  • 1. Pre-formatted GWAS summary statistics and any other input data required for the code (e.g., reference LD panel) 
  • 2. Script to run the code on individual-level training datasets
  • 3. Script to tune model parameters based on the results of training runs and make the prediction on the individuals of validation cohorts
  • 4. Script to make the predictions on the individuals from the validation cohorts
  • 5. Clear instructions on software installation and execution
  • 6. (optional) A file containing a brief description of the method, similar to an abstract or a methods section in a publication

Items 2 and 3 are required only if the participants request training runs on our individual-level dataset. The prediction software (item 4) should generate the predictions in a tab-delimited file with the following format:

  • 1. FID: Sample FID from the validation cohort file 
  • 2. IID: Sample IID from the validation cohort file 
  • 3. PRS: Predicted risk score (real-valued numbers with the higher score corresponding to the higher risk)
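The sketch below shows how a prediction file in this layout might be produced, assuming the common additive form of a PRS: the sum over SNPs of the per-allele effect size multiplied by the individual's effect-allele dosage. Submitted methods are free to score differently. The function name `compute_prs`, the sample and SNP identifiers, the dosages and the presence of a header row are all hypothetical.

```python
import csv
import io


def compute_prs(dosages, betas):
    """Additive score: sum of beta * effect-allele dosage over shared SNPs."""
    return sum(betas[snp] * dose for snp, dose in dosages.items() if snp in betas)


# Made-up per-allele effect sizes and effect-allele dosages (0, 1 or 2).
betas = {"rs0000001": 0.5, "rs0000002": -0.25}
samples = {
    ("FAM1", "IND1"): {"rs0000001": 2, "rs0000002": 1},
    ("FAM2", "IND2"): {"rs0000001": 0, "rs0000002": 2},
}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["FID", "IID", "PRS"])  # assumption: header row included
scores = {}
for (fid, iid), dosages in samples.items():
    scores[(fid, iid)] = compute_prs(dosages, betas)
    writer.writerow([fid, iid, scores[(fid, iid)]])
prediction_text = buffer.getvalue()
```

In this toy example the first individual scores higher than the second, so under the stated convention the first is predicted to be at higher risk.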

For simulated datasets, we expect the prediction results for each individual in the provided validation cohort. The prediction should be provided in a tab-delimited file with each row containing the following information: 

  • 1. FID: Sample FID from the validation cohort file
  • 2. IID: Sample IID from the validation cohort file
  • 3. PRS: Predicted risk score

Ethical considerations

As noted above, the data available for testing in this challenge derive from European-ancestry GWAS participants and cover four non-simulated phenotypes only. These features of the challenge directly limit the ability of predictors to design and test the potential transferability of PRS models across different ancestral genetic backgrounds, and possibly also across different polygenic disease risk architectures. While unavoidable in this context, the potential for PRS developed for one population to lead to biased or misleading risk estimates for patients from other groups, and hence exacerbate existing health disparities, is a salient concern. To avoid inappropriate clinical application, predictors and assessors should take care not to overstate the generalizability of a given algorithm’s performance.

Dataset provided by

Sung Chun and Shamil Sunyaev, Harvard Medical School


References

Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature (2018) 562(7726): 203-209.

Chun S, et al. Non-parametric polygenic risk prediction via partitioned GWAS summary statistics. Am J Hum Genet (2020) 107(1): 46-59.

Karlson EW, et al. Building the Partners HealthCare Biobank at Partners Personalized Medicine: Informed consent, return of research results, recruitment lessons and operational considerations. J Pers Med (2016) 6(1): 2.

Liu JZ, et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat Genet (2015) 47(9): 979-986.

Michailidou K, et al. Association analysis identifies 65 new breast cancer risk loci. Nature (2017) 551(7678): 92-94.

Nelson CP, et al. Association analyses based on false discovery rate implicate new loci for coronary artery disease. Nat Genet (2017) 49(9): 1385-1391.

Scott RA, et al. An expanded genome-wide association study of Type 2 Diabetes in Europeans. Diabetes (2017) 66(11): 2888-2902.

Torkamani A, et al. The personal and clinical utility of polygenic risk scores. Nat Rev Genet (2018) 19(9): 581-590.

Revision history 

20 May 2021: initial release

08 June 2021: challenge opens

13 August 2021: submission deadline extended to September 30