
Mechanisms underlying loci associated with risk of complex trait disease

Dataset description: public

Prediction dataset: public

This challenge closed on 25 April 2013.

Slides from the CAGI conference: registered users only, limited by CAGI Data Use Agreement

      Roland Dunbrack: Assessment (7.9 MB, remixable ppt)

      Steve Mount: Predictor Talk (3.4 MB, remixable ppt)

Predictions (11.6 MB, zip): registered users only, limited by CAGI Data Use Agreement


GWAS data have now provided extensive information on the relationship between genetic variation and complex traits, including the risk of many common diseases. Knowledge of these associations opens the way to identifying the molecular mechanism underlying each locus, an essential step in understanding the overall disease mechanism. The goal of this challenge is to explore current understanding of these molecular mechanisms, drawing on community expertise to assign, as far as possible, a mechanism to each association locus. The correct mechanisms are unknown, so predictions in this challenge will not be ranked by accuracy. Rather, the goal is to ascertain which mechanisms appear most relevant, how confidently they can be assigned, and what fraction of loci can currently be assigned plausible mechanisms.

Similar challenges were included in the 2010 pilot CAGI and in CAGI 2011. Participation was highest for the interpretation of missense SNPs, and the results show that although the assignments made by different methods vary considerably, there is also an encouraging degree of consensus. The two expression methods used in 2011 employed complementary approaches, whose results show moderate overlap. There were also a limited number of predictions for splicing and microRNA involvement. Results from 2011 can be seen under riskSNPs/view results summary.

Prediction challenge

For each locus associated with risk of one of seven complex trait diseases, assign possible mechanisms of SNP action. Mechanisms include effects from missense substitutions, altered expression level, microRNA binding and splicing.

Dataset Information

The challenge data consist of sets of SNPs that are candidates for involvement in the mechanism underlying each locus known to be associated with one or more of seven common diseases. For the 2012 challenge, the set of loci has been updated to include all those currently in the NIH GWAS catalog for the seven diseases, and a modified procedure has been used to identify the candidate SNPs. The candidate SNP pool has also grown: it now includes 1000 Genomes data and the implied LD relationships, extending the previous, much more limited HapMap set.

Numbers are as follows:

Disease                    # Loci   # Candidate SNPs   # Loci with at least 1 msSNP
Bipolar Disorder               63               9843                             16
Coronary Artery Disease        33               7080                             17
Crohn's Disease                73              17630                             31
Hypertension                    5                490                              3
Rheumatoid Arthritis           31              27377                             12
Type I Diabetes                38              42820                             19
Type II Diabetes               29               5237                             14
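
As a quick orientation to these numbers, the short Python sketch below totals the table and reports, per disease, the share of loci that contain at least one msSNP. The counts are copied from the table; the variable names are illustrative only and are not part of the challenge materials.

```python
# (disease, loci, candidate SNPs, loci with at least 1 msSNP), copied from the table
COUNTS = [
    ("Bipolar Disorder", 63, 9843, 16),
    ("Coronary Artery Disease", 33, 7080, 17),
    ("Crohn's Disease", 73, 17630, 31),
    ("Hypertension", 5, 490, 3),
    ("Rheumatoid Arthritis", 31, 27377, 12),
    ("Type I Diabetes", 38, 42820, 19),
    ("Type II Diabetes", 29, 5237, 14),
]

# Column totals across the seven diseases
total_loci = sum(row[1] for row in COUNTS)
total_snps = sum(row[2] for row in COUNTS)
total_ms_loci = sum(row[3] for row in COUNTS)

# Share of loci per disease that have at least one msSNP candidate
for name, loci, snps, ms_loci in COUNTS:
    print(f"{name:<26}{ms_loci / loci:6.1%}")

print(f"{'All diseases':<26}{total_ms_loci / total_loci:6.1%}")
print(f"Total loci: {total_loci}, total candidate SNPs: {total_snps}")
```

The share of loci without any msSNP gives a rough sense of how many loci can only be explained through the other mechanisms (expression, splicing, microRNA and so on).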

Download Dataset and Candidate SNP Template files: here

Download validation script (not available).

Prediction submission format 

A set of downloadable template flat files is provided, one per disease. For each locus, there is a list of candidate SNPs (marker SNPs and those in LD with one or more markers). For each SNP, there are up to seven template fields to be completed:

1. Mechanism code

(Integer representing the predicted mechanism. Codes are 1: predicted missense effect, 2: direct expression effect, 3: direct splicing effect, 4: microRNA, 5: message structure, 6: any other mechanism. When using code 6, please make sure the method description makes the proposed mechanism clear.)

2. Method code 

(Integer identifying the relevant method description; a group may, for example, have two different methods for missense predictions)

3. Probability the specified mechanism applies 

(Real number from 0.0 for certain it does not apply to 1.0 for certain it does apply)

4. Standard deviation of the probability in field (3)

(Real number between 0.0 and 1.0, if available)

5. Predicted severity of impact on molecular function 

(Real number from 0.0 for no impact to 1.0 for complete loss of function, if available)

6. Standard deviation of the impact severity in field (5)

(Real number between 0.0 and 1.0, if available)

7. Link to more detailed information 

(if available)


Please avoid ‘:’ in response fields, as this is the field delimiter. Every record in which at least one template "*" has been replaced with data will be processed. No empty cells are allowed in the submission; if you cannot submit a prediction for a variant, leave the symbol "*" in those cells.
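
The field rules above can be checked before values are written into a template. The following Python sketch is not the official validation script: it is a minimal illustration under the assumption that the seven response values for one SNP are available as strings in the order listed above. The function name `check_prediction_fields` and the error messages are hypothetical.

```python
from typing import List, Optional

PLACEHOLDER = "*"                   # template symbol meaning "no value supplied"
DELIMITER = ":"                     # field delimiter, so not allowed inside a value
MECHANISM_CODES = {1, 2, 3, 4, 5, 6}


def _as_int(value: str) -> Optional[int]:
    """Return value as an int, or None if it is not an integer string."""
    try:
        return int(value)
    except ValueError:
        return None


def _check_unit_interval(value: str, name: str, errors: List[str]) -> None:
    """Record an error unless value is '*' or a real number in 0.0-1.0."""
    if value == PLACEHOLDER:
        return
    try:
        number = float(value)
    except ValueError:
        errors.append(f"{name}: not a number ({value!r})")
        return
    if not 0.0 <= number <= 1.0:
        errors.append(f"{name}: {number} is outside 0.0-1.0")


def check_prediction_fields(fields: List[str]) -> List[str]:
    """Return a list of problems found in the seven response fields (hypothetical helper)."""
    if len(fields) != 7:
        return [f"expected 7 fields, got {len(fields)}"]
    errors: List[str] = []
    for position, value in enumerate(fields, start=1):
        if value.strip() == "":
            errors.append(f"field {position}: empty cell, use '*' instead")
        if DELIMITER in value:
            errors.append(f"field {position}: contains the delimiter ':'")
    mechanism, method = fields[0], fields[1]
    if mechanism != PLACEHOLDER and _as_int(mechanism) not in MECHANISM_CODES:
        errors.append(f"mechanism code: must be an integer 1-6 ({mechanism!r})")
    if method != PLACEHOLDER and _as_int(method) is None:
        errors.append(f"method code: must be an integer ({method!r})")
    names = ["probability", "probability SD", "severity", "severity SD"]
    for name, value in zip(names, fields[2:6]):
        _check_unit_interval(value, name, errors)
    return errors


# Example: a missense prediction from method 1, with no severity SD and no link
print(check_prediction_fields(["1", "1", "0.8", "0.1", "0.5", "*", "*"]))
```

Because ‘:’ is the delimiter, the sketch also flags a link in field (7) that includes a scheme such as `http://`; how links should be written to satisfy that rule is not specified here.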

For those with scripts from previous CAGIs: please note that the old (and unnecessary) first field, which indicated whether or not a prediction is made for a SNP, has been removed.

In addition to the completed template files, your submission must include a separate file with a detailed description of the method(s) used to make the predictions (similar in style to the Methods section of a scientific article), including the method codes used in field (2).

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link: "All submission forms" from the front page of your group. For more details, please read the FAQ page. Please use the submission file template provided for your submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

Data Providers

Lipika R. Pal and John Moult, IBBR, University of Maryland


This challenge is being assessed by John Moult, IBBR, University of Maryland