Critical Assessment of Genome Interpretation

Molecular mechanisms underlying associations between genetic variation and disease risk

Goal

New GWAS data are revolutionizing our knowledge of the relationship between genetic variation and the risk of complex trait disease: there are now hundreds of new reliable associations between the presence of SNPs at a particular locus and increased risk of a common disease. Association is not mechanism, but knowledge of association opens the way to discovering mechanism.

There are a substantial number of mechanisms by which a SNP or other genetic variant could affect molecular function in vivo and thus contribute to disease risk. Effects on expression, message processing, message stability, protein folding, post-translational modification and transport can alter the rate at which new protein molecules are deployed. The specific activity of a protein may be affected by changes in ligand binding affinity, catalytic rate or allosteric conformational changes. The half-life of a protein species may be affected by increased rates of chemical change, altered ubiquitination rate, or decreased thermodynamic stability. Many of these factors also apply to non-coding RNA molecules, particularly micro-RNAs, and the occurrence of some associations in gene-free regions suggests that these or other unknown factors play a significant role. Some mechanisms can be analyzed by existing methods or relatively simple extensions thereof. Others may require the development of new approaches.

The goal of this experiment is to explore current understanding of the molecular-level mechanisms underlying associations between SNPs and disease risk, incorporating expertise in each of the known mechanism areas, and as far as possible assigning possible mechanisms for each association locus. The correct mechanisms are unknown, so there can be no ranking of accuracy - that is not the point of the experiment. Rather, the goal is to ascertain which mechanisms appear most relevant, how confidently they can be assigned, and what fraction of loci can currently be assigned plausible mechanisms.

In addition to the insights provided by the numerical evaluation, we anticipate a number of softer but potentially more significant outcomes. Among these are answers to the following questions: Which mechanisms require more methods development before a satisfactory analysis can be made? In which areas do we need more data before analysis methods can be properly applied? What is likely to be the role of mechanisms other than those mediated through common SNPs?

A similar dataset was included in the pilot CAGI experiment conducted in the Fall of 2010. The only change in the current set is extension of Crohn’s disease coverage to 69 loci, up from 31. Results from the pilot can be seen on the CAGI 2010 archive [link]. Those preliminary results are already interesting, particularly in the spread of mechanism assignments. With the expected greater number of participants in the 2011 full experiment, we should obtain a clear picture of the current state of the art in assigning possible SNP mechanisms for GWAS data.

Participation

Participation is open to all, and will be particularly solicited from those known to have relevant methods. Participants are required to register through the experiment web site. Participants are invited to submit comments and suggestions through the experiment discussion forum, so the process can be tuned appropriately. Specific queries and concerns can also be sent to [jmoult at umd dot edu].

Data

A list of marker SNPs in known association loci together with all SNPs in significant LD with these will be supplied (typically a total of 200 - 300 candidate SNPs per locus). Markers will be taken from the WTCCC1 seven disease study and published follow-up studies. Approximately 150 high reliability loci will be included.

Summary of proposed data for analysis

For each of the WTCCC1 study diseases, the number of high confidence loci is given, together with the total number of candidate SNPs (markers plus those in significant LD and adequately penetrant), and the number of loci with at least one candidate non-synonymous SNP (nsSNP). Note that the numbers of loci here are lower than those in earlier versions of the table, because of the restriction to high confidence loci.

Disease                   # Loci   # Candidate SNPs   # Loci with at least 1 nsSNP
Bipolar Disorder              22               6655                             11
Coronary Artery Disease       10               2200                              6
Crohn's Disease               69              15375                             49
Hypertension                   2                812                              1
Rheumatoid Arthritis          23              11057                             15
Type I Diabetes               34              17891                             26
Type II Diabetes              20               4434                             12
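As a quick check on the table, the overall totals and the average number of candidate SNPs per locus can be computed directly from the figures listed. The following is only an illustrative sketch: the names SUMMARY, totals and snps_per_locus are invented here and are not part of the experiment's tooling.

```python
# Figures copied from the summary table above:
# disease -> (loci, candidate SNPs, loci with at least one nsSNP)
SUMMARY = {
    "Bipolar Disorder": (22, 6655, 11),
    "Coronary Artery Disease": (10, 2200, 6),
    "Crohn's Disease": (69, 15375, 49),
    "Hypertension": (2, 812, 1),
    "Rheumatoid Arthritis": (23, 11057, 15),
    "Type I Diabetes": (34, 17891, 26),
    "Type II Diabetes": (20, 4434, 12),
}


def totals(summary):
    """Return column totals: (loci, candidate SNPs, loci with an nsSNP)."""
    loci = sum(row[0] for row in summary.values())
    snps = sum(row[1] for row in summary.values())
    ns_loci = sum(row[2] for row in summary.values())
    return loci, snps, ns_loci


def snps_per_locus(summary):
    """Return the mean number of candidate SNPs per locus for each disease."""
    return {disease: snps / loci for disease, (loci, snps, _) in summary.items()}


if __name__ == "__main__":
    print(totals(SUMMARY))
    for disease, mean in snps_per_locus(SUMMARY).items():
        print(f"{disease}: {mean:.1f} candidate SNPs per locus")
```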

Sample data for one WTCCC1 disease locus (Crohn's disease (CD), chromosome 3):

Sample data file available here: sample data. Each line provides the information for one candidate SNP (149 total). Three of these are high confidence markers (rs9827708, rs11718165, rs9858542), and five are non-synonymous (rs13068038, rs34762726, rs2005557, rs3197999, rs34823813). Participants are asked to identify which mechanisms could apply to which SNPs, using their expertise on a specific mechanism or mechanisms.

To view all datasets available for the riskSNPs project, you need to be registered and logged in.

Choose the appropriate option from your left side menu.

Prediction collection and processing:

A set of downloadable flat files will be provided, one per disease. For each locus, a list of candidate SNPs will be provided (marker SNPs and those in LD with one or more markers). For each candidate SNP, template result fields will be provided. Participants will complete these fields. Eight fields are proposed:

Prediction flag. No prediction for that candidate SNP (0, default)/Prediction (1)
Mechanism code (e.g. 1: missense effect, 2: direct expression impact, 3: direct splicing impact…)
Method code (see below)
Probability the specified mechanism applies (0 - certain it does not apply, 1 - certain it does apply)
Uncertainty (standard deviation) in that probability (if available)
Predicted severity of impact on molecular function (0 - no impact, 1 - complete loss of function) (if available)
Uncertainty (standard deviation) in the impact assignment (if available)
Link to more detailed information (if available)

Completed template files will be uploaded to a server. The server will check submitted files for completeness and conformity (i.e. numbers in the correct range, an existing method code, a recognized mechanism code). An accession number will be issued for each accepted file, providing evidence of submission.
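The eight proposed fields and the conformity checks described above might be represented as in the sketch below. This is not the server's actual implementation. The class and function names, the field names, and the rule that a probability is mandatory whenever the prediction flag is 1 are all assumptions made for illustration, and the sets of valid mechanism and method codes are passed in by the caller because the final code lists are not specified here.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Prediction:
    """One template record per candidate SNP (field names are hypothetical)."""
    snp_id: str                             # e.g. "rs3197999"
    flag: int = 0                           # 0 = no prediction (default), 1 = prediction
    mechanism: Optional[int] = None         # mechanism code
    method: Optional[int] = None            # method code
    probability: Optional[float] = None     # 0 = certainly not, 1 = certainly applies
    probability_sd: Optional[float] = None  # uncertainty in the probability
    severity: Optional[float] = None        # 0 = no impact, 1 = complete loss of function
    severity_sd: Optional[float] = None     # uncertainty in the severity
    link: Optional[str] = None              # link to more detailed information


def validate(p: Prediction, mechanisms: Set[int], methods: Set[int]) -> List[str]:
    """Return a list of conformity problems; an empty list means the record passes."""
    if p.flag not in (0, 1):
        return ["prediction flag must be 0 or 1"]
    if p.flag == 0:
        return []  # no prediction made, so nothing further to check
    errors = []
    if p.mechanism not in mechanisms:
        errors.append("unrecognized mechanism code")
    if p.method not in methods:
        errors.append("unknown method code")
    # Assumption: a probability is required once a prediction is flagged.
    if p.probability is None or not 0.0 <= p.probability <= 1.0:
        errors.append("probability must be in the range 0 to 1")
    if p.severity is not None and not 0.0 <= p.severity <= 1.0:
        errors.append("severity must be in the range 0 to 1")
    for name in ("probability_sd", "severity_sd"):
        value = getattr(p, name)
        if value is not None and value < 0:
            errors.append(f"{name} must be non-negative")
    return errors
```

For example, a record with flag 1, a known mechanism and method code, and a probability of 0.8 would pass, while the same record with a probability of 1.5 would be reported as out of range.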

In addition, each participant is required to deposit a detailed methods description corresponding to each numeric methods code. Detailed information of each analysis may also be uploaded as a set of zipped files. Additionally, participants may provide a link to the more detailed information for each prediction.

The field layout and rules will be finalized following consultation with participants.

Evaluation:

Straightforward statistics evaluating the relevance of the different mechanisms, the confidence of assignments, and the fraction of loci to which a mechanism can be assigned will be compiled from the results. Where multiple participants seek to identify the same mechanism (as is likely for missense SNPs, for example), predictions will be compared. Other appropriate numerical evaluation methods may be added in the light of the results.
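One possible form for such statistics is sketched below, under stated assumptions. The input layout (tuples of locus, SNP, mechanism code and probability), the function name summarize, and the 0.5 probability threshold used to count a locus as "assigned" are all hypothetical choices, not part of the evaluation protocol.

```python
from collections import defaultdict


def summarize(predictions, all_loci, threshold=0.5):
    """Summarize mechanism assignments.

    predictions: iterable of (locus, snp_id, mechanism_code, probability).
    all_loci: collection of every locus in the dataset.
    threshold: assumed cut-off above which a mechanism counts as assigned.
    Returns (fraction of loci with an assigned mechanism, per-mechanism stats).
    """
    loci_by_mechanism = defaultdict(set)
    probs_by_mechanism = defaultdict(list)
    assigned_loci = set()
    for locus, _snp_id, mechanism, probability in predictions:
        probs_by_mechanism[mechanism].append(probability)
        if probability >= threshold:
            loci_by_mechanism[mechanism].add(locus)
            assigned_loci.add(locus)
    fraction_assigned = len(assigned_loci) / len(all_loci)
    per_mechanism = {
        mechanism: {
            "loci": len(loci_by_mechanism[mechanism]),
            "mean_probability": sum(probs) / len(probs),
        }
        for mechanism, probs in probs_by_mechanism.items()
    }
    return fraction_assigned, per_mechanism
```

The per-mechanism locus counts speak to relevance, the mean probabilities to confidence, and the first return value to the fraction of loci covered. A different threshold, or a probability-weighted count, would be equally defensible.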

Release of results

All predictions, methods, discussions and statistics will be made available through the experiment web site.

Caveats

This is an experiment - there is a well-defined scientific question, and we seek answers. However, seeking answers is no guarantee of finding them. Nevertheless, it seems likely we will at least learn a great deal.

The analysis is restricted to the role of common SNPs, since those are the data currently available. It is likely that rare single base variants, indels, and copy number variation all also play a significant role. While these effects cannot be included at present, in a short while the necessary data will be available, and we look forward to follow-up experiments.

The analysis is also restricted to the molecular mechanisms by which SNPs influence disease risk. Pathway, subsystem, epistatic, and other higher-level mechanisms are not included. The importance of these components is obvious, but as yet much less is known about them, and so it is not possible to formulate a precise set of addressable questions. But suggestions are strongly encouraged…

Data provided by

Lipika R. Pal and John Moult, IBBR, University of Maryland

Center for Critical Assessment of Genome Interpretation
