Identify the splicing impact of disease-causing mutations in TP53
Dataset description: public
Exome sequence data: registered users only, limited by CAGI Data Use Agreement
This challenge closed on 25 April 2013.
Assessor summary (688 KB, zip): registered users only, limited by CAGI Data Use Agreement
Slides from the CAGI conference: registered users only, limited by CAGI Data Use Agreement
Jeremy Sanford: Assessment (6.6 MB, remixable ppt)
Sean Mooney: Predictor Talk (460 KB, remixable ppt)
Adam Frankish: Predictor Talk (328 KB, remixable ppt)
Accurate precursor mRNA (pre-mRNA) splicing is required for the expression of protein-coding genes from the human genome. In this process, intervening sequences (introns) are removed from pre-mRNA and coding/regulatory sequences (exons) are ligated together, generating a mature mRNA. Exonic sequences are densely packed with regulatory elements such as splicing enhancers and splicing silencers [2,3].
The function of exonic splicing regulatory elements can be undermined by DNA sequence variation and in some cases can contribute to pathogenesis. Thousands of disease-causing mutations disrupt exonic splicing regulatory elements [4,5]. These data suggest that >25 percent of missense mutations may impact pre-mRNA splicing rather than mRNA translation [4,5].
Predicting the functional impact of DNA sequence variants is an important problem. The solution, at least in part, may be found through a mechanistic understanding of how exon sequences contribute to pre-mRNA splicing. Although their mechanism is not fully understood, exonic splicing enhancers are widely believed to promote exon inclusion through a process called exon definition [6,7]. Exonic splicing enhancers are also implicated in exon pairing and in regulating the inclusion of adjacent exons [8-13].
Using minigene constructs derived from a fragment of the TP53 gene, we have experimentally determined whether each mutation influences splicing fidelity in HEK293T cells. We hope that CAGI participants will be able to predict the outcome of our experiments. A long-term goal is the computational prioritization of disease-causing mutations prior to experimental validation. This contribution is expected to have a major impact on understanding the pathogenic basis of disease-causing mutations.
The structure of the TP53 gene is given below, with the translated exons in the blue numbered boxes (from http://p53.free.fr/p53_info/p53_gene.html)
With the provided data determine
Dataset: The dataset file is available to registered users only; please log in to access it.
Prediction submission format
A tab-delimited text file for prediction submissions is provided. There are two header rows: the first designates what is being predicted; the second designates probabilities (P) and standard deviations (SD) for each mutation. Following the two header rows are three rows corresponding to the three mutations presented above, as listed in the first column.
The second column, labeled "Splicing Altered?", contains cells for the probability P (0-1) that splicing is altered for each mutation. The next column holds the standard deviation (SD), which expresses the confidence of the prediction for that mutation. A high SD means low confidence, while a small SD means that the predictor is confident about the submitted prediction.
After the Splicing Altered columns, the next 22 columns correspond to exons 1 through 11 in order, two columns per exon: the probability P (0-1) that splicing of that particular exon is altered by the mutation, followed by the standard deviation (SD) expressing the confidence of that prediction.
Here is a summary of the column designations in the template.
Mutation    Splicing Altered?    In Exon1?       ...
            P        SD          P       SD      P       SD
1           *        *           *       *       *       *
2           *        *           *       *       *       *
3           *        *           *       *       *       *
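As an illustration of the layout described above, the following Python sketch builds a blank template with the same shape: two header rows, three mutation rows, and 24 prediction columns (P and SD for "Splicing Altered?" and for each of exons 1 to 11). The labels for exons 2 to 11 and the empty header cell above each SD column are assumptions based on the summary, and the function names are hypothetical; the template file supplied with the challenge remains the authoritative format.

```python
import csv
import io

N_EXONS = 11       # exons 1..11, as stated in the format description
N_MUTATIONS = 3    # three mutations in this challenge


def build_template(n_mutations=N_MUTATIONS, n_exons=N_EXONS):
    """Return the blank template as a list of rows (lists of strings)."""
    # One label per predicted quantity. "In Exon2?" .. "In Exon11?" are
    # assumed by analogy with the "In Exon1?" label shown in the summary.
    labels = ["Splicing Altered?"] + [f"In Exon{i}?" for i in range(1, n_exons + 1)]

    # Header row 1: each label sits above its P column; the cell above the
    # SD column is left empty (an assumption about the real template).
    header1 = ["Mutation"]
    for label in labels:
        header1 += [label, ""]

    # Header row 2: alternating P / SD under every label.
    header2 = [""] + ["P", "SD"] * len(labels)

    # Data rows: mutation number followed by "*" placeholders.
    rows = [[str(m)] + ["*"] * (2 * len(labels)) for m in range(1, n_mutations + 1)]
    return [header1, header2] + rows


def to_tsv(rows):
    """Serialize rows as tab-delimited text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Calling `to_tsv(build_template())` yields five tab-delimited lines of 25 cells each, which can be compared against the official template before any values are filled in.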
In the template file, all blank cells are marked with an "*". Submit your predictions by replacing each "*" with your value. No empty cells are allowed in the submission; if you cannot submit a prediction for a variant, please leave the "*" in those cells. Please follow these submission guidelines strictly.
In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.
Download Validation script (not available).
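For readers without access to the validation script, the rules stated above can be approximated with a short check. The Python sketch below is not the official script; it is a minimal illustration that assumes 25 tab-delimited columns (mutation identifier plus 12 P/SD pairs), two header rows, P values in the range 0 to 1, non-negative SD values, and "*" for withheld predictions. The function name and the wording of the messages are hypothetical.

```python
import csv
import io
import math

# Mutation id + (P, SD) for "Splicing Altered?" and for each of 11 exons.
EXPECTED_COLUMNS = 1 + 2 * (1 + 11)


def validate_submission(text, n_header_rows=2):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    for line_no, row in enumerate(rows[n_header_rows:], start=n_header_rows + 1):
        if len(row) != EXPECTED_COLUMNS:
            problems.append(
                f"line {line_no}: expected {EXPECTED_COLUMNS} columns, found {len(row)}"
            )
            continue
        # Column 1 is the mutation id; columns 2, 4, ... hold P and 3, 5, ... hold SD.
        for col, cell in enumerate(row[1:], start=2):
            cell = cell.strip()
            if cell == "*":
                continue  # prediction withheld, allowed by the guidelines
            if cell == "":
                problems.append(f"line {line_no}, column {col}: empty cell")
                continue
            try:
                value = float(cell)
            except ValueError:
                problems.append(f"line {line_no}, column {col}: not a number: {cell!r}")
                continue
            if not math.isfinite(value):
                problems.append(f"line {line_no}, column {col}: value is not finite")
            elif col % 2 == 0 and not 0.0 <= value <= 1.0:
                problems.append(f"line {line_no}, column {col}: P must be between 0 and 1")
            elif col % 2 == 1 and value < 0.0:
                problems.append(f"line {line_no}, column {col}: SD must not be negative")
    return problems
```

A submission would be read from disk and passed in as text, for example `validate_submission(open("predictions.txt").read())`, where the file name is a placeholder. Header rows are skipped rather than checked, since their exact content is defined by the provided template.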
Dataset provided by
Linnea Jannson, Tim Sterne-Weiler, and Jeremy Sanford.
This challenge is being assessed by Jeremy Sanford, University of California, Santa Cruz.