Identify the impact of genetic variants on exon splicing (Vex-seq)
Dataset description: public
Dataset availability: registered users only
Last updated: 24 April 2018
This challenge will tentatively close at 8:00 PM PST (Pacific Standard Time) on 2 May 2018.
Download answer key and predictions: accessible to registered users only; use is limited by the CAGI Data Use Agreement. Please log in to access the file.
Presentations from the CAGI 5 conference: accessible to registered users only; use is limited by the CAGI Data Use Agreement. Please log in to access the file.
Reference PSI values for the test set: this dataset file is available only to registered users. Please log in to access the file.
A barcoding approach, Variant exon sequencing (Vex-seq), was applied to assess the effect of 2,059 natural single nucleotide variants and short indels on splicing of a globin mini-gene construct transfected into HepG2 cells. The results are expressed as ΔΨ (delta PSI, where PSI is Percent Spliced In), the difference between the variant Ψ and the reference Ψ. The challenge is to predict ΔΨ for each variant.
Variants that impact splicing regulatory elements and so alter splicing outcomes are difficult to detect using conventional poly(A)+ RNA-seq alone because the variants are often spliced out of the mature mRNA. A number of studies have aimed to address this issue. One approach has been to decipher the “splicing code” using computational techniques (Xiong et al., 2015; Xiong et al., 2011; Leung et al., 2014). While these studies have yielded useful knowledge about splicing and do have predictive power, experimental confirmation has been limited and it is clear that the methods are far from perfect. Other groups have used random sequences to understand the splicing code; however, it is hard to integrate datasets with contextual transcriptome information (e.g., CLIP) when studying the splicing behavior of random sequences (Rosenberg et al., 2015). A more recent study tested a number of exonic disease-associated variants in parallel using a mini-gene system (Soemedi et al., 2017). The approach was to observe the allelic ratio of reference to variant in a plasmid pool and compare it with the ratios observed from splicing outcomes. This approach is useful for studying exonic variants but cannot test intronic variants.
The Graveley laboratory has developed the Variant exon sequencing (Vex-seq) methodology, a barcoding approach that is capable of testing the splicing impact of many exonic and flanking intronic variants for the same exon simultaneously. This high-throughput reporter system was applied to determine the impact of genomic variants on pre-mRNA splicing (Adamson et al., 2017). Briefly, a library was developed to determine the splicing of test exons flanked by two common constitutive exons. The library was introduced into tissue culture cells, followed by RT-PCR and sequencing to determine the splicing frequency of each test exon. The reporters also contain a barcode sequence that serves as an identifier of which exon was present in each sample (Adamson et al., 2017).
Vex-seq was applied to 2,059 variants in parallel in a high-throughput manner. The variants in the study include single nucleotide variants and short indels derived from the ExAC database (Lek et al., 2016). ExAC variants in cassette exons between 68 and 97 nucleotides long and consistent with the experimental approach were arbitrarily selected. These variants were studied in the context of a globin mini-gene construct transfected into HepG2 cells. The products were then sequenced on an Illumina MiSeq. The reads were aligned using STAR to a reference specific to each variant (Dobin et al., 2013). Percent spliced in (PSI or Ψ) and change in PSI (ΔΨ) from the reference sequence were then calculated.
ΔΨ values are calculated from the output of the amplicon sequencing, using the following formula:
ΔΨ = Ψvariant − Ψreference
That is, ΔΨ is the difference between the variant splicing Ψ and the reference splicing Ψ. For example, if ΔΨ is calculated from a reference exon that is always spliced in (Ψreference = 100) and a variant exon that is spliced in for only half of the transcripts observed for that variant (Ψvariant = 50), then ΔΨ is -50. ΔΨ is bounded by -100 and 100.
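The worked example above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the challenge materials: the function name `delta_psi` and its range check are assumptions made for illustration, with both Ψ values taken as percentages between 0 and 100.

```python
def delta_psi(psi_variant: float, psi_reference: float) -> float:
    """Return delta PSI = PSI(variant) - PSI(reference), both in percent (0-100).

    Hypothetical helper for illustration; not taken from the Vex-seq pipeline.
    """
    for name, value in (("psi_variant", psi_variant),
                        ("psi_reference", psi_reference)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must lie in [0, 100], got {value}")
    return psi_variant - psi_reference


# Example from the text: reference always spliced in, variant spliced in half the time.
print(delta_psi(psi_variant=50.0, psi_reference=100.0))  # -50.0
```

Because each Ψ lies between 0 and 100, the difference necessarily falls between -100 and 100, which matches the stated bounds.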
The variants used in the analysis were selected based on the first annotation output by VEP. For more detailed information regarding Vex-seq, please see Adamson et al. (2017).
The Vex-seq assay can assess only exon skipping versus inclusion. The PSI calculations are based on included exons / (skipped exons + included exons) for each variant. Due to the design of the assay, unspliced events were not considered. Similarly, other spliced species (e.g., alternative 3’ or 5’ splice sites) were not considered.
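The PSI ratio described above can be sketched as follows. The function name `percent_spliced_in`, the read counts, and the scaling to a 0-100 percentage are assumptions made for illustration (the scaling follows the Ψreference = 100 example in this description); the actual Vex-seq counting pipeline is described in Adamson et al. (2017) and is not reproduced here.

```python
def percent_spliced_in(included_reads: int, skipped_reads: int) -> float:
    """Return PSI as a percentage: included / (skipped + included) * 100.

    Only exon-included and exon-skipped reads are counted; unspliced reads and
    other spliced species are ignored, as in the assay description.
    """
    if included_reads < 0 or skipped_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = included_reads + skipped_reads
    if total == 0:
        raise ValueError("PSI is undefined when no reads were observed")
    return 100.0 * included_reads / total


# Hypothetical counts for one barcoded variant: 30 inclusion reads, 10 skipping reads.
print(percent_spliced_in(included_reads=30, skipped_reads=10))  # 75.0
```

Raising an error for zero total reads is a design choice of this sketch; a real analysis might instead filter out variants with low coverage.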
Note that the Vex-seq challenge differs in subtle but important ways from the MaPSy challenge, in addition to the experimental method differences. Both challenges measure the abundance of the RNA species with the reference test exon perfectly spliced in, but the nature of the exon and the RNA species used for comparison differ. Vex-seq measures only reference splicing of either exon inclusion or exon skipping of known alternative exons, and considers differences in the inclusion rate (unspliced or misspliced species are ignored). By contrast, MaPSy considers reference splicing of reference constitutive exons, comparing these with unspliced RNAs (exon-skipped or misspliced species are ignored). Thus, Vex-seq considers variation that alters regulation of alternative cassette exon inclusion vs skipping, whereas MaPSy considers variants that impede normal constitutive exon splicing.
CAGI participants are asked to predict the outcome of the Vex-seq experiment. This is expressed in terms of ΔΨ (delta PSI, where PSI is Percent Spliced In), the difference between the variant Ψ and the reference Ψ.
Chromosome positions for the variants and exon studied are given based on hg19 coordinates.
ID chromosome variant_position reference variant exon_start exon_end
chr10_114724268_T_C chr10 114724268 T C 114724315 114724383
chr10_114724279_A_G chr10 114724279 A G 114724315 114724383
chr10_114724280_T_G chr10 114724280 T G 114724315 114724383
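The example rows above can be read and annotated with each variant's position relative to its exon. This is a sketch under stated assumptions: the file is assumed to be tab-delimited, `exon_start` and `exon_end` are assumed to be inclusive coordinates on the same scale as `variant_position`, and the labels `before_exon` and `after_exon` refer to genomic coordinate order only, since strand is not given in these columns. The function names are hypothetical.

```python
import csv
import io

# Three example rows copied from the description; tab separation is an assumption.
EXAMPLE_TSV = (
    "ID\tchromosome\tvariant_position\treference\tvariant\texon_start\texon_end\n"
    "chr10_114724268_T_C\tchr10\t114724268\tT\tC\t114724315\t114724383\n"
    "chr10_114724279_A_G\tchr10\t114724279\tA\tG\t114724315\t114724383\n"
    "chr10_114724280_T_G\tchr10\t114724280\tT\tG\t114724315\t114724383\n"
)


def locate_variant(position: int, exon_start: int, exon_end: int):
    """Return (region, distance to nearest exon boundary) in genomic coordinates."""
    if position < exon_start:
        return "before_exon", exon_start - position
    if position > exon_end:
        return "after_exon", position - exon_end
    return "exonic", 0


def annotate(tsv_text: str):
    """Parse the table and attach a region label and distance to every variant."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    annotated = []
    for row in reader:
        region, distance = locate_variant(
            int(row["variant_position"]),
            int(row["exon_start"]),
            int(row["exon_end"]),
        )
        annotated.append({"ID": row["ID"], "region": region, "distance": distance})
    return annotated


for record in annotate(EXAMPLE_TSV):
    print(record["ID"], record["region"], record["distance"])
```

All three example variants lie in the flanking intron, 47, 36, and 35 nucleotides before the exon start. For indels, this sketch uses only the single reported position and does not account for the length of the altered sequence.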
Training set
As a training set, we provide a set of 958 variants that were tested by Vex-seq. For every variant in the training dataset we provide the following:
Testing set
The test set includes all 1,099 variants. The file format is as follows:
Prediction submission format
Given a variant ID, predict the ΔΨ value in HepG2 cells. The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions. Each data row in the submitted file must include the following columns:
In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. For a given subset, you must submit predictions and standard deviations for all or none of the variants; if you are not confident in a prediction for a variant, enter an appropriately large standard deviation for the prediction. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in those cells. Please make sure you follow the submission guidelines strictly.
In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.
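Before running the official validation script, a quick local check of each data row can catch common mistakes. The sketch below is not the organizers' script and does not replace it. It assumes a column order of variant ID, ΔΨ prediction, standard deviation, and comment, which is inferred from the text and not confirmed by it; the function name `check_row` and the example values are hypothetical.

```python
def check_row(fields):
    """Return a list of problems for one data row; an empty list means it passed.

    Assumed layout (not confirmed by the challenge text):
    column 1 = variant ID, 2 = delta PSI prediction, 3 = standard deviation,
    4 = comment, where "*" means no comment was entered.
    """
    if len(fields) != 4:
        return [f"expected 4 tab-separated cells, found {len(fields)}"]
    problems = []
    if any(cell.strip() == "" for cell in fields):
        problems.append("empty cell")
    _variant_id, prediction, std_dev, _comment = fields
    try:
        value = float(prediction)
        if not -100.0 <= value <= 100.0:
            problems.append("prediction outside [-100, 100]")
    except ValueError:
        problems.append("prediction is not a number (was the '*' left in place?)")
    try:
        if float(std_dev) < 0.0:
            problems.append("standard deviation is negative")
    except ValueError:
        problems.append("standard deviation is not a number")
    return problems


# Hypothetical submission rows, split on tabs as in a tab-delimited file.
good = "chr10_114724268_T_C\t-12.5\t8.0\t*".split("\t")
bad = "chr10_114724279_A_G\t*\t8.0\t*".split("\t")
print(check_row(good))  # []
print(check_row(bad))
```

A full check would also confirm that every variant ID from the template appears exactly once, since predictions must be submitted for all or none of the variants in a subset.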
To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.
Download training dataset: This dataset file is available only to registered users. Please log in to access the file.
Download test dataset: This dataset file is available only to registered users. Please log in to access the file.
Download submission template: This submission template file is available only to registered users. Please log in to access the file.
Download validation script: This submission validation script is available only to registered users. Please log in to access the file.
Dataset provided by
Scott I Adamson, Lijun Zhan, Brenton R Graveley, Department of Genetics and Genome Sciences, Institute for Systems Genomics, UConn Health, Farmington.
Adamson SI, Zhan L, Graveley BR. 2017. doi:10.1101/191122
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15-21. PMCID:PMC3530905. doi:10.1093/bioinformatics/bts635
Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Rose SA, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won HH, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Donnelly S, Elosua R, Florez JC, Gabriel SB, Getz G, Glatt SJ, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf JM, Sklar P, Sullivan PF, Tuomilehto J, Tsuang MT, Watkins HC, Wilson JG, Daly MJ, MacArthur DG, Exome Aggregation Consortium. 2016. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536:285-291. PMCID:PMC5018207. doi:10.1038/nature19057
Leung MK, Xiong HY, Lee LJ, Frey BJ. 2014. Deep learning of the tissue-regulated splicing code. Bioinformatics 30:i121-129. PMCID:PMC4058935. doi:10.1093/bioinformatics/btu277
Rosenberg AB, Patwardhan RP, Shendure J, Seelig G. 2015. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell 163:698-711. doi:10.1016/j.cell.2015.09.054
Soemedi R, Cygan KJ, Rhine CL, Wang J, Bulacan C, Yang J, Bayrak-Toydemir P, McDonald J, Fairbrother WG. 2017. Pathogenic variants that alter protein code often disrupt splicing. Nat Genet 49:848-855. doi:10.1038/ng.3837
Xiong HY, Alipanahi B, Lee LJ, Bretschneider H, Merico D, Yuen RK, Hua Y, Gueroussov S, Najafabadi HS, Hughes TR, Morris Q, Barash Y, Krainer AR, Jojic N, Scherer SW, Blencowe BJ, Frey BJ. 2015. RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science 347:1254806. PMCID:PMC4362528. doi:10.1126/science.1254806
Xiong HY, Barash Y, Frey BJ. 2011. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics 27:2554-2562. doi:10.1093/bioinformatics/btr444
14 Dec 2017 (v01): initial release
16 Jan 2018 (v02): typos fixed
14 Apr 2018 (v03): PSI for test set added
24 Sep 2018 (v04): Dataset availability added