Identify the impact of genetic variants on exon splicing (Vex-seq)

Challenge: Vex-seq

Dataset description: public

Dataset availability: registered users only

Category: splicing

Last updated: 24 April 2018

This challenge will tentatively close at 8:00 PM PST (Pacific Standard Time) on 2 May 2018

Download answer key and predictions: accessible to registered users only; use is limited by the CAGI Data Use Agreement. Please log in to access the file.

Presentations from the CAGI 5 conference: accessible to registered users only; use is limited by the CAGI Data Use Agreement. Please log in to access the file.

Note 

Please find the file with the reference PSI value for the test set. This dataset file is available only to registered users. Please log in to access the file.

Summary 

A barcoding approach, Variant exon sequencing (Vex-seq), was applied to assess the effect of 2,059 natural single nucleotide variants and short indels on splicing of a globin mini-gene construct transfected into HepG2 cells. The results are expressed as ΔΨ (delta PSI, where PSI is Percent Spliced In), the difference between the variant Ψ and the reference Ψ. The challenge is to predict ΔΨ for each variant.

Background 

Variants that impact splicing regulatory elements, and so alter splicing outcomes, are difficult to detect using conventional poly(A)+ RNA-seq alone because the variants are often spliced out of the mature mRNA. A number of studies have aimed to address this issue. One approach has been to decipher the “splicing code” using computational techniques (Xiong et al., 2015; Xiong et al., 2011; Leung et al., 2014). While these studies have yielded useful knowledge about splicing and do have predictive power, experimental confirmation has been limited and it is clear that the methods are far from perfect. Other groups have used random sequences to understand the splicing code (Rosenberg et al., 2015); however, it is hard to integrate datasets with contextual transcriptome information (e.g., CLIP) when studying the splicing behavior of random sequences. A more recent study tested a number of exonic disease-associated variants in parallel using a mini-gene system (Soemedi et al., 2017). The approach was to observe the allelic ratio of reference to variant in a plasmid pool and compare it with the ratios observed from splicing outcomes. This approach is useful for studying exonic variants, but is unable to test intronic variants.

Experiment 

The Graveley laboratory has developed the Variant exon sequencing (Vex-seq) methodology, a barcoding approach that is capable of testing the splicing impact of many exonic and flanking intronic variants for the same exon simultaneously. This high-throughput reporter system was applied to determine the impact of genomic variants on pre-mRNA splicing (Adamson et al., 2017). Briefly, a library was developed to determine the splicing of test exons flanked by two common constitutive exons. The library was introduced into tissue culture cells, followed by RT-PCR and sequencing to determine the splicing frequency of each test exon. The reporters also contain a barcode sequence that served as an identifier of which exon was present in each sample (Adamson et al., 2017).

Vex-seq was applied to 2,059 variants in a high-throughput, parallel manner. The variants in the study include single nucleotide variants and short indels derived from the ExAC database (Lek et al., 2016). ExAC variants in cassette exons between 68 and 97 nucleotides long and consistent with the experimental approach were arbitrarily selected. These variants were studied in the context of a globin mini-gene construct transfected into HepG2 cells. The products were then sequenced on an Illumina MiSeq. The reads were aligned using STAR (Dobin et al., 2013) to a reference specific to each variant. Percent spliced in (PSI or Ψ) and change in PSI (ΔΨ) from the reference sequence were then calculated.

ΔΨ values are calculated from the output of the amplicon sequencing, using the following formula:

ΔΨ = Ψvariant − Ψreference

ΔΨ is the difference between the variant splicing Ψ and the reference splicing Ψ. As an example, if ΔΨ is calculated from a reference exon that is always spliced in (Ψreference = 100) and a variant exon that is spliced in for only half of the transcripts observed for that variant (Ψvariant = 50), the ΔΨ would be -50. ΔΨ is bounded by -100 and 100.
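
The definition can be sketched in a few lines of Python. This is only an illustration, not part of the Vex-seq pipeline: the function name delta_psi is hypothetical, and Ψ is assumed to be on a 0 to 100 percent scale, as in the worked example.

```python
def delta_psi(psi_variant: float, psi_reference: float) -> float:
    """Return delta PSI = PSI(variant) - PSI(reference), both on a 0-100 scale."""
    for name, value in (("psi_variant", psi_variant), ("psi_reference", psi_reference)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must lie in [0, 100], got {value}")
    # With both inputs in [0, 100], the result is bounded by -100 and 100.
    return psi_variant - psi_reference


# The worked example from the text: reference always included, variant half included.
print(delta_psi(psi_variant=50.0, psi_reference=100.0))  # -50.0
```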

The variants used in the analysis were selected based on the first annotation output by VEP. For more detailed information regarding Vex-seq, please see Adamson et al. (2017).

The Vex-seq assay is able to assess only exon skipping versus inclusion. The PSI calculations are based on included exons / (skipped exons + included exons) for each variant. Due to the design of the assay, unspliced events were not considered. Similarly, other spliced species (e.g., alternative 3’ or 5’ splice sites) were not considered.
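
The PSI ratio described here can be sketched as follows. The function name and the use of read counts as the unit of "included" and "skipped" exons are assumptions made for illustration; the result is scaled to percent to match the Ψ values used elsewhere on this page.

```python
def percent_spliced_in(included_reads: int, skipped_reads: int) -> float:
    """PSI = 100 * included / (skipped + included); other species are not counted."""
    if included_reads < 0 or skipped_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = included_reads + skipped_reads
    if total == 0:
        # No inclusion or skipping reads were observed, so PSI is undefined.
        raise ValueError("PSI is undefined when no reads are observed")
    return 100.0 * included_reads / total


print(percent_spliced_in(included_reads=30, skipped_reads=10))  # 75.0
```

Unspliced reads and reads using alternative 3’ or 5’ splice sites would simply never be passed to this function, mirroring the assay's restriction to inclusion versus skipping.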

Note that the Vex-seq challenge differs in subtle but important ways from the MaPSy challenge, in addition to the experimental method differences. Both challenges measure abundance of the RNA species with the reference test exon perfectly spliced in, but the nature of the exon and the RNA species it is compared with are different. Vex-seq measures only reference splicing of either exon inclusion or exon skipping of known alternative exons, and considers differences in the inclusion rate (unspliced or misspliced species are ignored). By contrast, MaPSy considers reference splicing of reference constitutive exons, comparing these with unspliced RNAs (exon-skipped or misspliced species are ignored). Thus, Vex-seq considers variation that alters regulation of alternative cassette exon inclusion vs skipping, whereas MaPSy considers variants that impede normal constitutive exon splicing.

Prediction challenge

CAGI participants are asked to predict the outcome of the Vex-seq experiment. This is expressed in terms of ΔΨ (delta PSI, where PSI is Percent Spliced In), the difference between the variant Ψ and the reference Ψ.

Chromosome positions for the variants and exon studied are given based on hg19 coordinates.

ID                   chromosome  variant_position  reference  variant  exon_start  exon_end
chr10_114724268_T_C  chr10       114724268         T          C        114724315   114724383
chr10_114724279_A_G  chr10       114724279         A          G        114724315   114724383
chr10_114724280_T_G  chr10       114724280         T          G        114724315   114724383
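
As a hedged sketch of how these rows might be used, the snippet below splits a mutation ID into its fields and reports where the variant lies relative to the exon. The helper names are hypothetical, and treating exon_start and exon_end as inclusive coordinates is an assumption that the page does not state.

```python
from typing import NamedTuple, Tuple


class Variant(NamedTuple):
    chromosome: str
    position: int
    reference: str
    variant: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split an ID such as chr10_114724268_T_C into its four fields."""
    # rsplit keeps any underscores inside the chromosome name intact.
    chromosome, position, reference, variant = variant_id.rsplit("_", 3)
    return Variant(chromosome, int(position), reference, variant)


def locate(position: int, exon_start: int, exon_end: int) -> Tuple[str, int]:
    """Return a region label and the distance in bp to the nearest exon boundary.

    Assumes inclusive exon coordinates; labels refer to genomic coordinate
    order, not to transcript strand.
    """
    if position < exon_start:
        return ("before_exon", exon_start - position)
    if position > exon_end:
        return ("after_exon", position - exon_end)
    return ("exonic", min(position - exon_start, exon_end - position))


v = parse_variant_id("chr10_114724268_T_C")
print(v)
print(locate(v.position, exon_start=114724315, exon_end=114724383))  # ('before_exon', 47)
```

Under these assumptions, all three example variants fall in the flanking sequence before the exon start rather than inside the exon.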

Training set

As a training set, we provide a set of 958 variants that were tested by Vex-seq. For every variant in the training dataset we provide the following:

  • ID: The mutation ID (e.g. chr10_114724268_T_C)
  • Chromosome: Chromosome carrying the variant
  • hg19_variant_position: Base pair position of the variant in hg19
  • Reference: Reference allele
  • variant: Alternative allele
  • exon_start: Starting position of the exon
  • exon_end: Ending position of the exon
  • HepG2_delta_psi: Mean ΔΨ measured in HepG2 cells
  • HepG2_ref_psi: Reference PSI value of the control exon in HepG2 cells
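
Because ΔΨ is the variant Ψ minus the reference Ψ, the two measurement columns together imply the variant Ψ. The sketch below reads a tab-delimited excerpt and recovers it. The numeric values in SAMPLE_TSV are made up for illustration, the helper names are hypothetical, and both the tab-delimited layout and the 0 to 100 scale of the training file are assumptions.

```python
import csv
import io

# Illustrative rows only: the delta PSI and reference PSI values are invented.
SAMPLE_TSV = (
    "ID\tHepG2_delta_psi\tHepG2_ref_psi\n"
    "chr10_114724268_T_C\t-2.5\t90.0\n"
    "chr10_114724279_A_G\t4.0\t90.0\n"
)


def implied_variant_psi(handle):
    """Map each ID to the variant PSI implied by reference PSI + delta PSI."""
    implied = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        implied[row["ID"]] = float(row["HepG2_ref_psi"]) + float(row["HepG2_delta_psi"])
    return implied


def out_of_range(implied):
    """IDs whose implied variant PSI falls outside [0, 100]."""
    return [vid for vid, psi in implied.items() if not 0.0 <= psi <= 100.0]


psi = implied_variant_psi(io.StringIO(SAMPLE_TSV))
print(psi)
print(out_of_range(psi))  # [] for the illustrative rows
```

Implied values outside the 0 to 100 range would suggest that the columns use a different scale than assumed here.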

Testing set

The test set includes all 1,099 variants. The file format is as follows:

  • ID: The mutation ID (e.g. chr10_114724268_T_C)
  • Chromosome: Chromosome carrying the variant
  • hg19_variant_position: Base pair position of the variant in hg19
  • Reference: Reference allele
  • variant: Alternative allele
  • exon_start: Starting position of the exon
  • exon_end: Ending position of the exon

Prediction submission format 

Given a mutation ID, predict the ΔΨ value in HepG2 cells. The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions. Each data row in the submitted file must include the following columns:

  • ID - The mutation ID as listed in the dataset file (e.g. chr10_114724268_T_C).
  • Delta-PSI-score - Predicted ΔΨ value for the variant.
  • Standard deviation - Standard deviation of the prediction in column 2.
  • Comment - Optional brief comment on the basis of the prediction in column 2.

In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. For a given subset, you must submit predictions and standard deviations for all or none of the variants; if you are not confident in a prediction for a variant, enter an appropriately large standard deviation for the prediction. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in those cells. Please make sure you follow the submission guidelines strictly. In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.
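
The formatting rules above can be checked locally before running the official validation script. The sketch below is not that script and does not replace it. The function name is hypothetical, it assumes it is given data rows only (no header line), and the range check on column 2 relies on ΔΨ being bounded by -100 and 100.

```python
import math


def check_submission(rows):
    """Return a list of problems found in tab-delimited submission data rows."""
    problems = []
    predicted = []  # per row: True if columns 2-3 hold numbers, False if both are "*"
    for number, line in enumerate(rows, start=1):
        cells = line.rstrip("\n").split("\t")
        if len(cells) != 4:
            problems.append(f"row {number}: expected 4 columns, found {len(cells)}")
            continue
        if any(cell.strip() == "" for cell in cells):
            problems.append(f"row {number}: empty cells are not allowed")
            continue
        _variant_id, score, sd, _comment = cells  # comment may stay "*"
        if score == "*" and sd == "*":
            predicted.append(False)
            continue
        try:
            score_value, sd_value = float(score), float(sd)
        except ValueError:
            problems.append(f"row {number}: prediction and standard deviation must be numeric")
            continue
        predicted.append(True)
        if not (math.isfinite(score_value) and -100.0 <= score_value <= 100.0):
            problems.append(f"row {number}: prediction outside [-100, 100]")
        if not (math.isfinite(sd_value) and sd_value >= 0.0):
            problems.append(f"row {number}: standard deviation must be non-negative")
    # Predictions must be given for all variants or for none of them.
    if any(predicted) and not all(predicted):
        problems.append("predictions given for some variants but not all")
    return problems


rows = [
    "chr10_114724268_T_C\t-12.5\t4.0\t*",
    "chr10_114724279_A_G\t3.0\t10.0\tillustrative comment",
]
print(check_submission(rows))  # []
```

The prediction values in the example rows are placeholders, not real predictions.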

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Download training dataset: This dataset file is available only to registered users. Please log in to access the file.

Download test dataset: This dataset file is available only to registered users. Please log in to access the file.

Download submission template: This submission template file is available only to registered users. Please log in to access the file.

Download validation script: This submission validation script is available only to registered users. Please log in to access the file.

Dataset provided by

Scott I Adamson, Lijun Zhan, Brenton R Graveley, Department of Genetics and Genome Sciences, Institute for Systems Genomics, UConn Health, Farmington.

Related Challenges 

MaPSy

References

Adamson SI, Zhan L, Graveley BR. 2017. doi:10.1101/191122

Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15-21. PMCID:PMC3530905. doi:10.1093/bioinformatics/bts635

Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Rose SA, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won HH, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Donnelly S, Elosua R, Florez JC, Gabriel SB, Getz G, Glatt SJ, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf JM, Sklar P, Sullivan PF, Tuomilehto J, Tsuang MT, Watkins HC, Wilson JG, Daly MJ, MacArthur DG, Exome Aggregation Consortium. 2016. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536:285-291. PMCID:PMC5018207. doi:10.1038/nature19057

Leung MK, Xiong HY, Lee LJ, Frey BJ. 2014. Deep learning of the tissue-regulated splicing code. Bioinformatics 30:i121-129. PMCID:PMC4058935. doi:10.1093/bioinformatics/btu277

Rosenberg AB, Patwardhan RP, Shendure J, Seelig G. 2015. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell 163:698-711. doi:10.1016/j.cell.2015.09.054

Soemedi R, Cygan KJ, Rhine CL, Wang J, Bulacan C, Yang J, Bayrak-Toydemir P, McDonald J, Fairbrother WG. 2017. Pathogenic variants that alter protein code often disrupt splicing. Nat Genet 49:848-855. doi:10.1038/ng.3837

Xiong HY, Alipanahi B, Lee LJ, Bretschneider H, Merico D, Yuen RK, Hua Y, Gueroussov S, Najafabadi HS, Hughes TR, Morris Q, Barash Y, Krainer AR, Jojic N, Scherer SW, Blencowe BJ, Frey BJ. 2015. RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science 347:1254806. PMCID:PMC4362528. doi:10.1126/science.1254806

Xiong HY, Barash Y, Frey BJ. 2011. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics 27:2554-2562. doi:10.1093/bioinformatics/btr444

Revision history 

14 Dec 2017 (v01): initial release 

16 Jan 2018 (v02): typos fixed 

14 Apr 2018 (v03): PSI for test set added 

24 Sep 2018 (v04): Dataset availability added