Identify the impact of genetic variants on exon splicing (Vex-seq)

Challenge: Vex-seq
Dataset description: public
Category: Splicing
Last updated: 24 Apr 2018
This challenge will tentatively close at 8:00 PM PST (Pacific Standard Time) on 2 May 2018.

Although the challenge has closed, late submissions have occasionally been made in CAGI. Our policy is that, out of fairness, these cannot be included in the primary assessment by the assessor. However, the assessor will have access to late submissions and may, at their discretion, choose to consider them in parallel with the on-time primary submissions. If the assessor chooses to consider them, the results for late submissions will always be labeled as 'late' and kept distinct, but might be mentioned in presentations and in the publication.


Note
The file with the reference PSI values for the test set is available only to registered users. Please log in to access the file.

Summary
A barcoding approach, Variant exon sequencing (Vex-seq), was applied to assess the effect of 2,059 natural single nucleotide variants and short indels on splicing of a globin mini-gene construct transfected into HepG2 cells. The results are expressed as ΔΨ (delta PSI, where PSI is Percent Spliced In), the difference between the variant Ψ and the reference Ψ. The challenge is to predict ΔΨ for each variant.

Background
Variants that impact splicing regulatory elements and so alter splicing outcomes are difficult to detect using conventional poly(A)+ RNA-seq alone because the variants are often spliced out of the mature mRNA. A number of studies have aimed to address this issue. One approach has been to decipher the “splicing code” using computational techniques (Xiong et al., 2015; Xiong et al., 2011; Leung et al., 2014). While these studies have yielded useful knowledge about splicing and do have predictive power, experimental confirmation has been limited and it is clear that the methods are far from perfect. Other groups have used random sequences to understand the splicing code; however, it is hard to integrate datasets with contextual transcriptome information (e.g., CLIP) when studying the splicing behavior of random sequences (Rosenberg et al., 2015). A more recent study tested a number of exonic disease-associated variants in parallel using a mini-gene system (Soemedi et al., 2017). The approach was to observe the allelic ratio of reference to variant in a plasmid pool and compare it with the ratios observed from splicing outcomes. This approach is useful for studying exonic variants, but is unable to test intronic variants.

Experiment
The Graveley laboratory has developed the Variant exon sequencing (Vex-seq) methodology, a barcoding approach that is capable of testing the splicing impact of many exonic and flanking intronic variants for the same exon simultaneously. This high-throughput reporter system was applied to determine the impact of genomic variants on pre-mRNA splicing (Adamson et al., 2017). Briefly, a library was developed to determine the splicing of test exons, each flanked by two common constitutive exons. The library was introduced into tissue culture cells, followed by RT-PCR and sequencing to determine the splicing frequency of each test exon. The reporters also contain a barcode sequence that serves as an identifier of which exon was present in each sample.

Vex-seq was applied to 2,059 variants in a high throughput parallel manner. The variants in the study include single nucleotide variants and short indels derived from the ExAC database (Lek et al., 2016). ExAC variants in cassette exons between 68 and 97 nucleotides long and consistent with the experimental approach were arbitrarily selected. These variants were studied in the context of a globin mini-gene construct transfected into HepG2 cells. The products were then sequenced on an Illumina MiSeq. The reads were aligned using STAR to a reference specific to each variant (Dobin et al., 2013). Percent spliced in (PSI or Ψ) and change in PSI (ΔΨ) from the reference sequence were then calculated.

ΔΨ values are calculated from the output of the amplicon sequencing, using the following formula:

[Figure: Vexseq.png, formula for calculating ΔΨ]

ΔΨ is the difference between the variant splicing Ψ and the reference splicing Ψ. As an example, if ΔΨ is calculated from a reference exon that is always spliced in (Ψreference = 100) and a variant exon that is spliced in for only half of the transcripts observed for that variant (Ψvariant = 50), the ΔΨ would be -50. ΔΨ is bounded by -100 and 100.

The variants used in the analysis were selected based on the first annotation output by VEP. For more detailed information on Vex-seq, see Adamson et al. (2017).

The Vex-seq assay is able to assess only exon skipping versus inclusion. The PSI calculations are based on included exons / (skipped exons + included exons) for each variant. Due to the design of the assay, unspliced events were not considered. Similarly, other spliced species (e.g., alternative 3’ or 5’ splice sites) were not considered.
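The calculation described above can be sketched as follows. This is a minimal illustration, assuming Ψ is expressed on a 0 to 100 scale and that read counts for included and skipped products are already available per variant; the function names `psi` and `delta_psi` are hypothetical and are not part of the Vex-seq pipeline.

```python
def psi(included: int, skipped: int) -> float:
    """Percent spliced in on a 0-100 scale: included / (skipped + included)."""
    total = included + skipped
    if total == 0:
        raise ValueError("no inclusion or skipping reads observed")
    return 100.0 * included / total


def delta_psi(variant_counts, reference_counts) -> float:
    """Delta PSI = PSI(variant) - PSI(reference).

    Each argument is an (included, skipped) pair of read counts.
    Unspliced and other spliced species are ignored, as in the assay.
    """
    return psi(*variant_counts) - psi(*reference_counts)


# Worked example from the text: the reference exon is always included,
# the variant exon is included in half of the observed transcripts.
example = delta_psi(variant_counts=(50, 50), reference_counts=(100, 0))
print(example)  # -50.0
```

Because each Ψ lies between 0 and 100, the difference necessarily lies between -100 and 100.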

Note that the Vex-seq challenge differs in subtle but important ways from the MaPSy challenge, in addition to the experimental method differences. Both challenges measure the abundance of the RNA species with the reference test exon perfectly spliced in, but the nature of the exon and the RNA species it is compared with are different. Vex-seq measures only reference splicing, either exon inclusion or exon skipping, of known alternative exons, and considers differences in the inclusion rate (unspliced or misspliced species are ignored). By contrast, MaPSy considers reference splicing of reference constitutive exons, comparing these with unspliced RNAs (exon-skipped or misspliced species are ignored). Thus, Vex-seq considers variation that alters regulation of alternative cassette exon inclusion vs skipping, whereas MaPSy considers variants that impede normal constitutive exon splicing.

Prediction challenge
CAGI participants are asked to predict the outcome of the Vex-seq experiment. This is expressed in terms of ΔΨ (delta PSI, where PSI is Percent Spliced In), the difference between the variant Ψ and the reference Ψ.

Chromosome positions for the variants and exon studied are given based on hg19 coordinates.

ID                    chromosome  variant_position  reference  variant  exon_start  exon_end
chr10_114724268_T_C   chr10       114724268         T          C        114724315   114724383
chr10_114724279_A_G   chr10       114724279         A          G        114724315   114724383
chr10_114724280_T_G   chr10       114724280         T          G        114724315   114724383
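As a rough illustration of how these fields relate, the sketch below splits a mutation ID into its parts and reports whether the variant falls inside the exon or in the flanking sequence. It assumes that exon_start and exon_end are inclusive hg19 coordinates; strand is not given in the table, so "before" and "after" refer to genomic coordinate order only, not to 5' or 3' orientation. The helper names are hypothetical.

```python
def parse_variant_id(variant_id: str):
    """Split an ID such as 'chr10_114724268_T_C' into its four fields."""
    chrom, pos, ref, alt = variant_id.rsplit("_", 3)
    return chrom, int(pos), ref, alt


def locate(pos: int, exon_start: int, exon_end: int):
    """Return (region, distance in bp to the nearest exon boundary).

    Assumption: exon_start and exon_end are inclusive coordinates.
    """
    if pos < exon_start:
        return ("before_exon", exon_start - pos)
    if pos > exon_end:
        return ("after_exon", pos - exon_end)
    return ("exonic", 0)


chrom, pos, ref, alt = parse_variant_id("chr10_114724268_T_C")
print(chrom, pos, ref, alt)
print(locate(pos, 114724315, 114724383))  # ('before_exon', 47)
```

All three example variants above lie before the exon start in genomic coordinates, that is, in the flanking intronic sequence.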

Training set: As a training set, we provide a set of 958 variants that were tested by Vex-seq. For every variant in the training dataset we provide the following:

  1. ID: The mutation ID (e.g. chr10_114724268_T_C).
  2. Chromosome: Chromosome on which the variant is located
  3. hg19_variant_position: Base-pair position of the variant in hg19 coordinates
  4. Reference: Reference allele
  5. variant: Alternative allele
  6. exon_start: Starting position of the exon
  7. exon_end: Ending position of the exon
  8. HepG2_delta_psi: Mean ΔΨ measured in HepG2 cells
  9. HepG2_ref_psi: Reference PSI value for the control (reference) exon
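A minimal sketch of reading such a file is shown below. It assumes the training file is tab-delimited with a header row using the column names listed above; the actual file layout may differ. The ΔΨ and reference Ψ values in the sample row are invented placeholders for illustration, not measured data. Since ΔΨ is defined as the variant Ψ minus the reference Ψ, the variant Ψ can be recovered as the sum of the last two columns.

```python
import csv
import io

# Placeholder row: the -5.0 and 80.0 values are made up for illustration only.
SAMPLE = (
    "ID\tChromosome\thg19_variant_position\tReference\tvariant\t"
    "exon_start\texon_end\tHepG2_delta_psi\tHepG2_ref_psi\n"
    "chr10_114724268_T_C\tchr10\t114724268\tT\tC\t"
    "114724315\t114724383\t-5.0\t80.0\n"
)


def load_training(handle):
    """Read tab-delimited training rows and convert numeric fields."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        row = dict(record)
        for key in ("hg19_variant_position", "exon_start", "exon_end"):
            row[key] = int(row[key])
        for key in ("HepG2_delta_psi", "HepG2_ref_psi"):
            row[key] = float(row[key])
        # Variant PSI implied by the definition of delta PSI.
        row["variant_psi"] = row["HepG2_ref_psi"] + row["HepG2_delta_psi"]
        rows.append(row)
    return rows


rows = load_training(io.StringIO(SAMPLE))
print(rows[0]["ID"], rows[0]["variant_psi"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` object.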

Testing set: The test set includes 1,099 variants. The file format is as follows:

  1. ID: The mutation ID (e.g. chr10_114724268_T_C).
  2. Chromosome: Chromosome on which the variant is located
  3. hg19_variant_position: Base-pair position of the variant in hg19 coordinates
  4. Reference: Reference allele
  5. variant: Alternative allele
  6. exon_start: Starting position of the exon
  7. exon_end: Ending position of the exon

Prediction submission format
Given a mutation ID, predict the ΔΨ value in HepG2 cells. The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions. Each data row in the submitted file must include the following columns:

  1. ID - The mutation ID as listed in the dataset file (e.g. chr10_114724268_T_C).
  2. Delta-PSI-score - Predicted ΔΨ value for the variant.
  3. Standard deviation – Standard deviation of the prediction in column 2.
  4. Comment – optional brief comment on the basis of the prediction in column 2

In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. For a given subset, you must submit predictions and standard deviations for all or none of the variants; if you are not confident in a prediction for a variant, enter an appropriately large standard deviation for the prediction. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in those cells. Please make sure you follow the submission guidelines strictly.
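The rules above can be checked locally before running the official validation script. The sketch below is not that script; it is a hypothetical pre-check that assumes data rows only (no header line), four tab-separated columns, a ΔΨ prediction between -100 and 100, and a non-negative standard deviation.

```python
def check_submission(rows):
    """Return a list of problems found in tab-delimited data rows."""
    problems = []
    filled = 0
    for n, row in enumerate(rows, start=1):
        cells = row.rstrip("\n").split("\t")
        if len(cells) != 4 or any(cell.strip() == "" for cell in cells):
            problems.append(f"row {n}: need 4 non-empty tab-separated cells")
            continue
        _variant_id, score, sd, _comment = cells
        if score == "*" and sd == "*":
            continue  # prediction left unfilled for this variant
        filled += 1
        try:
            score_value, sd_value = float(score), float(sd)
        except ValueError:
            problems.append(f"row {n}: prediction and standard deviation must be numeric")
            continue
        if not -100.0 <= score_value <= 100.0:
            problems.append(f"row {n}: prediction outside [-100, 100]")
        if not sd_value >= 0.0:
            problems.append(f"row {n}: standard deviation must be non-negative")
    if 0 < filled < len(rows):
        problems.append("predictions given for only some variants (all or none required)")
    return problems


good = ["chr10_114724268_T_C\t-12.5\t4.0\t*"]
print(check_submission(good))  # []
```

An empty list means no problems were detected by this pre-check; the official validation script remains the authority on format correctness.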

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Related Challenges
MaPSy

Downloads
The training dataset, test dataset, submission template, and validation script are available only to registered users. Please log in to access the files.

References

  1. Adamson SI, Zhan L, Graveley BR. 2017. doi:10.1101/191122
  2. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15-21. PMCID:PMC3530905. doi:10.1093/bioinformatics/bts635
  3. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Rose SA, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won HH, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Donnelly S, Elosua R, Florez JC, Gabriel SB, Getz G, Glatt SJ, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf JM, Sklar P, Sullivan PF, Tuomilehto J, Tsuang MT, Watkins HC, Wilson JG, Daly MJ, MacArthur DG, Exome Aggregation Consortium. 2016. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536:285-291. PMCID:PMC5018207. doi:10.1038/nature19057
  4. Leung MK, Xiong HY, Lee LJ, Frey BJ. 2014. Deep learning of the tissue-regulated splicing code. Bioinformatics 30:i121-129. PMCID:PMC4058935. doi:10.1093/bioinformatics/btu277
  5. Rosenberg AB, Patwardhan RP, Shendure J, Seelig G. 2015. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell 163:698-711. doi:10.1016/j.cell.2015.09.054
  6. Soemedi R, Cygan KJ, Rhine CL, Wang J, Bulacan C, Yang J, Bayrak-Toydemir P, McDonald J, Fairbrother WG. 2017. Pathogenic variants that alter protein code often disrupt splicing. Nat Genet 49:848-855. doi:10.1038/ng.3837
  7. Xiong HY, Alipanahi B, Lee LJ, Bretschneider H, Merico D, Yuen RK, Hua Y, Gueroussov S, Najafabadi HS, Hughes TR, Morris Q, Barash Y, Krainer AR, Jojic N, Scherer SW, Blencowe BJ, Frey BJ. 2015. RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science 347:1254806. PMCID:PMC4362528. doi:10.1126/science.1254806
  8. Xiong HY, Barash Y, Frey BJ. 2011. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics 27:2554-2562. doi:10.1093/bioinformatics/btr444

Dataset provided by

Scott I Adamson, Lijun Zhan, Brenton R Graveley, Department of Genetics and Genome Sciences, Institute for Systems Genomics, UConn Health, Farmington

Revision history
14 Dec 2017 (v01): initial release
16 Jan 2018 (v02): typos fixed
14 Apr 2018 (v03): PSI for test set added