Predicting splicing disruption from variants of unknown significance

Challenge: Splicing VUS

Variant data: registered users only

Last updated: 08 June 2021

This challenge is open and will tentatively close on 31 August 2021.

How to participate in CAGI6?

Download data & submit predictions on Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy


Variants causing aberrant splicing have been implicated in a range of common and rare disorders, including retinitis pigmentosa, autism spectrum disorder, amyotrophic lateral sclerosis, and a variety of cancers. However, such variants are frequently overlooked by diagnostic sequencing pipelines, leading to missed diagnoses for patients. Clinically ascertained variants of unknown significance were tested for their impact on splicing by RT-PCR of whole-blood mRNA. The challenge is to predict which of the tested variants disrupt splicing.


Diagnosis of rare disorders has been revolutionised by recent advances in next-generation sequencing, but even with whole-genome sequencing, around half of patients remain undiagnosed (Rehm, 2017). Up to 62% of all pathogenic single-nucleotide variants (SNVs) have been estimated to affect RNA splicing (López-Bigas et al., 2005). However, as with coding variants, a major bottleneck lies in interpreting the functional impact of variants of unknown significance (VUS). Mutations affecting splicing are significant contributors to rare disease, but are frequently overlooked by diagnostic sequencing pipelines, leading to missed diagnoses for patients. In addition, RNA-based therapies, such as splice-modulating antisense oligonucleotides, have shown highly promising results, particularly for neurological disorders (Kim et al., 2019). With many such therapies now entering clinical trials (Khorkova et al., 2021; Le Marois et al., 2021; Morris et al., 2021), accurate prediction of splice-disrupting variants will help classify VUS and increase diagnostic yields, opening new avenues for genetic counseling and treatment.


Patients were recruited to The Splicing and Disease Study on the basis of having a variant of unknown significance in a disease-relevant gene discovered by diagnostic DNA sequencing (gene panel, exome or whole genome). Blood samples were obtained from patients using PAXgene tubes, and mRNA was extracted using the PAXgene Blood RNA Kit. Complementary DNA (cDNA) was synthesised by reverse transcription using random hexamer primers. Patient whole-blood mRNA was analysed by RT-PCR to test whether the variant impacted splicing relative to controls. Primers were designed to amplify the region surrounding each variant of interest, where possible including at least two exons upstream and downstream of the target variant. PCR products were assessed by gel electrophoresis against healthy controls, and purified PCR products were analysed by Sanger sequencing to confirm observed splicing abnormalities.

Experimental work was conducted in two labs – in the Baralle Lab at The University of Southampton, and at Wessex Regional Genetics Laboratory in Salisbury. We previously published findings on over 250 variants analysed in this way (Wai et al., 2020), which includes further detail on recruitment and experimental design in the two labs. Recruitment and testing of patient samples is ongoing.

Prediction challenge

Predict whether the experimentally validated variants of unknown significance impact normal splicing in blood mRNA.

Variants will be provided in the following format with GRCh38 coordinates:

VariantID    chromosome    position    reference    alternative    gene    strand

Example file: example_variant_file_SplicingVUS.txt
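As a minimal sketch of how a variant file with these columns could be read, the Python below parses rows into typed records. It assumes the file is tab-delimited and begins with a header row carrying the column names shown above; neither detail is confirmed here, so check the example file first. The `Variant` type, the `read_variants` helper, and the two sample rows are hypothetical and are not taken from the challenge data.

```python
import csv
import io
from typing import NamedTuple


class Variant(NamedTuple):
    variant_id: str
    chromosome: str
    position: int  # GRCh38 coordinate
    reference: str
    alternative: str
    gene: str
    strand: str


def read_variants(handle):
    """Parse tab-delimited variant rows, taking column names from the header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        Variant(
            variant_id=row["VariantID"],
            chromosome=row["chromosome"],
            position=int(row["position"]),
            reference=row["reference"],
            alternative=row["alternative"],
            gene=row["gene"],
            strand=row["strand"],
        )
        for row in reader
    ]


# Made-up rows for illustration only (assumption: header row + tab delimiter).
example = (
    "VariantID\tchromosome\tposition\treference\talternative\tgene\tstrand\n"
    "var1\t1\t12345\tA\tG\tGENE1\t+\n"
    "var2\tX\t67890\tCT\tC\tGENE2\t-\n"
)
variants = read_variants(io.StringIO(example))
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")`.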

We anticipate 50-70 variants to be available for prediction. All variants included will have been identified as potentially diagnostic variants of unknown significance, and will include SNVs and indels, exonic and intronic variants with various functional annotations (e.g., missense, synonymous, intronic). Previously published variants from The Splicing and Disease Study (n = 257) may be used as training data and are provided in Supplementary Table S1 in Wai et al. (2020), and are representative of the types of variants to be included in the challenge.

The prediction submission is a tab-delimited text file. The organizers provide a template file, which must be used for submission. The template has the following columns:

VariantID    chromosome    position    reference    alternative    score    confidence    classification

Example file: example_submission_format_SplicingVUS.txt

For each variant ID, CAGI participants are asked to provide a prediction “score”, either in the range [0, 1] or on an unbounded scale (-Infinity, +Infinity), with a higher score indicating a stronger prediction of splicing disruption. Participants are also asked to provide a “confidence” for each prediction as a number in the range [0, 1], with 0 indicating the lowest confidence and 1 the highest. Finally, participants are asked to provide a binary “classification” from {0, 1} indicating whether the variant affects splicing (0 = no splicing effect, 1 = splicing is altered relative to controls).
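The value ranges described above can be checked before submission. The following Python is a minimal sketch, not an official validator: the helper names `validate_row` and `format_row` are hypothetical, and treating an infinite or NaN score as invalid is an assumption based on the open interval (-Infinity, +Infinity).

```python
import math


def validate_row(score, confidence, classification):
    """Return a list of problems found in one prediction row (empty if none)."""
    problems = []
    # Any finite real number is accepted as a score (assumption: inf/NaN are invalid).
    if not math.isfinite(score):
        problems.append("score must be a finite number")
    # NaN fails this comparison too, so it is reported as out of range.
    if not 0.0 <= confidence <= 1.0:
        problems.append("confidence must lie in [0, 1]")
    if classification not in (0, 1):
        problems.append("classification must be 0 or 1")
    return problems


def format_row(variant_id, chromosome, position, reference, alternative,
               score, confidence, classification):
    """Join the eight submission columns into one tab-delimited line."""
    fields = [variant_id, chromosome, position, reference, alternative,
              score, confidence, classification]
    return "\t".join(str(field) for field in fields)


ok = validate_row(0.92, 0.8, 1)
bad = validate_row(float("inf"), 1.5, 2)
line = format_row("var1", "1", 12345, "A", "G", 0.92, 0.8, 1)
```

Here `ok` is an empty list, while `bad` reports all three problems. The official template file remains the reference for the exact column layout.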

File naming

CAGI allows submission of up to six models per team, of which model 1 is considered to be primary. You can upload predictions for each model multiple times; for each model, the last submission before the deadline will be evaluated.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

To include a description for your method(s), use the following filename: <teamname>_desc.*

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
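The naming pattern can be checked with a regular expression before uploading. This is a minimal sketch: the helper names are hypothetical, and it assumes the team name may contain any characters, since no restriction is stated here.

```python
import re

# <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
MODEL_FILE = re.compile(r"(?P<team>.+)_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)")
# <teamname>_desc.*
DESC_FILE = re.compile(r"(?P<team>.+)_desc\..*")


def parse_model_filename(filename):
    """Return (team, model number, extension), or None if the name does not match."""
    match = MODEL_FILE.fullmatch(filename)
    if match is None:
        return None
    return match.group("team"), int(match.group("model")), match.group("ext")


def is_description_filename(filename):
    """Return True if the name follows the method-description pattern."""
    return DESC_FILE.fullmatch(filename) is not None


parsed = parse_model_filename("bestincagi_model_3.txt")
```

For the example above, `parsed` is `("bestincagi", 3, "txt")`; a name such as `bestincagi_model_7.txt` returns `None` because only models 1 to 6 are allowed.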


Assessment will be performed in the binary classification setting, with emphasis on clinical relevance. Assessment will be carried out by Jenny Lord.
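The specific assessment metrics are not stated here. As an illustration only, the Python sketch below computes measures commonly used for binary classifiers (sensitivity, specificity, and positive predictive value) from 0/1 labels. The choice of metrics, the function names, and the toy labels are all assumptions and do not describe the assessor's method.

```python
def confusion_counts(truth, predicted):
    """Count true/false positives and negatives for paired 0/1 labels."""
    pairs = list(zip(truth, predicted))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def summary_metrics(truth, predicted):
    """Compute sensitivity, specificity, and positive predictive value."""
    c = confusion_counts(truth, predicted)
    return {
        "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),
        "specificity": ratio(c["tn"], c["tn"] + c["fp"]),
        "ppv": ratio(c["tp"], c["tp"] + c["fp"]),
    }


# Toy labels for illustration only: 1 = splicing altered, 0 = no splicing effect.
truth = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(truth, predicted)
metrics = summary_metrics(truth, predicted)
```

With these toy labels there are two true positives, one false positive, one false negative, and two true negatives.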

Dataset provided by

Diana Baralle, NIHR Research Professor Genomic Medicine and Honorary Consultant in Clinical Genetics, University of Southampton; Jenny Lord, Postdoctoral Research Fellow, University of Southampton; Andrew Douglas, Consultant and Honorary Senior Clinical Lecturer in Clinical Genetics, University of Southampton; Htoo Wai, Postdoctoral Research Fellow, University of Southampton; David J Bunyan, Wessex Regional Genetics Laboratory and University of Southampton


Kim J, et al. Patient-customized oligonucleotide therapy for a rare genetic disease. N Engl J Med (2019) 381(17): 1644-1652. PubMed

Khorkova O, et al. Nucleic acid-based therapeutics in orphan neurological disorders: recent developments. Front Mol Biosci (2021) 8: 643681. PubMed

Le Marois M, et al. RNA therapeutics for mood disorders: current evidence towards clinical trials. Expert Opin Investig Drugs (2021).  PubMed 

López-Bigas N, et al. Are splicing mutations the most frequent cause of hereditary disease? FEBS Lett (2005) 579(9): 1900-1903. PubMed 

Morris G, et al. Opportunities and challenges for microRNA-targeting therapeutics for epilepsy. Trends Pharmacol Sci (2021) S0165-6147(21)00077-8. PubMed 

Rehm HL. Evolving health care through personal genomics. Nat Rev Genet (2017) 18(4): 259-267. PubMed

Wai HA, et al. Blood RNA analysis can increase clinical diagnostic rate and resolve variants of uncertain significance. Genet Med (2020) 22(6): 1005-1014. PubMed  

Revision history 

20 May 2021: initial release 

08 June 2021: challenge opens