Predicting variant effects in functional regulatory elements using lentiMPRA
Challenge: lentiMPRA
Variant data: registered users only
Last updated: 4 June 2025
This challenge will soon open. The challenge closes on September 15, 2025.
Make sure you understand our Data Use Agreement and Anonymity Policy
Summary
The challenge is to predict the functional impact of genetic variants on regulatory element activity. For this purpose, a subset of functionally validated regulatory elements from a large-scale lentiMPRA study was selected, and known single nucleotide variants (SNVs) located in these elements were added. SNVs were chosen from the 1000 Genomes Project with a focus on variants with diverse allele frequency distributions and proximity to known genes. Each SNV was tested in both reference and alternative allele contexts using lentiMPRA in HepG2 cells across three biological replicates. Sequences were cloned upstream of a minimal promoter in a barcode-tagged reporter construct. To determine the activity of each sequence, reporter gene expression was measured relative to the plasmid DNA by short-read sequencing of barcodes from the reporter libraries. Variant effects were determined as the difference in activity between paired reference and alternative sequences.
Background
While large-scale functional genomics studies have identified millions of candidate cis-regulatory elements (cCREs), the functional interpretation of genetic variants within these elements remains a major challenge. Most genome-wide association study (GWAS) signals fall outside of coding regions, as do many potentially disease-causing variants in rare disease studies. Frequently, they fall within annotated cCREs, which cover a genomic sequence space at least 10x larger than all protein-coding sequences. Because of the sheer number of such variants, comprehensive experimental mapping and follow-up experiments are limited. This creates a need for computational predictions of molecular effects and the computational prioritization of variants.
Recent advances in massively parallel reporter assays (MPRAs) have enabled systematic functional testing of regulatory variants in cellular systems (Gordon et al., 2020; Klein et al., 2020). A landmark study used lentivirus-based MPRA (lentiMPRA) to characterize over 680,000 regulatory elements across three cell types, identifying 41.7% as functionally active (Agarwal et al., 2025). This comprehensive dataset provides a foundation for understanding regulatory element function and serves as a resource for selecting elements likely to harbor functional variants. Building on this dataset and a curated set of regulatory elements with demonstrated activity in HepG2 cells, we designed new MPRA experiments that tested 200 bp sequences carrying, at the central position, either the human reference genome allele or an alternative allele described in the 1000 Genomes Project (2015). Based on their readouts in HepG2 cells, we curated variants with a broad range of activating and repressing effects, as well as variants without a measurable effect in the MPRAs. By testing naturally occurring SNVs from population data, this challenge addresses the critical need for accurate variant interpretation in regulatory genomics.
Experiment
Variant Selection Strategy
Source dataset: Unpublished MPRA experiments of known SNVs positioned within functionally validated regulatory elements from the Agarwal et al. (2025) lentiMPRA study. The experiment was performed in two batches and variants are taken from the combined analysis. As the Agarwal et al. (2025) work was performed on GRCh37/hg19 coordinates, this work uses the same (outdated) genome reference build.
Element types: Both promoters and enhancer CREs represented.
Allele frequency criteria: SNVs with diverse allele frequency distributions (rare, common, and intermediate frequencies) from the 1000 Genomes Project were selected.
Functional priority: Variants are in regions that are within 1 Mb of a known genic locus.
lentiMPRA Protocol for Variant Testing
Each selected SNV was tested by creating regulatory element constructs containing either the reference or alternative allele. The experimental protocol followed established lentiMPRA methods (Gordon et al., 2020):
The variant effects were analyzed using BCalm (BarCode analysis using linear models; Keukeleire et al., 2025), an adaptation of mpralm that models individual barcode counts instead of aggregating counts per sequence. BCalm fits a generalized linear model (GLM) in which log₂(RNA/DNA) ratios are modeled as Y = β₀ + β₁X + ε, where Y represents the observed log₂(RNA/DNA) values, β₀ equals log₂(RNA_ref/DNA_ref), and β₁ represents the log fold-change (logFC) effect size between alternative and reference alleles. The design matrix X is 0 for reference sequence counts and 1 for alternative sequence counts, so that Y corresponds to log₂(RNA_alt/DNA_alt) for the alternative allele. Precision weights are calculated using voom to estimate the mean-variance relationship and account for overdispersion. BCalm employs limma's empirical Bayes variance moderation method to stabilize variance estimates across sequences and control Type I error. Statistical significance is assessed using moderated t-statistics, with p-values derived from the resulting test statistics. A confidence score was derived by capping the p-value of the linear regression at 10⁻²⁵ and scaling the negative log₁₀-transformed value between 0 and 1 (i.e., 1 corresponds to a p-value of ≤10⁻²⁵, 0 to a p-value of 1). A confidence score greater than or equal to 0.05 (a p-value of ~0.05) indicates that the SNV "has an expression effect"; a lower score indicates that it "has no expression effect".
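The confidence score and the meaning of the logFC coefficient can be illustrated with a minimal Python sketch. It is not BCalm: the function names are hypothetical, and the logFC helper is a plain unweighted least-squares estimate (with a 0/1 allele indicator, the slope equals the difference of the group means) without voom weights or empirical Bayes moderation.

```python
import math
from statistics import fmean

MAX_NEG_LOG10_P = 25.0   # p-values are capped at 1e-25, as described above
EFFECT_THRESHOLD = 0.05  # score at or above this: "has an expression effect"


def confidence_score(p_value: float) -> float:
    """Scale -log10(p) to [0, 1]: 1 for p <= 1e-25, 0 for p = 1."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value == 0.0:
        return 1.0
    # For p <= 1, -log10(p) equals abs(log10(p)); abs() also avoids -0.0.
    return min(abs(math.log10(p_value)), MAX_NEG_LOG10_P) / MAX_NEG_LOG10_P


def has_expression_effect(score: float) -> bool:
    """Apply the 0.05 threshold on the confidence score."""
    return score >= EFFECT_THRESHOLD


def unweighted_log_fc(ref_log2_ratios, alt_log2_ratios) -> float:
    """Simplified beta_1: mean alt log2(RNA/DNA) minus mean ref log2(RNA/DNA).

    Illustrative only; the actual analysis uses weighted, moderated models.
    """
    return fmean(alt_log2_ratios) - fmean(ref_log2_ratios)
```

Under this scaling, a p-value of 0.05 maps to a score of about 0.052, which is why the 0.05 score threshold roughly corresponds to a p-value of 0.05.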
Prediction challenge
Predict the functional effect of single nucleotide variants on regulatory element activity in HepG2 cells.
Training data:
A tab delimited file of variant effects for about 25% of all measured SNVs (for a subset of chromosomes; i.e., chr13, chr14, chr15, chr18, chr19, chr20, chr21, chr22) is provided.
Training data set files have the following columns:
Test data:
Same format as training set but provided without effect sizes (Value) or p-values (Confidence).
Submission format
For the remaining ~75% of the data, participants are asked to submit their predictions based on the chromosome, position (1-based) and the reference and alternate allele.
The prediction submission is a tab-delimited text file. A validation script is provided, and predictors must check the correctness of the format before submitting their predictions.
Each data row in the submitted file must include the following columns:
Submissions will be evaluated based on distance and correlation measures. Note that other evaluation criteria may also be used, at the assessor’s discretion.
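The exact evaluation measures are not specified here. As an illustration of what distance and correlation measures can look like, the following Python sketch computes a root-mean-square error and a Pearson correlation between predicted and measured effect sizes. The choice of these two measures and the function names are assumptions for illustration, not the assessors' protocol.

```python
import math


def rmse(predicted, observed) -> float:
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(predicted, observed) -> float:
    """Pearson correlation coefficient; undefined for constant input."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_p = sum(predicted) / len(predicted)
    mean_o = sum(observed) / len(observed)
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_o)
```

Running such measures on held-out training variants can give a rough sense of model quality before submission.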
In the template file, cells in columns 6-12 are marked with a dot ("."). Submit your predictions by replacing the "." with your value. No empty cells are allowed in the submission. You must submit predictions and confidence values for all variants; if you are not confident in a prediction for a variant, enter an appropriately small confidence value for that prediction. Optionally, enter a brief comment on the prediction. If you do not enter a comment on a prediction, leave the "." in those cells. Please make sure you follow the submission guidelines strictly.
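Before running the provided validation script, a quick local pre-check can catch the most common formatting slips. The Python sketch below is not the official validator: it assumes that each data row has exactly 12 tab-separated columns (inferred from the "columns 6-12" wording above), and the function name is hypothetical. It only checks the column count and the rule that no cell may be empty.

```python
EXPECTED_COLUMNS = 12  # assumption inferred from "columns 6-12" in the template


def find_format_problems(lines):
    """Return a list of human-readable problems found in the given rows."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != EXPECTED_COLUMNS:
            problems.append(
                f"line {number}: expected {EXPECTED_COLUMNS} columns, "
                f"found {len(fields)}"
            )
        # Empty cells are not allowed; unused comment cells keep the ".".
        if any(field.strip() == "" for field in fields):
            problems.append(f"line {number}: empty cell")
    return problems
```

An empty result only means that these two checks passed; the official validation script remains the reference for format correctness.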
In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.
File naming
CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; for each model, the last submission before the deadline will be evaluated. If you are submitting a single file with all predictions combined, please use the format below.
Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
To include a description of your method, use the following filename: <teamname>_desc.*
Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
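The naming rules above can be checked with a short Python sketch. The set of characters allowed in a team name is not stated here, so the sketch assumes letters, digits, and hyphens; the function names are hypothetical.

```python
import re

# Assumption: team names consist of letters, digits, and hyphens only.
TEAM = r"[A-Za-z0-9-]+"
PREDICTION_NAME = re.compile(rf"{TEAM}_model_[1-6]\.(tsv|txt)")
DESCRIPTION_NAME = re.compile(rf"{TEAM}_desc\..+")


def is_valid_prediction_filename(name: str) -> bool:
    """Check <teamname>_model_(1|2|3|4|5|6).(tsv|txt)."""
    return PREDICTION_NAME.fullmatch(name) is not None


def is_valid_description_filename(name: str) -> bool:
    """Check <teamname>_desc.* with any non-empty extension."""
    return DESCRIPTION_NAME.fullmatch(name) is not None
```

For example, is_valid_prediction_filename("bestincagi_model_3.txt") returns True, while a model number of 7 or a .csv extension is rejected.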
Download data
Data: available from the Synapse portal
Dataset provided by
Arjun Devadas, Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, Lübeck, Germany.
Ryan D Hernandez, Department of Bioengineering and Therapeutic Sciences, Cardiovascular Research Institute, Quantitative Biosciences Institute, University of California San Francisco, San Francisco CA, USA
Nadav Ahituv, Department of Bioengineering and Therapeutic Sciences, Institute for Human Genetics, University of California San Francisco, San Francisco CA, USA
Martin Kircher, Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, Lübeck, Germany; Exploratory Diagnostic Sciences, Computational Genome Biology, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
References
Agarwal V, et al. Massively parallel characterization of transcriptional regulatory elements. Nature (2025) 639(8054):411-420.
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature (2015) 526(7571):68-74.
Gordon MG, et al. lentiMPRA and MPRAflow for high-throughput functional characterization of gene regulatory elements. Nat Protoc (2020) 15(8):2387-2412.
Klein JC, et al. A systematic evaluation of the design and context dependencies of massively parallel reporter assays. Nat Methods (2020) 17(11):1083-1091.
Keukeleire P, et al. Using individual barcodes to increase quantification power of massively parallel reporter assays. BMC Bioinformatics (2025) 26(1):52.
Revision history
4 June 2025: challenge preview posted