Critical Assessment of Genome Interpretation

Predicting variant effects in functional regulatory elements using lentiMPRA

Challenge: lentiMPRA

Variant data: registered users only

Last updated: 19 October 2025

This challenge closed on October 15, 2025.

How to participate in CAGI7? Download data & submit predictions on Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy

Summary

The challenge is to predict the functional impact of genetic variants on regulatory element activity. For this purpose, a subset of functionally validated regulatory elements from a large-scale lentiMPRA study was selected, and known single nucleotide variants (SNVs) located in these elements were introduced. SNVs were chosen from the 1000 Genomes Project with a focus on variants with diverse allele frequencies and proximity to known genes. Each SNV was tested in both reference and alternative allele contexts using lentiMPRA in HepG2 cells across three biological replicates. Sequences were cloned upstream of a minimal promoter in a barcode-tagged reporter construct. To determine the activity of each sequence, reporter gene expression was measured relative to the plasmid DNA by short-read sequencing of barcodes from the reporter libraries. Variant effects were determined as the difference in activity between paired reference and alternative sequences.

Background

While large-scale functional genomics studies have identified millions of candidate cis-regulatory elements (cCREs), the functional interpretation of genetic variants within these elements remains a major challenge. Most genome-wide association study (GWAS) signals fall outside of coding regions, as do many potentially disease-causing variants in rare disease studies. Frequently, they fall within annotated cCREs, which cover a genomic sequence space at least 10x larger than all protein-coding sequences. Because of the sheer number of such variants, comprehensive experimental mapping and follow-up experiments are limited. This creates a need for computational predictions of molecular effects and the computational prioritization of variants.

Recent advances in massively parallel reporter assays (MPRAs) have enabled systematic functional testing of regulatory variants in cellular systems (Gordon et al., 2020; Klein et al., 2020). A landmark study by Agarwal et al. (2025) used lentivirus-based MPRA (lentiMPRA) to characterize over 680,000 regulatory elements across three cell types, identifying 41.7% as functionally active. This comprehensive dataset provides a foundation for understanding regulatory element function and serves as a resource for selecting elements likely to harbor functional variants. Building on this dataset and a curated set of regulatory elements with demonstrated activity in HepG2 cells, we designed new MPRA experiments that tested 200 bp sequences carrying, at the central position, either the human reference genome allele or an alternative allele described in the 1000 Genomes Project (2015). Based on their readouts in HepG2 cells, we curated variants with a broad range of activating and repressing effects, as well as variants without a measurable effect in the MPRAs. By testing naturally occurring SNVs from population data, this challenge addresses the critical need for accurate variant interpretation in regulatory genomics.

Experiment

Variant Selection Strategy

Source dataset: Unpublished MPRA experiments of known SNVs positioned within functionally validated regulatory elements from the Agarwal et al. (2025) lentiMPRA study. The experiment was performed in two batches and variants are taken from the combined analysis. As the Agarwal et al. (2025) work was performed on GRCh37/hg19 coordinates, this work uses the same (outdated) genome reference build.

Element types: Both promoter and enhancer CREs are represented.

Allele frequency criteria: SNVs with diverse allele frequency distributions (rare, common, and intermediate frequencies) from the 1000 Genomes Project were selected.

Functional priority: Variants are located within 1 Mb of a known genic locus.

lentiMPRA Protocol for Variant Testing

Each selected SNV was tested by creating regulatory element constructs containing either the reference or alternative allele. The experimental protocol followed established lentiMPRA methods (Gordon et al., 2020):

Library construction: 200 bp sequences centered on each variant position
Barcode association: Multiple unique barcodes per variant allele (target: 50+ barcodes)
Lentiviral transduction: HepG2 cells infected with pooled MPRA libraries in biological triplicate
Activity measurement: RNA/DNA barcode ratios from RNA and DNA counts
Statistical analysis: Variant effects analyzed using BCalm (Keukeleire et al., 2025)
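The activity-measurement step above can be illustrated with a minimal sketch. It assumes per-barcode RNA and DNA counts are already available as (rna, dna) pairs, skips barcodes with a zero count, and takes the difference of mean log2(RNA/DNA) between alleles. This is a simplification for illustration only, not the BCalm analysis used for the challenge data: it applies no sequencing-depth normalization, precision weights, or variance moderation. The function names and toy counts are hypothetical.

```python
import math
from statistics import mean


def barcode_log_ratios(counts):
    """Return log2(RNA/DNA) for each (rna, dna) barcode pair with non-zero counts."""
    return [math.log2(rna / dna) for rna, dna in counts if rna > 0 and dna > 0]


def naive_variant_effect(ref_counts, alt_counts):
    """Difference of mean log2(RNA/DNA): alternative minus reference allele."""
    ref_activity = mean(barcode_log_ratios(ref_counts))
    alt_activity = mean(barcode_log_ratios(alt_counts))
    return alt_activity - ref_activity


# Toy (rna, dna) counts per barcode; real libraries target 50+ barcodes per allele.
ref_counts = [(10, 10), (20, 20), (0, 5)]  # last barcode is skipped (zero RNA count)
alt_counts = [(20, 10), (40, 20)]
effect = naive_variant_effect(ref_counts, alt_counts)
print(f"naive log2 fold-change (alt vs. ref): {effect:.2f}")
```

A positive value means the alternative allele drives more reporter expression than the reference allele; a negative value means less.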

The variant effects were analyzed using BCalm (BarCode analysis using linear models; Keukeleire et al., 2025), an adaptation of mpralm that models individual barcode counts instead of aggregating counts per sequence. BCalm fits a generalized linear model (GLM) in which log₂(RNA/DNA) ratios are modeled as Y = β₀ + β₁X + ε, where Y represents the observed log₂(RNA/DNA) ratios, β₀ equals log₂(RNA_ref/DNA_ref), and β₁ represents the log fold-change (logFC) effect size between alternative and reference alleles. The design matrix X is 0 for reference sequence counts and 1 for alternative sequence counts, with precision weights calculated using Voom to estimate the mean-variance relationship and account for overdispersion. BCalm employs limma's empirical Bayes variance moderation method to stabilize variance estimates across sequences and control Type I error. Statistical significance is assessed using moderated t-statistics, with p-values derived from the resulting test statistics. A confidence score was derived by capping the p-value of the linear regression at 10⁻²⁵ and scaling the negative log₁₀-transformed value between 0 and 1 (i.e., 1 corresponds to a p-value of ≤10⁻²⁵ and 0 to a p-value of 1). A confidence score greater than or equal to 0.05 (p-value of ~0.05) indicates that the SNV "has an expression effect"; a lower score indicates that it "has no expression effect".
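The confidence score described above can be sketched as follows. The sketch assumes the scaling is a plain division of the capped negative log10 p-value by 25, which matches the stated endpoints (p = 1 gives 0, p ≤ 10^-25 gives 1). The function and constant names are hypothetical and not taken from BCalm.

```python
import math

MAX_NEG_LOG10_P = 25.0   # p-values are capped at 1e-25
EFFECT_THRESHOLD = 0.05  # confidence cutoff for "has an expression effect"


def confidence_from_p_value(p_value):
    """Map a p-value to [0, 1]: p = 1 gives 0, p <= 1e-25 gives 1."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    capped = max(p_value, 10.0 ** -MAX_NEG_LOG10_P)
    score = -math.log10(capped) / MAX_NEG_LOG10_P
    return min(1.0, max(0.0, score))  # clamp away floating-point overshoot


def has_expression_effect(confidence):
    """Apply the challenge's cutoff to a confidence score."""
    return confidence >= EFFECT_THRESHOLD


for p in (1.0, 0.05, 1e-10, 1e-30):
    score = confidence_from_p_value(p)
    print(f"p={p:g} confidence={score:.3f} effect={has_expression_effect(score)}")
```

Under this assumed scaling, the cutoff of 0.05 corresponds to a p-value of 10^-1.25, or about 0.056, consistent with the "~0.05" stated above.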

Prediction challenge

Predict the functional effect of single nucleotide variants on regulatory element activity in HepG2 cells.

Training data:

A tab delimited file of variant effects for about 25% of all measured SNVs (for a subset of chromosomes; i.e., chr13, chr14, chr15, chr18, chr19, chr20, chr21, chr22) is provided.

Training data set files have the following columns:

Chrom: GRCh37/hg19 chromosome name
Pos: Chromosomal position (1-based)
Strand: Strand (+/-)
Ref: Reference allele
Alt: Alternate allele
Value: Estimated variant effect from the BCalm analysis.
Confidence: Confidence score derived from the p-value. A ‘significant expression effect’ in the provided data is one assigned a confidence score greater than or equal to 0.05.
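As an illustration of how the training columns relate to the categorical labels requested in a submission, the following sketch reads a tab-delimited table and derives an effect flag and a direction from Value and Confidence. It assumes the file has a header row using the column names listed above; the toy rows are invented for the example and are not challenge data. The function name and the derived keys are hypothetical.

```python
import csv
import io

EFFECT_THRESHOLD = 0.05  # confidence cutoff for a significant expression effect


def load_training_rows(handle):
    """Parse training rows and attach derived Effect and Direction labels."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        value = float(record["Value"])
        confidence = float(record["Confidence"])
        effect = 1 if confidence >= EFFECT_THRESHOLD else 0
        if effect == 0 or value == 0:
            direction = 0                       # no significant effect
        else:
            direction = 1 if value > 0 else -1  # activating vs. repressive
        rows.append({**record, "Value": value, "Confidence": confidence,
                     "Effect": effect, "Direction": direction})
    return rows


# Invented toy rows in the assumed layout (not real measurements).
TOY_TSV = (
    "Chrom\tPos\tStrand\tRef\tAlt\tValue\tConfidence\n"
    "chr13\t1000\t+\tA\tG\t0.42\t0.31\n"
    "chr19\t2000\t-\tC\tT\t-0.05\t0.01\n"
    "chr22\t3000\t+\tG\tA\t-0.80\t0.05\n"
)
rows = load_training_rows(io.StringIO(TOY_TSV))
for row in rows:
    print(row["Chrom"], row["Pos"], row["Effect"], row["Direction"])
```

With a real file, the same function can be called on `open(path, newline="")` in place of the in-memory toy table.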

Test data:

Same format as the training set, but provided without effect sizes (Value) or confidence scores (Confidence).

Submission format

For the remaining ~75% of the data, participants are asked to submit their predictions based on the chromosome, position (1-based) and the reference and alternate allele.

The prediction submission is a tab-delimited text file. A validation script is provided, and predictors must check the correctness of the format before submitting their predictions.

Each data row in the submitted file must include the following columns:

Chrom: GRCh37/hg19 chromosome name
Pos: Chromosomal position (1-based)
Strand: Strand (+/-)
Ref: Reference allele
Alt: Alternate allele
Value: Quantitative prediction of the relative expression change (negative values indicating repression over wild-type/reference, positive values indicating activation over wild-type).
Confidence: Range 0.0 to 1.0 (1.0 implies total confidence in the assignment, 0 implies that it is a random assignment).
Effect: Direction-independent prediction of effect/no effect: submit 0 for no effect, 1 for a significant effect on transcriptional activity. A ‘significant effect’ is one assigned an experimental confidence score greater than or equal to 0.05 (refer to the Experiment section).
P_Effect: Probability that the effect/no effect assignment in the previous field is correct: range 0.0 to 1.0 (1.0 implies total confidence in the assignment).
Direction: Effect-dependent categorical prediction: submit -1 for repressive, 1 for activating, or 0 for no effect; i.e., the allele increases reporter expression (activating), reduces expression (repressive), or has an effect not significantly different from zero.
P_Direction: Probability of a correct assignment in the previous field: range 0.5 to 1.0 (1.0 implies total confidence in the assignment, 0.5 implies that it is a random assignment).
Comments: Optional comment based on the predictions for this variant.

Submissions will be evaluated based on distance and correlation measures. Note that other evaluation criteria may also be used, at the assessor’s discretion.
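The specific distance and correlation measures are not stated here. As one plausible illustration, the sketch below computes the Pearson correlation and the root-mean-square error between predicted and measured Value columns. Both choices are assumptions made for this example and may differ from what the assessors use; the toy numbers are invented.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def root_mean_square_error(predicted, observed):
    """Root-mean-square distance between predictions and measurements."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Invented toy values: predictions are the measurements shrunk by half.
observed = [-1.0, 0.0, 0.5, 2.0]
predicted = [-0.5, 0.0, 0.25, 1.0]
r = pearson_correlation(predicted, observed)
rmse = root_mean_square_error(predicted, observed)
print(f"Pearson r = {r:.3f}, RMSE = {rmse:.3f}")
```

The toy case shows why both kinds of measure are informative: predictions that are perfectly correlated with the measurements can still be far from them in absolute terms.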

In the template file, cells in columns 6-12 are marked with a dot ("."). Submit your predictions by replacing the "." with your value. No empty cells are allowed in the submission. You must submit predictions and confidence values for all the variants; if you are not confident in a prediction for a variant, enter an appropriately small confidence value for that prediction. Optionally, enter a brief comment based on the prediction. If you do not enter a comment on a prediction, leave the "." in those cells. Please make sure you follow the submission guidelines strictly.
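Before running the official validation script, a quick local check of the rules above can catch common mistakes. The sketch below is not the provided lentimpravalidation.py and does not reproduce its logic; it only encodes the column order, value ranges, and no-empty-cell rule stated in this section. It assumes Effect and Direction are written as plain integers. All names are hypothetical.

```python
import math

COLUMNS = ["Chrom", "Pos", "Strand", "Ref", "Alt", "Value", "Confidence",
           "Effect", "P_Effect", "Direction", "P_Direction", "Comments"]
RANGES = {"Confidence": (0.0, 1.0), "P_Effect": (0.0, 1.0), "P_Direction": (0.5, 1.0)}


def as_finite_float(text):
    """Return text as a finite float, or None if it is not one (e.g. '.')."""
    try:
        number = float(text)
    except ValueError:
        return None
    return number if math.isfinite(number) else None


def check_row(fields):
    """Return a list of problems for one tab-split data row (empty if none)."""
    if len(fields) != len(COLUMNS):
        return [f"expected {len(COLUMNS)} columns, found {len(fields)}"]
    row = dict(zip(COLUMNS, fields))
    problems = [f"{name} is empty" for name in COLUMNS if row[name].strip() == ""]
    if as_finite_float(row["Value"]) is None:
        problems.append("Value must be numeric")
    for name, (low, high) in RANGES.items():
        number = as_finite_float(row[name])
        if number is None or not low <= number <= high:
            problems.append(f"{name} must be a number in [{low}, {high}]")
    if row["Effect"] not in {"0", "1"}:
        problems.append("Effect must be 0 or 1")
    if row["Direction"] not in {"-1", "0", "1"}:
        problems.append("Direction must be -1, 0, or 1")
    return problems


# Invented example rows; "." is a valid placeholder only in the Comments column.
good = "chr1\t12345\t+\tA\tG\t0.8\t0.9\t1\t0.85\t1\t0.7\t.".split("\t")
bad = "chr1\t12345\t+\tA\tG\t.\t0.9\t1\t0.85\t1\t0.3\t.".split("\t")
print(check_row(good))
print(check_row(bad))
```

Applied to every data line of a submission file, an empty list for each row indicates that these basic rules are met; the official script remains the reference check.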

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

File naming

CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; the last submission before the deadline will be evaluated for each model. If you are submitting a single file with all predictions combined, please use the format below.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

To include a description of your method, use the following filename: <teamname>_desc.*

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
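The naming rules above can be checked with a regular expression. This sketch assumes no restriction on the characters allowed in a team name beyond it being non-empty, since none is stated; the pattern and function names are hypothetical.

```python
import re

# <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
PREDICTION_NAME = re.compile(r"(?P<team>.+)_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)")
# <teamname>_desc.* (any non-empty extension)
DESCRIPTION_NAME = re.compile(r"(?P<team>.+)_desc\..+")


def parse_prediction_name(filename):
    """Return (team, model number) for a valid prediction filename, else None."""
    match = PREDICTION_NAME.fullmatch(filename)
    if match is None:
        return None
    return match.group("team"), int(match.group("model"))


def is_description_name(filename):
    """Check whether a filename follows the method-description pattern."""
    return DESCRIPTION_NAME.fullmatch(filename) is not None


for name in ("bestincagi_model_3.txt", "bestincagi_model_7.txt", "bestincagi_desc.pdf"):
    print(name, parse_prediction_name(name), is_description_name(name))
```

Using `fullmatch` rather than `search` ensures that extra prefixes or suffixes, such as a trailing ".bak", are rejected.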

Download data

Download submission template file: lentimprasubmissiontemplate.tsv (available on Synapse)

Download training data: lentimpratrainingdata.tsv (available on Synapse)

Download submission validation script: lentimpravalidation.py (available on Synapse)

Dataset provided by

Arjun Devadas, Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, Lübeck, Germany.

Ryan D Hernandez, Department of Bioengineering and Therapeutic Sciences, Cardiovascular Research Institute, Quantitative Biosciences Institute, University of California San Francisco, San Francisco CA, USA

Nadav Ahituv, Department of Bioengineering and Therapeutic Sciences, Institute for Human Genetics, University of California San Francisco, San Francisco CA, USA

Martin Kircher, Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, Lübeck, Germany; Exploratory Diagnostic Sciences, Computational Genome Biology, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany

References

Agarwal V, et al. Massively parallel characterization of transcriptional regulatory elements. Nature (2025) 639(8054):411-420.

The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature (2015) 526(7571):68-74.

Gordon MG, et al. lentiMPRA and MPRAflow for high-throughput functional characterization of gene regulatory elements. Nat Protoc (2020) 15(8):2387-2412.

Klein JC, et al. A systematic evaluation of the design and context dependencies of massively parallel reporter assays. Nat Methods (2020) 17(11):1083-1091.

Keukeleire P, et al. Using individual barcodes to increase quantification power of massively parallel reporter assays. BMC Bioinformatics (2025) 26(1):52.

Revision history

4 June 2025: challenge preview posted

25 June 2025: challenge open

15 September 2025: submission deadline extended from September 15 to September 30

30 September 2025: submission deadline extended from September 30 to October 15

19 October 2025: challenge closed on October 15

Center for Critical Assessment of Genome Interpretation
