CAGI Scientist Opening: apply here ☀️

Predicting individual non-coding variant effects in disease associated promoter and enhancer elements

Challenge: Regulation Saturation

Dataset description: public

Dataset availability: registered users only

Last updated: 24 April 2018

This challenge closes at 8:00 PM PST (Pacific Standard Time) on 3 May 2018.

Download answer key and predictions: registered users only, limited by CAGI Data Use Agreement. The answer key and predictions are accessible to registered users only, and their use is limited by the CAGI Data Use agreement. Please log in to access the file.

Presentations from the CAGI 5 conference: registered users only, limited by CAGI Data Use Agreement. Presentations are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the file.


Due to technical reasons, the dataset includes positions 20bp up- and downstream of each construct which should have been removed. Please disregards the following positions:  F9:X 138612621; GP1BB:22 19710788; HBG1:11 5271309; HNF4A:20 42984159; IRF6:1 209989736; MSMB:10 51548987; PKLR:1 155271656; SORT1:1 109817273; HBB:11 5248439; IRF4:6 396142; LDLR:19 11199906; MYCrs6983267:8 128413073; ZFAND3:6 37775274 


17,500 single nucleotide variants (SNVs) in 5 human disease associated enhancers (including IRF4, IRF6, MYC, SORT1) and 9 promoters (including TERT, LDLR, F9, HBG1) were assessed in a saturation mutagenesis massively parallel reporter assay. Promoters were cloned into a plasmid upstream of a tagged reporter construct, and reporter expression was measured relative to the plasmid DNA to determine the impact of promoter variants. Enhancers were placed upstream of a minimal promoter and assayed similarly. The challenge is to predict the functional effects of these variants in the regulatory regions as measured from the reporter expression.


Non-coding variants are known to play an important role in a number of diseases, particularly complex trait disease and cancer. Although a number of specific disease relevant variants have been identified, these provide insufficient data to thoroughly test computational methods that aim to find such variants.

Kircher and collaborators generated variant-specific activity maps for 9 clinically relevant promoters and 5 clinically relevant enhancer sequences in a plasmid-based massively parallel reporter assay (MPRA). The approach was validated by replication and by confirming that previously known disease-causing variants were found. For example, all previously known activating mutations in the TERT core promoter were rediscovered. The dataset includes quantitative expression measurements for more than 17,500 SNVs. These experiments provide a rich dataset for benchmarking predictive models of non-coding variant effects, a database for the interpretation of potentially disease-causing regulatory mutations, and the potential for providing critical insight that will aid the development of improved computational tools.


Briefly, the enhancers are characterized by a reporter assay that links a candidate enhancer sequence to a minimal promoter (a promoter that results in minimal reporter expression without a functional enhancer) and a reporter gene (e.g., GFP, LacZ, luciferase or others) whose RNA and DNA includes a unique sequence (tag). The reporter vectors are introduced into cell lines, and the reporter gene expression for each variant is examined relative to the amount of its plasmid DNA. If the candidate sequence acts as an enhancer, it will increase promoter activity and the reporter gene expression in the tissue/cell type of interest. High-throughput sequencing was carried out to determine the function of each enhancer variant (Inoue & Ahituv, 2015) by measuring abundance of the tag in RNA and DNA, as well as tag association with candidate enhancer sequence variants.

Like enhancers, promoter candidate sequences are also cloned into a plasmid upstream of a tagged reporter, and reporter expression is again measured relative to the plasmid DNA to determine the impact of promoter variants (Patwardhan et al., 2009; Tewhey et al., 2016).

In detail, the underlying MPRA libraries (50k-2M) were derived from PCR-based saturation mutagenesis of regulatory regions of up to 600bp length. Changes to functional sequences with a rate of 1 per 100 bases were created randomly. The resulting PCR products with approximately 1/100 changes compared to the template region were integrated in a plasmid library containing random tag sequences. Promoter and enhancer sequences of interest (Table 1 & 2) were confirmed by Sanger sequencing. RNA and DNA was pooled, gel-purified, and sequenced. Paired End sequencing of the RNA and DNA for each of three replicates were sequenced on an Illumina NextSeq instrument to derive variant-specific activity maps for the promoter and enhancer sequences. The counts of transcribed RNA barcodes were put in relation to DNA barcode counts of the transfected plasmid library. The relative abundance of each tag provides a digital readout of the transcriptional efficiency of its cis-linked mutant promote or enhancer.

Specifically, a multiple linear regression model of log2(RNA) ~ log2(DNA) + N + offset (were RNA and DNA are counts observed for all n tags, N is a binary matrix associating n tags to m sequence variants and offset is the models' offset normalizing total DNA to RNA counts) was used to assign sequence effects. From the fit, m coefficients (corresponding to the columns of matrix N) were obtained and assigned as the effects of each sequence variant (SNV). A confidence score was derived by capping the p-value of the multiple linear regression at 10-50 and scaling the log10-transformed value between 0 and 1 (i.e. 1 corresponding to a p-value of ≤10-50, 0.5 of 10-25, 0 to a p-value of 1). We deemed a confidence score greater or equal to 0.1 (p-value of 10-5) indicates that the SNV "has an expression effect" while <0.1 means "has no expression effect".

Table 1: Tested promoter sequences.

Promoter    Genomic coordinates (GRCh38/hg19)    Associated Phenotype    Vector    Cell line    Transf. time (hr)    Fold Change    Construct size (bp)

F9    chrX:138612624-138612923    Hemophilia B    pGL4.11b    HepG2    24    2.6    300

GP1BB    chr22:19710790-19711173    Bernard-Soulier Syndrome    pGL4.11b    HEL 92.1.7    24    22.1    384

HBB    chr11:5248252-5248438    Thalassemia    pGL4.11b    HEL 92.1.7    24    14.3    187

HBG    chr11:5271035-5271308    Hereditary persistence of fetal hemoglobin    pGL4.11b    HEL 92.1.7    24    118.1    274

HNF4A (P2)    chr20:42984160-42984444    Maturity-onset diabetes of the young (MODY)    pGL4.11b    HEK293T    24    2.8    285

LDLR    chr19:11199907-11200224    Familial hypercholesterolemia    pGL4.11b    HepG2    24    110.7    318

MSMB    chr10:51548988-51549578    Prostate cancer    pGL4.11b    HEK293T    24    8.4    591

PKLR    chr1:155271187-155271655    Pyruvate kinase deficiency    pGL4.11b    K562    48    29.4    469

TERT    chr5:1295105-1295362    Various types of cancer    pGL4.11b    HEK293T, GBM    24    231.8    258

Table 2: Tested enhancer sequences.

Enhancer    Genomic coordinates (hg19)    Phenotype caused by mutations    Vector    Cell line    Transf. time (hr)    Fold Change    Construct size (bp)

IRF4    chr6:396143-396593    Human pigmentation    pGL4.23    SK-MEL-28    24    44.5    451

IRF6    chr1:209989135-209989734    Cleft lip    pGL4.23    HaCaT    24    17.0    600

MYC    chr8:128413074-128413673    Various types of cancer (rs6983267)    pGL4.23    HEK293T    32, 20nM LiCl added after 24hr    0.8    600

SORT1    chr1:109817274-109817873    Plasma low-density lipoprotein cholesterol & myocardial infraction    pGL4.23    HepG2    24    235.3    600

ZFAND3    chr6:37775276-37775853    Type 2 diabetes    pGL4.23    MIN6    24    14.3    578

Dataset description

Training data: A tab delimited file of variant effects for about 25% of all measured alleles is provided. Training data set files have the following columns:

Test data files have the following columns:

Additional columns in the training data are columns derived from Value and Confidence to match the information requested in the submission format. 

Prediction submission format 

For the remaining 75% of the data, participants are asked to submit their predictions based on the chromosome, position (1-based) and the reference and alternate alleles.

The prediction submission is a tab-delimited text file. The organizers provide a template file for all promoters and enhancers, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions.

Each data row in the submitted file must include the following columns:

Note that other evaluation criteria may also be used, at the assessor’s discretion.

In the template file, cells in columns 5-9 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. For a given subset, you must submit predictions and standard errors for all or none of the variants; if you are not confident in a prediction for a variant, enter an appropriate large standard error for the prediction. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in those cells. Please make sure you follow the submission guidelines strictly.

In addition, your submission should include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Additional information

Note that many predictions of repressive effects may be small, i.e. not significantly different from a zero effect size. In consideration of this, the challenge will be evaluated in multiple ways, as described. Predictors should also note that expression effects are evaluated in a plasmid system and not the full regulatory environment of the cell.

Table 3 provides correlation coefficients obtained from quantifying effects across the three experimental replicates. These values will serve as the upper limit of score correlations obtained in analyses 9 and 11 and correlation coefficients will be rescaled using these values to correct for differences in experimental noise between different loci

Table 3: Pearson correlation of variant effect estimates obtained from fitting SNV effects across the three experimental replicates each.

Name                          Replicate correlation

F9                                0.61

GP1BB                        0.74

HBB                             0.62

HBG1                          0.78

HNF4A                        0.75

IRF4                             0.98

IRF6                             0.90

LDLR                           0.99

MSMB                         0.75

MYC (rs6983267)     0.55

PKLR                           0.79

SORT1                         0.98

TERT (GBM)               0.90

TERT (HEK293T)       0.65

ZFAND3                      0.72

Download Figure and tables: This dataset file is available only to registered users. Please log in to access the file.

Figure 1

Download training dataset: This dataset file is available only to registered users. Please log in to access the file.

Download dataset: This dataset file is available only to registered users. Please log in to access the file.

Download submission template: This submission template file is available only to registered users. Please log in to access the file.

Download validation script: This submission validation script is available only to registered users. Please log in to access the file.

Dataset provided by

Martin Kircher (1, 2), Fumitaka Inoue (3), Chenling Xiong (3), Beth Martin (2), Nadav Ahituv (3), Jay Shendure (2)

1.Translational Genomics Center, Berlin Institute of Health, Berlin, Germany 2.Department of Genome Sciences, University of Washington, Seattle WA, USA 3.Department of Bioengineering and Therapeutic Sciences, Institute for Human Genetics, University of California San Francisco, San Francisco CA, USA


Chong JX, Buckingham KJ, Jhangiani SN, Boehm C, Sobreira N, Smith JD, Harrell TM, McMillin MJ, Wiszniewski W, Gambin T, Akdemir ZH. 2015. The Genetic Basis of Mendelian Phenotypes: Discoveries, Challenges, and Opportunities. Am J Hum Genet 97:199-215. PMCID:PMC4573249. doi:10.1016/j.ajhg.2015.06.009

Petersen B-S, Fredrich B, Hoeppner MP, Ellinghaus D, Franke A. 2017. Opportunities and challenges of whole-genome and-exome sequencing. BMC Genet 18:14. PMCID:PMC5307692. doi:10.1186/s12863-017-0479-5

Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. 2014. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46:310-315. PMCID:PMC3992975. doi:10.1038/ng.2892

Zhou J, Troyanskaya OG. 2015. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 12:931-934. PMCID:PMC4768299. doi:10.1038/nmeth.3547

Ionita-Laza I, McCallum K, Xu B, Buxbaum JD. 2016. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat Genet 48:214-220. PMCID:PMC4731313. doi:10.1038/ng.3477

Shihab HA, Rogers MF, Gough J, Mort M, Cooper DN, Day IN, Gaunt TR, Campbell C. 2015. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics. PMCID: PMC4426838 DOI: 10.1093/bioinformatics/btv009

Fu Y, Liu Z, Lou S, Bedford J, Mu XJ, Yip KY, Khurana E, Gerstein M. 2014. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol 15:480. PMCID:PMC4203974. doi:10.1186/s13059-014-0480-5

Ritchie GR, Dunham I, Zeggini E, Flicek P. 2014. Functional annotation of noncoding sequence variants. Nat Methods 11:294-296. PMCID:PMC5015703. doi:10.1038/nmeth.2832

Smedley D, Schubach M, Jacobsen JO, Köhler S, Zemojtel T, Spielmann M, Jäger M, Hochheiser H, Washington NL, McMurry JA, Haendel MA. 2016. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease. Am J Hum Genet 99:595-606. PMCID:PMC5011059. doi:10.1016/j.ajhg.2016.07.005

Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, Lee C, Andrie JM, Lee SI, Cooper GM, Ahituv N. 2012. Massively parallel functional dissection of mammalian enhancers in vivo. Nat Biotechnol 30:265-270. PMCID:PMC3402344. doi:10.1038/nbt.2136

Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. PMCID: PMC2705234 doi:10.1093/bioinformatics/btp324

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987-2993. PMCID:PMC3198575. doi:10.1093/bioinformatics/btr509

Inoue F, Kircher M, Martin B, Cooper GM, Witten DM, McManus MT, Ahituv N, Shendure J. 2017. A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res 27:38-52. PMCID:PMC5204343. doi:10.1101/gr.212092.116

Kircher M. 2012. Analysis of high-throughput ancient DNA sequencing data. Ancient DNA: methods and protocols. Methods Mol Biol 840:197-228. doi:10.1007/978-1-61779-516-9_23

Revision history 

4 Jan 2018 (v01): initial release 

16 Apr 2018 (v01): Note added and new challenge closure date 

24 Sep 2018 (v02): Dataset availability added