Predicting individual non-coding variant effects in disease associated promoter and enhancer elements

Challenge: Regulation Saturation
Dataset description: public
Dataset availability: Registered users only
Last updated: 24 Apr 2018
This challenge will tentatively close at 8:00 PM PST (Pacific Standard Time) on 3 May 2018.

Download answer key and predictions : registered users only, limited by CAGI Data use agreement

The answer key and predictions are accessible to registered users only, and their use is limited by the CAGI Data use agreement. Please log in to access the file.

Presentations from the CAGI 5 conference: registered users only, limited by CAGI Data use agreement
Presentations are accessible to registered users only, and their use is limited by the CAGI Data use agreement. Please log in to access the file.

Due to technical reasons, the dataset includes positions 20bp up- and downstream of each construct which should have been removed. Please disregards the following positions:  F9:X  138612621; GP1BB:22      19710788; HBG1:11       5271309; HNF4A:20      42984159; IRF6:1        209989736;  MSMB:10       51548987; PKLR:1        155271656; SORT1:1       109817273; HBB:11  5248439; IRF4:6  396142; LDLR:19 11199906; MYCrs6983267:8  128413073; ZFAND3:6        37775274

17,500 single nucleotide variants (SNVs) in 5 human disease associated enhancers (including IRF4, IRF6, MYC, SORT1) and 9 promoters (including TERT, LDLR, F9, HBG1) were assessed in a saturation mutagenesis massively parallel reporter assay. Promoters were cloned into a plasmid upstream of a tagged reporter construct, and reporter expression was measured relative to the plasmid DNA to determine the impact of promoter variants. Enhancers were placed upstream of a minimal promoter and assayed similarly. The challenge is to predict the functional effects of these variants in the regulatory regions as measured from the reporter expression.

Non-coding variants are known to play an important role in a number of diseases, particularly complex trait disease and cancer. Although a number of specific disease relevant variants have been identified, these provide insufficient data to thoroughly test computational methods that aim to find such variants.

Kircher and collaborators generated variant-specific activity maps for 9 clinically relevant promoters and 5 clinically relevant enhancer sequences in a plasmid-based massively parallel reporter assay (MPRA). The approach was validated by replication and by confirming that previously known disease-causing variants were found. For example, all previously known activating mutations in the TERT core promoter were rediscovered. The dataset includes quantitative expression measurements for more than 17,500 SNVs. These experiments provide a rich dataset for benchmarking predictive models of non-coding variant effects, a database for the interpretation of potentially disease-causing regulatory mutations, and the potential for providing critical insight that will aid the development of improved computational tools.

Briefly, the enhancers are characterized by a reporter assay that links a candidate enhancer sequence to a minimal promoter (a promoter that results in minimal reporter expression without a functional enhancer) and a reporter gene (e.g., GFP, LacZ, luciferase or others) whose RNA and DNA includes a unique sequence (tag). The reporter vectors are introduced into cell lines, and the reporter gene expression for each variant is examined relative to the amount of its plasmid DNA. If the candidate sequence acts as an enhancer, it will increase promoter activity and the reporter gene expression in the tissue/cell type of interest. High-throughput sequencing was carried out to determine the function of each enhancer variant (Inoue & Ahituv, 2015) by measuring abundance of the tag in RNA and DNA, as well as tag association with candidate enhancer sequence variants.

Like enhancers, promoter candidate sequences are also cloned into a plasmid upstream of a tagged reporter, and reporter expression is again measured relative to the plasmid DNA to determine the impact of promoter variants (Patwardhan et al., 2009; Tewhey et al., 2016).

In detail, the underlying MPRA libraries (50k-2M) were derived from PCR-based saturation mutagenesis of regulatory regions of up to 600bp length. Changes to functional sequences with a rate of 1 per 100 bases were created randomly. The resulting PCR products with approximately 1/100 changes compared to the template region were integrated in a plasmid library containing random tag sequences. Promoter and enhancer sequences of interest (Table 1 & 2) were confirmed by Sanger sequencing. RNA and DNA was pooled, gel-purified, and sequenced. Paired End sequencing of the RNA and DNA for each of three replicates were sequenced on an Illumina NextSeq instrument to derive variant-specific activity maps for the promoter and enhancer sequences. The counts of transcribed RNA barcodes were put in relation to DNA barcode counts of the transfected plasmid library. The relative abundance of each tag provides a digital readout of the transcriptional efficiency of its cis-linked mutant promote or enhancer.

Specifically, a multiple linear regression model of log2(RNA) ~ log2(DNA) + N + offset (were RNA and DNA are counts observed for all n tags, N is a binary matrix associating n tags to m sequence variants and offset is the models' offset normalizing total DNA to RNA counts) was used to assign sequence effects. From the fit, m coefficients (corresponding to the columns of matrix N) were obtained and assigned as the effects of each sequence variant (SNV). A confidence score was derived by capping the p-value of the multiple linear regression at 10-50 and scaling the log10-transformed value between 0 and 1 (i.e. 1 corresponding to a p-value of ≤10-50, 0.5 of 10-25, 0 to a p-value of 1). We deemed a confidence score greater or equal to 0.1 (p-value of 10-5) indicates that the SNV "has an expression effect" while <0.1 means "has no expression effect".

Table 1: Tested promoter sequences.

PromoterGenomic coordinates (GRCh38/hg19)Associated PhenotypeVectorCell line Transf. time (hr)Fold ChangeConstruct size (bp)
F9chrX:138612624-138612923Hemophilia BpGL4.11bHepG2242.6300
GP1BBchr22:19710790-19711173Bernard-Soulier SyndromepGL4.11bHEL 92.1.72422.1384
HBB chr11:5248252-5248438ThalassemiapGL4.11bHEL 92.1.72414.3187
HBG chr11:5271035-5271308Hereditary persistence of fetal hemoglobinpGL4.11bHEL 92.1.724118.1274
HNF4A (P2)chr20:42984160-42984444Maturity-onset diabetes of the young (MODY)pGL4.11bHEK293T242.8285
LDLRchr19:11199907-11200224 Familial hypercholesterolemiapGL4.11bHepG224110.7318
MSMBchr10:51548988-51549578Prostate cancerpGL4.11bHEK293T248.4591
PKLRchr1:155271187-155271655Pyruvate kinase deficiencypGL4.11bK5624829.4469
TERTchr5:1295105-1295362Various types of cancerpGL4.11bHEK293T, GBM24231.8258

Table 2: Tested enhancer sequences.

EnhancerGenomic coordinates (hg19)Phenotype caused by mutationsVectorCell line Transf. time (hr)Fold ChangeConstruct size (bp)
IRF4chr6:396143-396593 Human pigmentationpGL4.23SK-MEL-282444.5451
IRF6chr1:209989135-209989734 Cleft lippGL4.23HaCaT2417.0600
MYCchr8:128413074-128413673 Various types of cancer (rs6983267)pGL4.23HEK293T32, 20nM LiCl added after 24hr0.8600
SORT1chr1:109817274-109817873Plasma low-density lipoprotein cholesterol & myocardial infractionpGL4.23HepG224235.3600
ZFAND3 chr6:37775276-37775853 Type 2 diabetespGL4.23MIN62414.3578

Dataset description
Training data: A tab delimited file of variant effects for about 25% of all measured alleles is provided. Training data set files have the following columns:

Test data set files have the following columns:

  1. Chrom - GRCh37/hg19 chromosome name
  2. Pos - -Chromosomal position (1-based)
  3. Ref - -Reference allele
  4. Alt - Alternate allele
  5. Direction - Estimated variant effect from the experimental data. Categorical prediction of repressive. (submitted value -1)/activating (submit a value 1)/no effect (submit a value 0), i.e. allele effect is increasing reporter expression (activating), reducing expression (repressive), or not significantly different from zero (confidence score below 0.1).
  6. Confidence - Confidence score derived from the p-value of the linear regression fit to the experimental data. A ‘significant expression effect’ in the provided data is one assigned a confidence score greater than or equal to 0.1.
Additional columns in the training data are columns derived from Value and Confidence to match the information requested in the submission format.

Prediction submission format
For the remaining 75% of the data, participants are asked to submit their predictions based on the chromosome, position (1-based) and the reference and alternate alleles.

The prediction submission is a tab-delimited text file. The organizers provide a template file for all promoters and enhancers, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions.

Each data row in the submitted file must include the following columns:

  1. Chrom - GRCh37/hg19 chromosome name
  2. Pos - Chromosomal position (1-based)
  3. Ref - Reference allele
  4. Alt - Alternate allele
  5. Promoter_Enhancer
  6. Direction - Effect of the variant: Repressive = submitted value of -1; activating = submit a value of 1; no effect = submit a value 0), i.e. allele effect is increasing reporter expression (activating), reducing expression (repressive), or not significantly different from zero.
  7. P_Direction -Probability of a correct assignment in the previous field: range 0.0 to 1.0 (1.0 implies total confidence in the assignment, 0 implies that it is a random assignment).
  8. Confidence - Prediction of the Kircher et al’s confidence score which is a re-scale of the p-value fit to the experimental data. A ‘significant expression effect’ in the provided data is one assigned a confidence score greater than or equal to 0.1.
  9. SD - Standard error of the value in the preceding field.
  10. Comments - Optional comment on the basis of the predictions for this variant

Note that other evaluation criteria may also be used, at the assessor’s discretion.

In the template file, cells in columns 5-9 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. For a given subset, you must submit predictions and standard errors for all or none of the variants; if you are not confident in a prediction for a variant, enter an appropriate large standard error for the prediction. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in those cells. Please make sure you follow the submission guidelines strictly.

In addition, your submission should include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Additional information Note that many predictions of repressive effects may be small, i.e. not significantly different from a zero effect size. In consideration of this, the challenge will be evaluated in multiple ways, as described. Predictors should also note that expression effects are evaluated in a plasmid system and not the full regulatory environment of the cell.

Table 3 provides correlation coefficients obtained from quantifying effects across the three experimental replicates. These values will serve as the upper limit of score correlations obtained in analyses 9 and 11 and correlation coefficients will be rescaled using these values to correct for differences in experimental noise between different loci

Table 3: Pearson correlation of variant effect estimates obtained from fitting SNV effects across the three experimental replicates each.

NameReplicate correlation
MYC (rs6983267)0.55
TERT (GBM)0.90
TERT (HEK293T)0.65

Download Figure and tables
This dataset file is available only to registered users. Please log in to access the file.

Figure 1

Download training dataset
This dataset file is available only to registered users. Please log in to access the file.

Download dataset
This dataset file is available only to registered users. Please log in to access the file.

Download submission template
This submission template file is available only to registered users. Please log in to access the file.

Download validation script
This submission validation script is available only to registered users. Please log in to access the file.


  1. Chong JX, Buckingham KJ, Jhangiani SN, Boehm C, Sobreira N, Smith JD, Harrell TM, McMillin MJ, Wiszniewski W, Gambin T, Akdemir ZH. 2015. The Genetic Basis of Mendelian Phenotypes: Discoveries, Challenges, and Opportunities. Am J Hum Genet 97:199-215. PMCID:PMC4573249. doi:10.1016/j.ajhg.2015.06.009
  2. Petersen B-S, Fredrich B, Hoeppner MP, Ellinghaus D, Franke A. 2017. Opportunities and challenges of whole-genome and-exome sequencing. BMC Genet 18:14. PMCID:PMC5307692. doi:10.1186/s12863-017-0479-5
  3. Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. 2014. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46:310-315. PMCID:PMC3992975. doi:10.1038/ng.2892
  4. Zhou J, Troyanskaya OG. 2015. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 12:931-934. PMCID:PMC4768299. doi:10.1038/nmeth.3547
  5. Ionita-Laza I, McCallum K, Xu B, Buxbaum JD. 2016. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat Genet 48:214-220. PMCID:PMC4731313. doi:10.1038/ng.3477
  6. Shihab HA, Rogers MF, Gough J, Mort M, Cooper DN, Day IN, Gaunt TR, Campbell C. 2015. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics. PMCID: PMC4426838 DOI: 10.1093/bioinformatics/btv009
  7. Fu Y, Liu Z, Lou S, Bedford J, Mu XJ, Yip KY, Khurana E, Gerstein M. 2014. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol 15:480. PMCID:PMC4203974. doi:10.1186/s13059-014-0480-5
  8. Ritchie GR, Dunham I, Zeggini E, Flicek P. 2014. Functional annotation of noncoding sequence variants. Nat Methods 11:294-296. PMCID:PMC5015703. doi:10.1038/nmeth.2832
  9. Smedley D, Schubach M, Jacobsen JO, Köhler S, Zemojtel T, Spielmann M, Jäger M, Hochheiser H, Washington NL, McMurry JA, Haendel MA. 2016. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease. Am J Hum Genet 99:595-606. PMCID:PMC5011059. doi:10.1016/j.ajhg.2016.07.005
  10. Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, Lee C, Andrie JM, Lee SI, Cooper GM, Ahituv N. 2012. Massively parallel functional dissection of mammalian enhancers in vivo. Nat Biotechnol 30:265-270. PMCID:PMC3402344. doi:10.1038/nbt.2136
  11. Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. PMCID: PMC2705234 doi:10.1093/bioinformatics/btp324
  12. Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987-2993. PMCID:PMC3198575. doi:10.1093/bioinformatics/btr509
  13. Inoue F, Kircher M, Martin B, Cooper GM, Witten DM, McManus MT, Ahituv N, Shendure J. 2017. A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res 27:38-52. PMCID:PMC5204343. doi:10.1101/gr.212092.116
  14. Kircher M. 2012. Analysis of high-throughput ancient DNA sequencing data. Ancient DNA: methods and protocols. Methods Mol Biol 840:197-228. doi:10.1007/978-1-61779-516-9_23

Dataset provided by

M.Kircher I.Fumitaka X.Chenling B.Martin N.Ahituv J.Shendure

Martin Kircher(1,2), Fumitaka Inoue(3), Chenling Xiong(3), Beth Martin(2), Nadav Ahituv(3), Jay Shendure(2)

1.Translational Genomics Center, Berlin Institute of Health, Berlin, Germany 2.Department of Genome Sciences, University of Washington, Seattle WA, USA 3.Department of Bioengineering and Therapeutic Sciences, Institute for Human Genetics, University of California San Francisco, San Francisco CA, USA

Revision History
4 Jan 2018 (v01): initial release
16 Apr 2018 (v01): Note added and new challenge closure date
24 Sep 2018 (v02): Dataset availability added