Identify regulatory sequences and eQTL-causal variants, and estimate their effects on activation of transcription in a massively parallel reporter assay

Challenge: eQTL-causal SNPs
Dataset description: public
Variant data: registered users only, limited by CAGI Data use agreement
Last updated: 12 Apr 2016
This challenge closed at 9:00 PM PST (Pacific Standard Time) on 10 December 2015.


Download answer key, predictions, and assessment: registered users only, limited by CAGI Data use agreement
The answer key, predictions, and assessment files are accessible to registered users only, and their use is limited by the CAGI Data use agreement. Please log in to access the file.


Presentations from the CAGI 4 conference: registered users only, limited by CAGI Data use agreement
Presentation are accessible to registered users only, and their use is limited by the CAGI Data use agreement. Please log in to access the file.

Summary
Identifying the causal alleles responsible for variation in expression of human genes has been particularly difficult. This is an important problem, as genome-wide association studies (GWAS) suggest that much of the variation underlying common traits and diseases maps within regions of the genome that do not encode protein. A massively parallel reporter assay (MPRA) has been applied to thousands of single nucleotide polymorphisms (SNPs) and small insertion/deletion polymorphisms in linkage disequilibrium (LD) with cis-expression quantitative trait loci (eQTLs). The results identify variants showing differential expression between alleles. The challenge is to identify the regulatory sequences and the expression-modulating variants (emVars) underlying each eQTL and estimate their effects in the assay.

Background
GWAS and whole genome scans for natural selection have identified numerous loci linked to human traits and diseases, but correlation between nearby polymorphisms within individual associations often leaves dozens to hundreds of potential causal variants to be interrogated [1,2]. Mounting evidence suggests that at the majority of these loci, the causal variant(s) is a non-coding regulatory change rather than an amino acid substitution [3,4]. Indeed, regulatory changes drive some of the best-understood examples of phenotypic diversity and adaptive evolution [5,6].

The Sabeti lab has improved the sensitivity and reproducibility of the MPRA, adapting it to identify variants that directly modulate gene expression. They have applied the assay to 9,116 variants linked to 3,157 eQTLs. The eQTLs were drawn from the Geuvadis RNA-seq dataset of lymphoblastoid cell lines (LCLs) from individuals of European ancestry [7,8]. The assay detected eQTL-causal alleles with an estimated sensitivity of about 20% and a specificity of >99%. The identified causal variants include a number of well-annotated variants associated with diseases and traits.

With regard to the low sensitivity of the experiment, MPRA is designed to measure the activity of promoter and enhancer sequences; it is not designed to detect other sequences such as those modulating mRNA splicing or mRNA stability. A recent estimate is that less than 71% of eQTL-causal variants map within non-transcribed regions, with only 23% mapping to well defined promoter and enhancer regions [3]. In addition, MPRA is a synthetic assay system that measures the activity of short sequences cloned in an episome and thus isolated from their broader sequence and chromatin context. Therefore, the assay captures only a subset of mechanisms that likely cause regulatory differences. Note that the study on which this challenge is based has observed that the effects of emVars in the MPRA are only weakly correlated with the associated eQTL effects, and can often have the opposite sign. Thus, the eQTL effect (eQTL beta) is not a good predictor of the regulatory effect measured in the MPRA.

Experiment
The eQTL-associated variants were tested for their ability to drive changes in gene expression. Each eQTL region should have one or more variants that actually changes expression in its genomic context, and typically multiple other non-functional variants that are also associated because of LD with the causal variant. A subset of the eQTL-associated sequences tested in the assay results in significantly enhanced expression in the MPRA (i.e. Regulatory Hits), and a subset of these results in significant allele-specific difference in expression (emVars).

The MPRA was modified to increase throughput while also improving reproducibility and sensitivity. For each eQTL, all variants (range = 1 to 205, mean = 2.87, median = 1) in perfect LD with the top eQTL variant were identified. For each variant, a pair of 150-nt candidate regulatory sequence oligonucleotides was synthesized with the variant located at the central position (i.e. SNP at position 76). For insertion-deletion variants, the longer of the two alleles was designed as a 150-nt oligonucleotide; the shorter allele was then designed with the same flanking sequences as the longer allele (e.g., for a single-nucleotide InDel TC/C: 74N[TC]74N and 74N[T]74N). To accommodate a large library complexity and increase assay sensitivity, 20-nt barcodes were added to the oligos by emulsion PCR, such that each oligo is represented by an average of a thousand barcode tags within the plasmid pool. The barcoded oligos were cloned into a plasmid such that a minimal TATA-box promoter and a GFP reporter gene were inserted between the 150-bp test sequence and the 20-bp barcode.

The plasmid library was electroporated into two lymphoblastoid cell lines: five and three technical replicate experiments were performed using the cell lines NA12878 and NA19239, respectively. The assay results were averaged across the eight replicates. Twenty-four hours after transfection, the GFP reporter mRNA was captured by hybridization, and RNA sequencing of the 3'-UTR-adjacent barcodes was performed to quantify the influence of each 150-bp sequence on regulation of the reporter gene. RNA expression measured in barcode read counts was normalized relative to the input plasmid barcode counts determined by DNA sequencing. This new barcoding approach decreased inter-experimental noise and allowed the application a parametric statistical framework during analysis. Note that NA12878 is an ENCODE tier 1 cell line; results of a large number of genomic and epigenomic assays performed on this cell line may be found at https://genome.ucsc.edu/ENCODE/downloads.html.

Prediction challenge
The challenge is presented in two parts. Participants are encouraged but not required to submit predictions for both parts. To set up the challenge, we first divided the data into three subsets, each corresponding to potential regulatory sequences and variants associated with a representative subset of eQTLs.

The first subset of 3,044 variants associated with 1,052 eQTLs is provided as a sample. The sample dataset table provides the rsID, chromosome and position of the centered variant in hg19, and the sequences of the reference allele and alternate allele, for each potential regulatory sequence. It includes normalized plasmid counts (ctrl.mean), RNA counts (exp.mean), log2 fold expression level (log2FC; RNA/plasmid), expression p-value (logP), and BF corrected p-value (logPadj) for the reference allele (C.A) and the alternative allele (C.B.) from the combined LCL analysis (8 replicates). Next, it includes the log2 ratio (alternative/refrence) of expression (LogSkew.Comb), allelic skew p-value (C.Skew.logP), and allelic skew fdr (C.Skew.fdr). Next, the table includes the Regulatory_Hit (0 = no significant effect, 1 = regulatory effect) and emVar_Hit (0 = no significant effect, 1 = allelic skew) results. Finally, the table include the lead variant in the eQTL analysis, the eQTL associated gene, the eQTL beta, the eQTL t-statistic, and the eQTL p-value.

In the first part of the challenge, the 3,006 potential regulatory sequences and variants associated with a distinct subset of 1,050 eQTLs are provided. Participants are asked to submit predictions of the regulatory sequences that lead to significant activation of reporter gene expression (i.e. Regulatory Hit). Predictors should provide the probablity that each sequence is a Regulatory Hit and a predicted log2 expression level for the reference allele and the alternate allele (i.e. log2FC). In addition, each prediction must include a standard deviation indicating confidence; large SD means low confidence, and small SD means high confience in the submitted prediciton.

In the second part of the challenge, the 401 regulatory sequences associated with the third distinct subset of 1,055 eQTLs are provided (the 401 Regulatory Hits, excluding the 2,665 other potential regulatory sequences). Participants are asked to submit predictions of the subset of these containing an emVar (i.e. emVar Hit). Participants should provide the probablity that each variant is an emVar and predict the log2 ratio of the expression level of the alternative allele relative to the reference allele (i.e. LogSkew.Comb). In addition, each prediction must include a standard deviation indicating confidence; large SD means low confidence, and small SD means high confience in the submitted prediciton.

Download sample dataset
This dataset file is available only to registered users. Please log in to access the file.

Download challenge part 1 dataset
This dataset file is available only to registered users. Please log in to access the file.

Download part 1 submission template
This submission template file is available only to registered users. Please log in to access the file.

Download part 1 submission validation script
This submission validation script is available only to registered users. Please log in to access the file.

Download challenge part 2 dataset
This dataset file is available only to registered users. Please log in to access the file.

Download part 2 submission template
This submission template file is available only to registered users. Please log in to access the file.

Download part 2 submission validation script
This submission validation script is available only to registered users. Please log in to access the file.

Prediction submission format
The prediction submission for each part of the challenge is a tab-delimited text file. Organizers provide template files, which must be used for submission. In addition, a validation scripts are provided, and predictors should check the correctness of the formats before submitting their predictions.

In the submitted file for part 1, each row should include the following columns:

  1. ID – The identifier for the potential regulatory sequence (containing the reference and alternative alleles) as listed in the prediction dataset file.
  2. C.A.log2FC (expression level, reference allele) – Predicted log2 expression level of reporter expression of the reference allele
  3. SD – standard deviation of the prediction in column 2
  4. C.B.log2FC (expression level, alternate allele) – Predicted log2 expression level of reporter expression of the alternate allele
  5. SD – standard deviation of the prediction in column 4
  6. Regulatory_Hit – Estimated probability that the sequences have significant regulatory activity (0 = no activity, 1 = regulatory activity)
  7. SD – standard deviation of the prediction in column 6

In the submitted file for part 2, each row must include the following columns:

  1. ID – The identifier for the potential regulatory sequence (containing the reference and alternative alleles) as listed in the prediction dataset file.
  2. LogSkew.Comb (allelic skew) – Predicted log2 expression ratio of reporter expression of the alternative allele relative to the reference allele
  3. SD – standard deviation of the prediction in column 2
  4. emVar_Hit (expression-modulating variant) – Estimated probability that the variant results in a significant allele-specific difference in regulatory activity (0 = no allelic skew, 1 = allelic skew)
  5. SD – standard deviation of the prediction in column 4
  6. Comment – optional brief comment on the basis of the predictions in columns 2 an

In the template files, cells in columns 2-8 (part 1) or 2-6 (part 2) are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submissions. You may submit predictions for part 1, part 2, or both parts of the challenge. You must enter a prediction and standard deviation for every prediction field (columns) and every sequence (rows); if you are not confident in a prediction, enter a large standard deviation for the prediction. Optionally, enter a brief comment on the basis of the predictions for a sequence;,otherwise, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.

In addition, your submission should include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

References

  1. Grossman, S.R., et al. (2013). Identifying recent adaptations in large-scale genomic data. Cell 152(4):703-13.
  2. Schaub, M., Boyle, A., Kundaje, A., Batzoglou, S., and Snyder, M. (2012). Linking disease associations with regulatory information in the human genome. Genome Res. 22:1748-59.
  3. Farh, K., et al. (2015). Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518:337-43.
  4. Maurano, M., et al. (2012). Systematic localization of common disease-associated variation in regulatory DNA. Science 337:1190-5.
  5. Musunuru, K., et al. (2010). From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466:714-9.
  6. Tishkoff, S., et al. (2006). Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39:31-40.
  7. The 1000 Genomes Project Consortium. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491:56-65.
  8. Lappalainen, T., et al. (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501:506-11.

Data provided by
ryan pardis
Ryan Tewhey and Pardis Sabeti, Broad Institute

Updates
4 Sep 2015 (v01): initial release
28 Oct 2015 (v02): submission instructions and templates updated, validation scripts provided
30 Oct 2015 (v03): challenge description edited to correct errors in the numbers of variants and eQTLs in the dataset files; no changes to the dataset files
5 Nov 2015 (v04): updated sample dataset with edits to two variants
12 Nov 2015 (v05): improved validation script provided
17 Nov 2015 (v06): oligonucleotide design for InDel variants clarified
8 Jan 2016 (v07): answer key provided
18 Mar 2016 (v08): predictions provided
7 Apr 2016 (v09): conference presentations provided
12 Apr 2016 (v10): assessment provided