
Distinguish between exomes of bipolar disorder patients and unaffected individuals

Challenge: Bipolar exomes

Dataset description: public

Variant data: registered users only, limited by CAGI Data Use Agreement

Last updated: 14 April 2016

This challenge closed at 9:00 PM PST (Pacific Standard Time) on 25 January 2016.

Download answer key, predictions, and assessment: registered users only, limited by CAGI Data Use Agreement. Please log in to access the files.

Presentations from the CAGI 4 conference: registered users only, limited by CAGI Data Use Agreement. Please log in to access the files.


Bipolar disorder (BD) is a serious mental illness characterized by recurrent episodes of mania and depression, which are syndromes of abnormal mood, thinking, and behavior. It affects 1.0-4.5% of the population [1] and is among the major causes of disability worldwide [2]. Notably, up to 15% of individuals with BD die from suicide [3]. There is overwhelming evidence that genetic factors play a leading role in the etiology of BD [4-9]. Twin studies of BD have yielded heritability estimates that reach up to 90% [10] and are among the highest for any mental disorder [11,12]. Moreover, there is little evidence that non-genetic factors contribute significantly to the risk of BD [13]. However, progress in explaining the genetic contribution to BD has been limited. Nearly two decades of research with linkage and candidate gene studies failed to identify a single confirmed susceptibility gene for BD [10,14].

More recent efforts with genome-wide association studies (GWAS) have had greater success. The most recent report on BD from the Psychiatric Genomics Consortium (PGC), an international consortium of investigators carrying out meta-analyses of existing GWAS on five different psychiatric disorders [15], identified 19 genome-wide significant loci in an analysis of 20,352 BD cases and 31,358 controls. The findings emerging from the PGC provide, for the first time, credible evidence implicating specific genetic loci in the risk of BD. However, the number of genome-wide significant findings reported for BD is considerably smaller than for schizophrenia, for which over 100 genome-wide significant loci have been identified [16]. The difference is notable even considering the larger sample sizes available for analysis in schizophrenia. Moreover, estimates of the total variance in liability to BD that can be explained by all SNPs in the GWAS carried out by the PGC do not exceed 25% [17]. These observations support the hypothesis that rarer variants, which are poorly tagged and largely missed by GWAS, contribute a meaningful proportion of the risk for BD [18-20], and they motivate sequencing studies to identify such variants.

Dataset description

The data providers have carried out an exome sequencing study of BD and provide for this challenge data from 1,000 individuals, including cases and ancestry- and sex-matched controls. NimbleGen SeqCap EZ v2.0 Exome arrays, with ~3.4 Mb of additional custom target covering promoter, UTR, and intronic regions of 1,422 synaptic genes and 60 genes previously associated with BD, were used for target capture (see BED file, Additional information), and samples were sequenced on the Illumina HiSeq2000. Variants were called jointly across all samples using GATK UnifiedGenotyper, following best practices. Only high-quality PASS variants were retained. We further excluded variants with more than 10% missing data or in Hardy-Weinberg disequilibrium at p < 1 × 10^-6, as well as individual genotype calls with read depth < 10 or genotype quality < 20.
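The exclusion criteria above can be illustrated with a minimal Python sketch. It is not the data providers' pipeline. The function names are hypothetical, genotypes are assumed to be coded as alternate-allele counts (0, 1, 2), low-quality calls are assumed to be masked before missingness is computed, and a chi-square test stands in for whichever Hardy-Weinberg test was actually used.

```python
import math

MIN_DEPTH = 10       # calls with read depth below this are treated as missing
MIN_GQ = 20          # calls with genotype quality below this are treated as missing
MAX_MISSING = 0.10   # variants with more than 10% missing calls are dropped
HWE_P_MIN = 1e-6     # variants with a HWE p-value below this are dropped


def mask_call(genotype, depth, gq):
    """Return the genotype (0, 1, or 2 alt alleles), or None if the call fails QC."""
    if genotype is None or depth < MIN_DEPTH or gq < MIN_GQ:
        return None
    return genotype


def hwe_chi2_pvalue(n_ref_hom, n_het, n_alt_hom):
    """Chi-square (1 df) test of Hardy-Weinberg proportions.

    Assumption: an exact test is often preferred for rare variants; the
    chi-square version is used here only to keep the sketch dependency-free.
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        return 1.0
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def keep_variant(calls):
    """calls: list of (genotype, depth, gq) tuples, one per sample."""
    if not calls:
        return False
    masked = [mask_call(g, d, q) for g, d, q in calls]
    missing = sum(1 for g in masked if g is None) / len(masked)
    if missing > MAX_MISSING:
        return False
    counts = [sum(1 for g in masked if g == k) for k in (0, 1, 2)]
    return hwe_chi2_pvalue(*counts) >= HWE_P_MIN
```

Because the released dataset has already been filtered, predictors would not need to repeat these steps. The sketch only makes the thresholds concrete.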

Prediction challenge

With the provided exome data, identify which individuals have BD and which individuals are unaffected. The organizers have divided the dataset into halves: 500 exomes for training, and 500 exomes for the prediction challenge. The 1,000 exomes are provided in a single dataset file. A training set file provides the BD status of 500 individuals: it contains the individual's ID and their disease status (0=unaffected, 1=BD).
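As a rough illustration of how the two files relate, the sketch below separates the 1,000 sample IDs into the 500 with known status and the 500 to be predicted. The file layout is an assumption: the training file is taken to be tab-delimited with the ID in the first column and the status in the second, and the function names are hypothetical.

```python
import csv


def read_training_labels(lines):
    """Parse 'ID<TAB>status' rows into {sample_id: 0 or 1}.

    `lines` is any iterable of strings, such as an open file.
    Rows whose second field is not "0" or "1" (for example, a header) are skipped.
    """
    labels = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2 or row[1] not in ("0", "1"):
            continue
        labels[row[0]] = int(row[1])  # 0 = unaffected, 1 = BD
    return labels


def split_samples(all_ids, labels):
    """Return (training_ids, prediction_ids), preserving dataset order."""
    training = [s for s in all_ids if s in labels]
    prediction = [s for s in all_ids if s not in labels]
    return training, prediction
```

With the real files, `all_ids` would come from the sample columns of the dataset file, and `prediction` would be the individuals whose status must be submitted.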

Download dataset, training data, and DTA: This dataset file is available only to registered users. Please log in to access the file.

Download submission template: This submission template file is available only to registered users. Please log in to access the file.

Prediction submission format 

The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

In the submitted file, each row should include the following tab-separated fields:

In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must enter a prediction and standard deviation for every individual; if you are not confident in a prediction for an individual, enter a large standard deviation for the prediction. Optionally, enter a brief comment indicating the basis of the prediction; otherwise, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.
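Before running the organizers' validation script, a quick local check along these lines may catch the most common mistakes. This is only a sketch and not the official validator. It assumes four tab-separated columns (individual ID, prediction, standard deviation, comment) and no header row, and the function name is hypothetical.

```python
def validate_submission(lines, expected_ids):
    """Return a list of error strings; an empty list means the checks passed.

    Assumed layout: column 1 = individual ID, column 2 = prediction,
    column 3 = standard deviation, column 4 = comment (or "*").
    """
    errors = []
    seen = set()
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            errors.append(f"line {n}: expected 4 fields, found {len(fields)}")
            continue
        sample_id, prediction, std_dev, _comment = fields
        seen.add(sample_id)
        if any(cell.strip() == "" for cell in fields):
            errors.append(f"line {n}: empty cell")
            continue
        # A leftover "*" in either numeric column fails the float conversion.
        for name, cell in (("prediction", prediction),
                           ("standard deviation", std_dev)):
            try:
                value = float(cell)
            except ValueError:
                errors.append(f"line {n}: {name} {cell!r} is not a number")
                continue
            if name == "standard deviation" and value < 0:
                errors.append(f"line {n}: standard deviation is negative")
    missing = set(expected_ids) - seen
    if missing:
        errors.append(f"missing individuals: {sorted(missing)}")
    return errors
```

The comment column is allowed to keep its "*", matching the instruction that the placeholder stays when no comment is given.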

In addition, your submission should include a detailed description of the method used to make the predictions, similar in style to the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link: "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Additional information

Download the target intervals used for extended target capture: This dataset file is available only to registered users. Please log in to access the file.


Data provided by 

Mehdi Pirooznia and Peter Zandi, The Johns Hopkins University 

Richard McCombie, Cold Spring Harbor Laboratory 

James B. Potash, University of Iowa 


4 Nov 2015 (v01): initial release 

11 Nov 2015 (v02): DTA provided as a pdf form with fields 

12 Nov 2015 (v03): improved validation script provided 

17 Dec 2015 (v04): challenge close date extended to 25 Jan 2016 

29 Jan 2016 (v05): answer key provided 

18 Mar 2016 (v06): predictions provided 

14 Apr 2016 (v07): assessment and conference presentations provided