
Distinguish between exomes of Crohn's disease patients and healthy individuals

Challenge: Crohn's exomes

Dataset description: public

Variant data: registered users only, limited by CAGI Data Use Agreement

Last updated: 14 April 2016

This challenge closed at 9:00 PM PST (Pacific Standard Time) on 9 December 2015.

Download answer key, predictions, and assessment: these files are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the files.

Presentations from the CAGI 4 conference: the presentations are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the files.


Crohn's disease (CD [MIM 266600]) is a chronic inflammatory bowel disease (IBD) characterized by relapsing inflammation that can involve any part of the gastrointestinal tract and by extra-intestinal manifestations. It is caused by the complex interplay between an overly active immune system and environmental triggers in genetically susceptible individuals. Results from twin and familial aggregation studies [1], as well as evidence from genome-wide association studies [2,3], have shown that genetic factors play an important role in CD etiology. The disease burden of CD is high, and the disease is often associated with complications such as intestinal ulcers, fistulas, and an increased risk of colon cancer. Accumulating evidence suggests that immune tolerance to normal intestinal bacteria is disturbed in genetically susceptible individuals [4], leading to a pathogenic inflammatory response. NOD2 (nucleotide-binding oligomerization domain 2; alias, CARD15) was the first susceptibility gene identified for Crohn's disease and is involved in the innate immune response directed against components of the bacterial cell wall [5,6]. To date, 163 genetic susceptibility loci have been identified for IBD, with 30 loci exclusive to CD, 23 exclusive to ulcerative colitis (UC), and 110 shared by the two [3]. Early-onset cases of IBD, with onset before age 10, often show a more severe disease course with a higher risk of complications, and genetic factors likely play a larger role in these individuals.

The exome sequence variants provided here were collected in Germany from unrelated CD patients and unrelated healthy individuals. Some of the CD patients developed the disease before their 10th year of life, corresponding to early-onset IBD [7].

Prediction challenge

With the provided exome data, identify which individuals have Crohn's disease and which individuals are healthy. In addition, identify which of the patients developed CD before age 10. Data from the Crohn's disease challenges in CAGIs 2011 and 2013 may be used for training. In addition, we provide age of onset data for the previous challenges, below.

Dataset description

Exome sequencing was performed using the TruSeq exome enrichment kit (Illumina) and the Illumina HiSeq2000 instrument. Reads were mapped to the human genome build hg19, and variants were called for all 111 exomes together using the Genome Analysis Toolkit (GATK version 3.3-0) HaplotypeCaller. Variant calls were restricted to the TruSeq exome target. GATK was also used for variant quality score recalibration, and only high-quality variants passing the filters were retained. Further details of data processing are available in the header of the provided VCF file.
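As a starting point for working with the jointly called variants, the following minimal sketch counts, for each sample, the sites at which that sample carries at least one non-reference allele. It assumes a standard multi-sample VCF layout (nine fixed columns followed by one column per sample, with a GT entry in the FORMAT field). The function name, the sample names S1 and S2, and the example records are hypothetical and are not taken from the challenge data. The PASS check is only a safeguard, since the description above states that only variants passing the filters were retained.

```python
from collections import Counter


def count_alt_genotypes(vcf_lines):
    """Count, per sample, the PASS sites with at least one non-reference allele."""
    samples = []
    counts = Counter()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the FORMAT column
            continue
        if fields[6] != "PASS":
            continue  # safeguard: ignore records that did not pass the filters
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":")[gt_index]
            alleles = genotype.replace("|", "/").split("/")
            # "0" is the reference allele and "." is a missing call
            if any(allele not in ("0", ".") for allele in alleles):
                counts[sample] += 1
    return {sample: counts[sample] for sample in samples}


# Hypothetical records, for illustration only
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:20\t0/0:18",
    "1\t200\t.\tC\tT\t10\tLowQual\t.\tGT:DP\t1/1:5\t0/1:7",
    "2\t300\t.\tG\tA\t99\tPASS\t.\tGT:DP\t1/1:30\t./.:0",
]
print(count_alt_genotypes(example))
```

For the full dataset, the same function could be applied to an open file handle, because it only iterates over lines. Per-sample counts like these are one possible input to a classifier; they are not a method prescribed by the organizers.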

Download dataset: This dataset file is available only to registered users. Please log in to access the file.

Download age of onset training data: This dataset file is available only to registered users. Please log in to access the file.

Download submission template: This submission template file is available only to registered users. Please log in to access the file.

Download validation script: This submission validation script is available only to registered users. Please log in to access the file.

Prediction submission format 

The prediction submission is a tab-delimited text file. The organizers provide a file template, which should be used for submission, and a validation script. Predictors should use the script to check the format of their file before submitting their predictions.

In the submitted file, each row should include the following tab-separated fields:

In the template file, cells in columns 2-6 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You may enter predictions for the disease status subchallenge, the disease onset subchallenge, or both. If you choose not to enter predictions for a subchallenge, leave the "*" in those cells. Within a subchallenge, you must enter a prediction and a standard deviation for every individual; if you are not confident in a prediction for an individual, enter a large standard deviation for that prediction. Optionally, enter brief comments indicating the basis of the predictions; otherwise, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.
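The rules above can be checked before running the official validation script. The following minimal sketch is not that script and does not replace it. It assumes a hypothetical column layout that the text above does not spell out: column 1 holds the individual identifier, columns 2-3 hold the disease status prediction and its standard deviation, columns 4-5 hold the onset prediction and its standard deviation, and column 6 holds the comment. It also assumes that predictions are numeric and that the rows passed in are data rows without a header.

```python
def check_submission(lines):
    """Return a list of problems found in tab-delimited submission rows."""
    problems = []
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]

    # Structural checks: six fields per row and no empty cells
    for n, row in enumerate(rows, start=1):
        if len(row) != 6:
            problems.append(f"row {n}: expected 6 fields, found {len(row)}")
        elif any(cell.strip() == "" for cell in row):
            problems.append(f"row {n}: empty cell")
    if problems:
        return problems

    # ASSUMED layout: 0-based (prediction, standard deviation) column pairs
    subchallenges = {"status": (1, 2), "onset": (3, 4)}
    for name, (pred_col, sd_col) in subchallenges.items():
        skipped = [row[pred_col] == "*" and row[sd_col] == "*" for row in rows]
        if all(skipped):
            continue  # subchallenge not entered: every cell keeps its "*"
        # Once a subchallenge is entered, every individual needs both values
        for n, row in enumerate(rows, start=1):
            try:
                float(row[pred_col])
                sd = float(row[sd_col])
            except ValueError:
                problems.append(
                    f"row {n}: {name} needs a numeric prediction and SD"
                )
                continue
            if sd < 0:
                problems.append(f"row {n}: {name} SD must not be negative")
    return problems


# Hypothetical rows, for illustration only
complete = ["P1\t0.9\t0.1\t*\t*\t*", "P2\t0.2\t0.3\t*\t*\tburden score"]
partial = ["P1\t0.9\t0.1\t*\t*\t*", "P2\t*\t*\t*\t*\t*"]
print(check_submission(complete))
print(check_submission(partial))
```

The second example is flagged because the disease status subchallenge is entered for one individual but left as "*" for another, which the rules above do not allow.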

In addition, your submission should include a detailed description of the method used to make the predictions, similar in style to the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link: "All submission forms" from the front page of your group. For more details, please read the FAQ page.


Dataset provided by

Prof. Dr. Andre Franke & Dr. Britt-Sabina Petersen

Institute of Clinical Molecular Biology, Christian-Albrechts-University Kiel, Germany


3 Aug 2015 (v01): initial release 

4 Sep 2015 (v02): challenge close date added 

23 Oct 2015 (v03): age of onset for Crohn's exomes challenges in CAGIs 2011 and 2013 provided

28 Oct 2015 (v04): submission instructions and template updated, validation script provided

7 Nov 2015 (v05): submission deadline extended

12 Nov 2015 (v06): improved validation script provided

18 Dec 2015 (v07): answer key provided

18 Mar 2016 (v08): predictions provided 

14 Apr 2016 (v09): assessment and conference presentations provided