Crohn's disease (CD [MIM 266600]) is a chronic inflammatory bowel disease (IBD) characterized by relapsing inflammation that can involve any part of the gastrointestinal tract and may be accompanied by extra-intestinal manifestations. It is caused by a complex interplay between an overactive immune system and environmental triggers in genetically susceptible individuals. Results from twin and familial aggregation studies [1], as well as evidence from genome-wide association studies [2,3], have shown that genetic factors play an important role in CD etiology. The disease burden of CD is high, and the disease is often associated with complications such as intestinal ulcers, fistulas, and an increased risk of colon cancer. Accumulating evidence suggests that immune tolerance to normal intestinal bacteria is disturbed in genetically susceptible individuals [4], leading to a pathogenic inflammatory response. NOD2 (nucleotide-binding oligomerization domain 2; alias CARD15) was the first susceptibility gene identified for Crohn's disease and is involved in the innate immune response directed against components of the bacterial cell wall [5,6]. To date, 163 genetic susceptibility loci have been identified for IBD, with 30 loci exclusive to CD, 23 exclusive to ulcerative colitis (UC), and 110 shared by the two [3]. Early-onset cases of IBD, defined by an age of onset before 10 years, often show a more severe disease course with a higher risk of complications, and genetic factors likely play a larger role in these individuals.

The exome sequence variants provided here were collected in Germany from unrelated CD patients and unrelated healthy individuals. Some of the CD patients developed the disease before the age of 10, corresponding to early-onset IBD [7].

Prediction challenge

Using the provided exome data, identify which individuals have Crohn's disease and which individuals are healthy. In addition, identify which of the patients developed CD before age 10. Data from the Crohn's disease challenges in CAGI 2011 and CAGI 2013 may be used for training. Age-of-onset data for these previous challenges are also provided below.

Dataset description

Exome sequencing was performed using the TruSeq exome enrichment kit (Illumina) and the Illumina HiSeq 2000 instrument. Reads were mapped to the human genome build hg19, and variants were called for all 111 exomes together using the Genome Analysis Toolkit (GATK version 3.3-0) HaplotypeCaller. Variant calls were restricted to the TruSeq exome target regions. GATK was also used for variant quality score recalibration, and only high-quality variants passing the filters were retained. Further details of data processing are available in the header of the provided VCF file.
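
As a starting point for working with the joint call set, the following minimal sketch counts, for each individual, the sites at which that individual carries at least one non-reference allele. It assumes only the standard multi-sample VCF layout (a "#CHROM" header line followed by one column per sample, with a GT entry in the FORMAT field). The function name count_alt_genotypes and the demo records are illustrative and are not part of the challenge data. The FILTER check is a precaution; according to the description above, the provided file should already contain only passing variants.

```python
import re


def count_alt_genotypes(vcf_lines):
    """Count, per sample, the PASS sites with at least one non-reference allele."""
    samples, counts = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at the 10th column
            counts = {name: 0 for name in samples}
            continue
        if fields[6] not in ("PASS", "."):
            continue  # precaution: ignore records that failed a filter
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_index = fmt.index("GT")
        for name, call in zip(samples, fields[9:]):
            alleles = re.split(r"[/|]", call.split(":")[gt_index])
            # "0" is the reference allele, "." is a missing call
            if any(allele not in ("0", ".") for allele in alleles):
                counts[name] += 1
    return counts


# Made-up records for illustration only (not taken from the challenge file).
demo = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:20\t0/0:18",
    "1\t200\t.\tC\tT\t50\tLowQual\t.\tGT:DP\t1/1:20\t0/1:18",
    "1\t300\t.\tG\tA\t50\tPASS\t.\tGT:DP\t./.:0\t1/1:25",
]
print(count_alt_genotypes(demo))
```

For the real file, the same function can be applied to an open file handle, since it only iterates over lines. Per-gene or per-variant feature tables for training would need additional annotation that is not sketched here.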

Prediction submission format

The prediction submission is a tab-delimited text file. The organizers provide a template file, which should be used for submission. In addition, a validation script will be provided, and predictors should use it to check that the format is correct before submitting their predictions.

In the submitted file, each row should include the following tab-separated fields:

In the template file, cells in columns 2-6 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You may enter predictions for the disease status subchallenge, the disease onset subchallenge, or both. If you choose not to enter predictions for a subchallenge, leave the "*" in those cells. Within a subchallenge, you must enter a prediction and a standard deviation for every individual; if you are not confident in a prediction for an individual, enter a large standard deviation for that prediction. Optionally, enter brief comments indicating the basis of the predictions; otherwise, leave the "*" in these cells. Please follow the submission guidelines strictly.
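
The rules above can be checked before submission with a few lines of code. The sketch below is not the organizers' validation script, and it rests on an assumed column layout that should be confirmed against the template: column 1 holds the individual's identifier, columns 2-3 the disease status prediction and its standard deviation, columns 4-5 the disease onset prediction and its standard deviation, and column 6 the comment. It also assumes that the file has no header row and that predictions and standard deviations are numeric. The name check_submission is illustrative.

```python
PLACEHOLDER = "*"
N_COLUMNS = 6
# Assumed layout (0-based indices); confirm against the official template.
SUBCHALLENGES = {"disease status": (1, 2), "disease onset": (3, 4)}


def check_submission(text):
    """Return a list of problems found in a tab-delimited submission."""
    problems = []
    rows = [line.split("\t") for line in text.splitlines() if line != ""]

    # Every row needs the expected number of fields and no empty cells.
    for i, row in enumerate(rows, start=1):
        if len(row) != N_COLUMNS:
            problems.append(f"row {i}: expected {N_COLUMNS} fields, found {len(row)}")
        elif any(cell.strip() == "" for cell in row):
            problems.append(f"row {i}: empty cell")
    if problems:
        return problems

    # A subchallenge is either skipped entirely or filled in for everyone.
    for name, cols in SUBCHALLENGES.items():
        if all(row[c] == PLACEHOLDER for row in rows for c in cols):
            continue  # subchallenge not entered
        for i, row in enumerate(rows, start=1):
            if any(row[c] == PLACEHOLDER for c in cols):
                problems.append(f"{name}: row {i} lacks a prediction or standard deviation")
                continue
            for c in cols:
                try:
                    float(row[c])
                except ValueError:
                    problems.append(f"{name}: row {i}, column {c + 1} is not numeric")
    return problems


# Made-up example: only the disease status subchallenge is entered.
example = "S1\t0.9\t0.1\t*\t*\t*\nS2\t0.2\t0.3\t*\t*\t*\n"
print(check_submission(example))
```

An empty list means that no problem was detected under these assumptions. The validation script provided by the organizers remains the authoritative check.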

In addition, your submission should include a detailed description of the method used to make the predictions, similar in style to the Methods section of a scientific article. This description should be submitted as a separate file.

