Distinguish between exomes of Crohn's disease patients and healthy individuals

Dataset description: public
Exome sequencing data: registered users only, limited by CAGI Data use agreement

This challenge closed on 25 April 2013.

Crohn's Challenge answer key (276 KB, zip): registered users only, limited by CAGI Data use agreement

Assessor summary (688 KB, zip): registered users only, limited by CAGI Data use agreement

Slides from the CAGI conference: registered users only, limited by CAGI Data use agreement
      Alexander Morgan: Assessment (8.3 MB, remixable ppt)
      Predrag Radivojac: Predictor Talk (4.5 MB, remixable ppt)
      Pietro Di Lena: Predictor Talk (2.6 MB, remixable ppt)
      Manuel Giollo: Predictor Talk (5.7 MB, remixable ppt)

Predictions (2.1 MB, zip): registered users only, limited by CAGI Data use agreement

Background: Crohn's disease (CD [MIM 266600]), a form of inflammatory bowel disease (IBD), is a complex genetic disorder characterized by chronic relapsing inflammation that can involve any part of the gastrointestinal tract. Evidence from twin and familial aggregation studies [1] and from genome-wide association studies [2-4] suggests that genetic factors play an important role in CD etiology. The disease burden of CD is high, and the disease is often associated with complications such as intestinal ulcers, fistulas, and an increased risk of colon cancer. Accumulating evidence suggests that immune tolerance to normal intestinal bacteria is disturbed in genetically susceptible individuals [5], leading to a pathogenic inflammatory response. NOD2 (nucleotide-binding oligomerization domain 2; alias CARD15) was the first susceptibility gene identified for Crohn's disease. This gene is involved in the innate immune response directed against components of the bacterial cell wall [6,7]. A recent meta-analysis of six CD genome-wide association studies, comprising 6,333 CD patients and 15,056 healthy controls, combined with previously confirmed loci [8], brought the total number of distinct loci associated with CD to 71 [2].

Prediction challenge: With the provided exome data, identify which individuals have Crohn's disease and which individuals are healthy.

Reworked Crohn's disease challenge data is now available (26 Dec 2012):
The dataset file is available only to registered users; please log in to access the dataset.

Please note: We are accepting predictions only for this reworked Crohn's dataset (posted on 26 December 2012).

Dataset description: The reworked Crohn's disease dataset consists of variant calls made for all 66 exomes together, which provides better-quality variant calls and critical information for interpreting relatives. Combined variant calling for all 66 exomes was performed with GATK, using a TruSeq exome BED file. The VQSR method was employed to distinguish true polymorphisms in the samples from artifacts of sequencing, alignment, or data processing. The VCF file contains variants segregated into tranches based on their quality scores, and the variants can be filtered by quality according to the FILTER column in the VCF file. All 66 exomes were captured with the TruSeq exome enrichment kit, and all samples were sequenced using the same protocol. Chromosomal positions in the VCF files follow human genome build hg19.
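Filtering on the FILTER column can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the organizers' procedure: it relies on the VCF convention that FILTER is the seventh tab-separated column and that "PASS" marks a record that passed all filters. The function name, the example records, and the tranche label "HypotheticalLowQualTranche" are made up for the example; the actual tranche labels should be read from the header of the provided VCF file.

```python
import io


def filter_vcf_lines(lines, accepted=("PASS",)):
    """Yield header lines plus records whose FILTER value is in `accepted`.

    Hypothetical helper for illustration; `accepted` can be extended with
    tranche labels taken from the real VCF header.
    """
    accepted = set(accepted)
    for line in lines:
        if line.startswith("#"):
            # Keep meta-information and the column header untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th column (index 6) in the VCF format.
        if len(fields) >= 7 and fields[6] in accepted:
            yield line


# Made-up records, used only to show the behaviour of the helper.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t12\tHypotheticalLowQualTranche\t.\n"
)
kept = list(filter_vcf_lines(example_vcf))
```

Passing additional labels through `accepted` loosens the filter to include lower-confidence tranches, which trades specificity for sensitivity.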

The exome samples are derived from a total of 28 different pedigrees, including 1 monozygous discordant twin pair. Of the 66 exomes, 51 are from Crohn's patients and the remaining 15 are from healthy individuals.

Prediction submission format: The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

Download Crohn's Disease submission template
Download Crohn's Disease validation script

In the submitted file, each row should include the following columns:

  1. ID - The ID number of the individual
  2. Prediction - The probability that this individual has Crohn's disease (the probability should be a value between 0 and 1, 0 meaning healthy and 1 meaning Crohn's patient)
  3. Standard deviation - This defines the confidence of the prediction in column 2. A high SD means low confidence, while a low SD means that the predictor is confident about the submitted prediction.

In the template file, cells in columns 2-3 are marked with an "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for an individual, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.
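The column rules above can be checked locally before running the official validation script. The following is a minimal sketch, not the organizers' script: the function name is hypothetical, the requirement that the standard deviation be non-negative is an assumption, and the sketch checks one data row at a time because it is not known here whether the template contains a header row.

```python
def check_submission_row(row):
    """Return a list of problems found in one tab-delimited submission row.

    Hypothetical helper: expects ID, prediction, and standard deviation,
    where "*" stands for a value that was not submitted.
    """
    cells = row.rstrip("\n").split("\t")
    if len(cells) != 3:
        return ["expected 3 columns, found %d" % len(cells)]
    if any(cell.strip() == "" for cell in cells):
        return ["empty cells are not allowed; leave '*' for a missing value"]

    problems = []
    _, prediction, sd = cells

    if prediction != "*":
        try:
            p = float(prediction)
        except ValueError:
            problems.append("prediction is not a number")
        else:
            # 0 means healthy, 1 means Crohn's patient.
            if not 0.0 <= p <= 1.0:
                problems.append("prediction must be between 0 and 1")

    if sd != "*":
        try:
            s = float(sd)
        except ValueError:
            problems.append("standard deviation is not a number")
        else:
            # Assumption: a standard deviation cannot be negative.
            if not s >= 0.0:
                problems.append("standard deviation must be non-negative")

    return problems
```

For example, `check_submission_row("17\t0.8\t0.1")` returns an empty list, while a row with a prediction of 1.5 returns a single out-of-range message. The ID 17 is made up; real IDs come from the template.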

In addition, your submission should include a detailed description of the method used to make the predictions (similar to the style of the Methods section in a scientific article). This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by following the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Download OLD dataset (released 5 October 2012):
The dataset file is available only to registered users; please log in to access the dataset.

Please note: We are accepting predictions only for the reworked Crohn's dataset (posted on 26 December 2012).


  1. Halme L, Paavola-Sakki P, Turunen U, Lappalainen M, Farkkila M, Kontula K. 2006. Family and twin studies in inflammatory bowel disease. World J Gastroenterol 12:3668-3672.
  2. Franke A, McGovern DP, Barrett JC, Wang K, Radford-Smith GL, Ahmad T, Lees CW, Balschun T, Lee J, Roberts R, et al. 2010. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn's disease susceptibility loci. Nat Genet 42:1118-1125.
  3. Raelson JV, Randall DL, Ruether A, et al. 2007. Genome-wide association study for Crohn's disease in the Quebec Founder Population identifies multiple validated disease loci. Proc Natl Acad Sci USA 104:14747-52.
  4. Ellinghaus D, Ellinghaus E, Nair RP, et al. 2012. Combined analysis of genome-wide association studies for Crohn disease and psoriasis identifies seven shared susceptibility loci. Am J Hum Genet 90:636-647.
  5. Cho JH. 2008. The genetics and immunopathogenesis of inflammatory bowel disease. Nat Rev Immunol 8:458-466.
  6. Hugot JP, Chamaillard M, Zouali H, Lesage S, Cezard JP, Belaiche J, et al. 2001. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease. Nature 411:599-603.
  7. Ogura Y, Bonen DK, Inohara N, Nicolae DL, Chen FF, Ramos R, et al. 2001. A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease. Nature 411:603-606.
  8. Lesage S, Zouali H, Cezard JP, Colombel JF, Belaiche J, Almer S, et al. 2002. CARD15/NOD2 mutational analysis and genotype-phenotype correlation in 612 patients with inflammatory bowel disease. Am J Hum Genet 70:845-857.

Dataset provided by:

Andre Franke, Christian-Albrechts-Universität zu Kiel

Assessment: This challenge is being assessed by Alexander Morgan, Stanford University.