Distinguish between exomes of Crohn's disease patients and healthy individuals

Dataset description: public

Exome sequencing data: registered users only, limited by CAGI Data Use Agreement

The CAGI submission deadline for the Crohn's disease challenge passed on October 26, 2011, at 3 PM EDT. Predictions uploaded after the deadline are welcome for archival and comparison purposes, but they are not part of the CAGI experiment.

Background 

The dataset consists of exome sequence data (in VCF format) from Crohn's disease patients and healthy individuals.

Prediction challenge

With the provided exome sequencing data, identify which individuals have Crohn's disease and which individuals are healthy.

Download dataset: The dataset file is available only to registered users; please register on the site to access it.

The zipped file contains the VCF files of a total of 56 individuals. The files are randomly numbered from 1 to 56.

Sequencing was done using Illumina instruments. Basecalling was performed with the pipeline: BWA -> Picard (duplicate removal) -> Samtools Pileup. The chromosomal positions in the VCF files follow human genome build hg18.
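As a starting point for working with the per-individual files, the following minimal Python sketch iterates over the data records of one VCF file. It assumes the files follow the standard tab-separated VCF layout, in which the first five columns are CHROM, POS, ID, REF and ALT; the function name iter_vcf_records is hypothetical, and the exact header and INFO contents of the CAGI files are not described here, so they are left untouched.

```python
def iter_vcf_records(lines):
    """Yield one dict per VCF data record from an iterable of text lines.

    Assumption: standard tab-separated VCF with CHROM, POS, ID, REF, ALT
    as the first five columns. Header lines start with '#'.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"malformed VCF record: {line!r}")
        chrom, pos, variant_id, ref, alt = fields[:5]
        yield {
            "chrom": chrom,
            "pos": int(pos),        # 1-based position (hg18 coordinates in this dataset)
            "id": variant_id,
            "ref": ref,
            "alt": alt.split(","),  # several alternate alleles are comma-separated
            "rest": fields[5:],     # remaining columns, left unparsed
        }


def count_variants(lines):
    """Count the data records in one VCF file."""
    return sum(1 for _ in iter_vcf_records(lines))
```

For a file on disk, the functions can be applied to an open text handle, for example count_variants(open(path)), where path points to one of the 56 unzipped files.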

Prediction submission format

The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

Download Crohn's Disease submission template

Download Crohn's Disease validation script (not available)

In the submitted file, each row should include the following columns:

  • ID - The ID number of the individual
  • Prediction - The probability that this individual has Crohn's disease (the probability should be a value between 0 and 1, 0 meaning healthy and 1 meaning Crohn's patient)
  • Standard deviation - The confidence of the prediction in column 2. A high SD means low confidence, while a low SD means that the predictor is confident about the submitted prediction.

In the template file, cells in columns 2-3 are marked with an "*". Submit your predictions by replacing each "*" with your value. No empty cells are allowed in the submission; if you cannot submit a prediction for an individual, leave the "*" in these cells. Please follow the submission guidelines strictly.
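The format rules above can be checked locally before uploading. The sketch below is not the organizers' validation script; it is an illustrative check written only from the rules stated on this page. It assumes one row per individual with IDs 1 to 56, exactly three tab-separated cells per row, and no header row unless has_header is set (whether the template has a header is not stated here). The names validate_submission and _as_float are hypothetical.

```python
import math


def _as_float(text):
    """Return text as a finite float, or None if it is not one."""
    try:
        value = float(text)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def validate_submission(lines, n_individuals=56, has_header=False):
    """Return a list of error messages; an empty list means no problem was found."""
    errors = []
    seen_ids = set()
    rows = list(lines)
    if has_header:  # assumption: the template may or may not start with a header row
        rows = rows[1:]
    for row_no, raw in enumerate(rows, start=1):
        cells = [c.strip() for c in raw.rstrip("\r\n").split("\t")]
        if len(cells) != 3:
            errors.append(f"data row {row_no}: expected 3 columns, found {len(cells)}")
            continue
        ind_id, pred, sd = cells
        if not ind_id or not pred or not sd:
            errors.append(f"data row {row_no}: empty cells are not allowed")
            continue
        if not ind_id.isdigit() or not 1 <= int(ind_id) <= n_individuals:
            errors.append(f"data row {row_no}: ID {ind_id!r} is not in 1..{n_individuals}")
        elif int(ind_id) in seen_ids:
            errors.append(f"data row {row_no}: duplicate ID {ind_id}")
        else:
            seen_ids.add(int(ind_id))
        if pred != "*":  # "*" marks a prediction that was not submitted
            p = _as_float(pred)
            if p is None or not 0.0 <= p <= 1.0:
                errors.append(f"data row {row_no}: prediction {pred!r} is not in [0, 1]")
        if sd != "*":
            s = _as_float(sd)
            if s is None or s < 0.0:
                errors.append(f"data row {row_no}: SD {sd!r} is not a non-negative number")
    missing = sorted(set(range(1, n_individuals + 1)) - seen_ids)
    if missing:
        errors.append(f"missing IDs: {missing}")
    return errors
```

A typical use is to read the file with open(path) and print each message returned by validate_submission; an empty result only means that the checks in this sketch passed, not that the official validation would pass.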

In addition, your submission should include a detailed description of the method used to make the predictions (similar in style to the Methods section of a scientific article). This description is submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions through the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Dataset provided by

Andre Franke, Christian-Albrechts-Universität zu Kiel