Discordant monozygotic twins – identify the twin with asthma

Dataset description: public
Whole genome and RNA data: registered users only, limited by CAGI Data use agreement

The CAGI submission deadline for the Asthma challenge has passed as of October 6, 2011, 3PM EDT. We welcome the upload of predictions after the deadline for archival and comparison purposes, but these post-deadline submissions are not part of the CAGI experiment.

Background: The dataset includes whole genomes of 8 pairs of discordant monozygotic twins, with the 16 individuals randomly numbered from 1 to 16. In each pair, one twin has asthma and the other is healthy. RNA sequencing data are also provided for each individual.

Prediction challenge: With the provided whole genome and RNA sequencing data, identify which two individuals are paired, and which one of the two has asthma.

Download dataset:

The dataset file is available to registered users only; please register.

NOTE: The file size is 1.1 GB. It may take some time for the download to start.

The whole genomes (in folder ‘DNA’) were sequenced with Complete Genomics, and a simplified variant file is provided (containing chromosome location and variant calls). The data have an average Gross Mapping Yield of 209.8 GB (ranging from 168 to 376 GB). A more detailed description of the format is available at the Complete Genomics website (version 1.10).
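Because monozygotic twins share nearly identical germline genomes, one plausible way to approach the pairing part of the challenge is to compare variant calls between individuals. The following is a minimal sketch only, not the organizers' method: it assumes the variant calls have already been parsed into one set of (chromosome, position, allele) tuples per individual, and the names `jaccard` and `pair_by_similarity` are hypothetical. Parsing of the Complete Genomics variant files is not shown.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two variant sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_by_similarity(variants):
    """Greedily pair individuals by descending variant-set similarity.

    `variants` maps an individual ID to a set of variant keys,
    e.g. (chromosome, position, allele) tuples (an assumed representation).
    """
    scored = sorted(
        ((jaccard(variants[i], variants[j]), i, j)
         for i, j in combinations(sorted(variants), 2)),
        reverse=True,
    )
    paired, pairs = set(), []
    for _score, i, j in scored:
        # Take the most similar remaining pair whose members are both unpaired.
        if i not in paired and j not in paired:
            pairs.append((i, j))
            paired.update((i, j))
    return pairs


# Toy data with made-up variants for four individuals.
toy = {
    1: {("chr1", 100, "A"), ("chr2", 200, "T")},
    2: {("chr1", 300, "G")},
    3: {("chr1", 100, "A"), ("chr2", 200, "T")},
    4: {("chr1", 300, "G"), ("chr3", 5, "C")},
}
toy_pairs = pair_by_similarity(toy)
```

Greedy matching is used here for brevity; with real data, where sequencing errors blur the similarity scores, a more careful matching strategy may be preferable.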

The RNA-Seq data (in folder ‘RNA’) are provided as per-gene RPKM values; reads were mapped and expression was called using DNANexus.com. The libraries were prepared according to Illumina's protocol with a 200bp insert size, and each library was sequenced on 1 lane of a HiSeq 2000.

Added 3 September, 2011: Due to requests from the predictors, we asked the data providers to tell us the source of the data. The sequencing results for the asthma dataset were obtained with DNA and RNA isolated from peripheral blood mononuclear cells.

Added 14 September, 2011: The asthma type is "adult asthma". In the RNA-Seq file, 0 means the RPKM for that specific gene is 0, i.e. the gene is not expressed or its expression level is below detection. The ethnicity of the twins is not known.

Prediction submission format: The prediction submission is a tab-delimited text file. Organizers provide a file template, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions.

Download Asthma challenge submission template
Download Asthma challenge validation script

In the submitted file, each row should include the following columns:

  1. Twin Pair - ID number to define a twin pair (range 1-8)
  2. Twin 1 - Insert here the number of the first individual of a twin pair (range 1-16)
  3. Twin 2 - Insert here the number of the second individual of the twin pair (range 1-16)
  4. Twin with asthma - Insert here the number from column 2 or 3, depending on which individual you predict to have asthma (range 1-16)
  5. Probability - Give here the probability that the individual in column 4 has asthma (the probability should be a value between 0 and 1, 0 meaning healthy and 1 meaning asthma patient)
  6. Standard deviation - This defines the confidence of the prediction in column 5. A high SD means low confidence, while a small SD means that the predictor is confident about the submitted prediction.

In the template file, cells in columns 2-6 are marked with an "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for a twin pair, leave the sign "*" in these cells. Make sure you follow the submission guidelines strictly.
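The column rules above can be sketched as a small format check. This is an illustrative sketch only and is not the official validation script, which must still be run before submitting. It assumes the file has no header row, that each individual may appear in at most one pair, and that the standard deviation is non-negative; the function name `check_submission` is hypothetical.

```python
def check_submission(text):
    """Return a list of format problems in a tab-delimited submission."""
    problems = []
    used = set()  # individuals already assigned to a pair
    for n, line in enumerate(text.splitlines(), start=1):
        cells = [c.strip() for c in line.split("\t")]
        if len(cells) != 6 or "" in cells:
            problems.append(f"row {n}: need 6 non-empty tab-separated cells")
            continue
        pair, values = cells[0], cells[1:]
        if not (pair.isdigit() and 1 <= int(pair) <= 8):
            problems.append(f"row {n}: twin pair ID must be in 1-8")
        if all(v == "*" for v in values):
            continue  # no prediction submitted for this pair
        try:
            twin1, twin2, asthma = (int(v) for v in values[:3])
            prob, sd = float(values[3]), float(values[4])
        except ValueError:
            problems.append(f"row {n}: non-numeric value in columns 2-6")
            continue
        if not (1 <= twin1 <= 16 and 1 <= twin2 <= 16) or twin1 == twin2:
            problems.append(f"row {n}: twins must be two distinct IDs in 1-16")
        if asthma not in (twin1, twin2):
            problems.append(f"row {n}: column 4 must match column 2 or 3")
        if not 0.0 <= prob <= 1.0:
            problems.append(f"row {n}: probability must be between 0 and 1")
        if not sd >= 0.0:
            problems.append(f"row {n}: standard deviation must be >= 0")
        for twin in (twin1, twin2):
            if twin in used:
                problems.append(f"row {n}: individual {twin} already paired")
            used.add(twin)
    return problems


# One filled-in row and one row left as "*" (no prediction).
example = "1\t3\t7\t3\t0.8\t0.1\n2\t*\t*\t*\t*\t*\n"
example_problems = check_submission(example)
```

An empty list means no problems were found by this sketch; it does not guarantee that the official script will accept the file.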

In addition, your submission should include a detailed description of the method used to make the predictions (similar to the style of the Methods section in a scientific article). This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions through the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Dataset provided by:

Michael Snyder, Kari Nadeau and Rui Chen, Stanford University