
Discordant monozygotic twins - identify the twin with asthma

Dataset description: public

Whole genome and RNA data: registered users only, limited by CAGI Data Use Agreement

The CAGI submission deadline for the Asthma challenge has passed as of October 6, 2011, 3PM EDT. We welcome the upload of predictions after the deadline for archival and comparison purposes, but these post-deadline submissions are not part of the CAGI experiment.


The dataset includes the whole genomes of 8 pairs of discordant monozygotic twins (16 individuals, randomly numbered from 1 to 16); that is, in each pair of identical twins, one twin has asthma and the other is healthy. In addition, RNA sequencing data are provided for each individual.

Prediction challenge

With the provided whole genome and RNA sequencing data, identify which two individuals are paired, and which one of the two has asthma.
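One possible starting point for the pairing step is sketched below. It relies on the fact that monozygotic twins share nearly identical germline variants, so the two members of a pair should have the most similar variant sets. This is only an illustration, not the organizers' method: the function names, the Jaccard similarity measure, the greedy matching, and the toy data are all assumptions, and loading the actual variant files is not shown.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two variant sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_twins(variants):
    """Greedily pair individuals by variant-set similarity.

    `variants` maps an individual ID to a set of hashable variant keys,
    e.g. (chromosome, position, allele). This layout is hypothetical.
    """
    scored = sorted(
        (
            (jaccard(variants[i], variants[j]), i, j)
            for i, j in combinations(sorted(variants), 2)
        ),
        reverse=True,
    )
    paired, pairs = set(), []
    for _score, i, j in scored:
        # Take the most similar remaining pair whose members are both unpaired.
        if i not in paired and j not in paired:
            pairs.append((i, j))
            paired.update((i, j))
    return pairs


# Toy data only: four made-up individuals, not taken from the dataset.
toy = {
    1: {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")},
    2: {("chr1", 300, "C"), ("chr2", 75, "T"), ("chr3", 10, "A")},
    3: {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G"), ("chr4", 5, "C")},
    4: {("chr1", 300, "C"), ("chr2", 75, "T")},
}
pairs = pair_twins(toy)
print(pairs)
```

Pairing by genotype similarity addresses only the first half of the challenge. Deciding which twin in each pair has asthma would need a separate analysis, for example of the within-pair differences in the RNA data.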

Download dataset: The dataset file is available only to registered users; please register.

NOTE: The file size is 1.1Gb. It may take some time for the download to start. 

The whole genomes (in folder ‘DNA’) were sequenced with Complete Genomics, and a simplified variant file is provided (containing chromosome location and variant calls). The data has an average Gross Mapping Yield of 209.8 GB (ranging from 168-376GB). More detailed description of the format is available at the Complete Genomics website (version 1.10).

The RNA-Seq data (in folder ‘RNA’) are provided as gene-level RPKMs, which were mapped and called using [...]. The libraries were prepared according to Illumina's protocol with a 200bp insert size, and each library was sequenced on 1 lane of a HiSeq 2000.

Added 3 September, 2011: Due to requests from the predictors, we asked the data providers to tell us the source of the data. The sequencing results for the asthma dataset were obtained with DNA and RNA isolated from peripheral blood mononuclear cells. 

Added 14 September, 2011: The type of asthma is "adult asthma". In the RNA-Seq file, 0 means the RPKM for that specific gene is 0, i.e. the gene is not expressed or its expression level is below detection. The ethnicity of the twins is not known.
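For reference, RPKM (reads per kilobase of transcript per million mapped reads) is conventionally computed as 10^9 x C / (N x L), where C is the number of reads mapped to the gene, N the total number of mapped reads, and L the gene length in base pairs. Because a value of 0 is possible, within-pair ratios need some care. The sketch below shows the standard formula and one common workaround, adding a pseudocount before taking a log ratio. The function names and the pseudocount of 1.0 are assumptions made for illustration and are not part of the challenge data or pipeline.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_fold_change(rpkm_a, rpkm_b, pseudocount=1.0):
    """Log2 ratio of two RPKM values.

    The pseudocount (an assumed choice) keeps the ratio defined when
    either value is 0, as can happen in the RNA-Seq file.
    """
    return math.log2((rpkm_a + pseudocount) / (rpkm_b + pseudocount))


# Made-up numbers: 500 reads on a 2,000 bp gene, 25 million mapped reads.
example_rpkm = rpkm(500, 2000, 25_000_000)

# A gene detected in one twin (RPKM 3) and absent in the other (RPKM 0).
example_lfc = log2_fold_change(3.0, 0.0)
print(example_rpkm, example_lfc)
```

The pseudocount compresses differences between lowly expressed genes, so its size is a modelling choice to be stated in the method description.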

Prediction submission format

The prediction submission is a tab-delimited text file. Organizers provide a file template, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions. 

Download Asthma challenge submission template.

Download Asthma challenge validation script (not available).

In the submitted file, each row should include the following columns:

In the template file, cells in columns 2-6 are marked with an "*". Submit your predictions by replacing each "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for a twin pair, leave the "*" in these cells. Make sure you follow the submission guidelines strictly.
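Before uploading, the basic format rules above can be checked locally. The following is a minimal sketch, not the official validation script: the function name is hypothetical, and the expected count of 6 columns is an assumption inferred from the mention of columns 2-6, so adjust it to match the template. It checks only that the file is tab-delimited with the expected number of columns and has no empty cells; a "*" counts as a filled cell.

```python
def check_submission(text, expected_columns=6):
    """Return a list of format problems found in a tab-delimited submission.

    `expected_columns` is an assumed value; set it to match the template.
    """
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append(f"line {n}: blank line")
            continue
        cells = line.split("\t")
        if len(cells) != expected_columns:
            problems.append(
                f"line {n}: expected {expected_columns} columns, found {len(cells)}"
            )
        for c, cell in enumerate(cells, start=1):
            # "*" is allowed (no prediction); only truly empty cells are errors.
            if cell.strip() == "":
                problems.append(f"line {n}, column {c}: empty cell")
    return problems


# Made-up rows for illustration; the values do not reflect the real template.
good = "1\t2\t0.9\t0.1\t0.8\t0.2\n3\t*\t*\t*\t*\t*\n"
bad = "1\t2\t\t0.1\t0.8\t0.2\n"

good_problems = check_submission(good)
bad_problems = check_submission(bad)
print(good_problems, bad_problems)
```

An empty list means no problems were found by these checks. It does not guarantee that the values themselves are valid predictions.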

In addition, your submission should include a detailed description of the method used to make the predictions (similar to the style of the Methods section in a scientific article). This information will be submitted as a separate file. 

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions through the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Dataset provided by 

Michael Snyder, Kari Nadeau and Rui Chen, Stanford University