Mouse exomes dataset - predict the causative mutation of a given phenotype

Dataset description: public

Exome sequencing data: registered users only, limited by CAGI Data Use Agreement

The CAGI submission deadline for the mouse exomes challenge has passed as of September 1, 2011. We welcome the upload of predictions after the deadline for archival and comparison purposes, but these post-deadline submissions are not part of the CAGI experiment. 

The experimental results for this dataset have been published: Fairfield et al., Mutation discovery in mice by whole exome sequencing. Genome Biology 2011, 12:R86. doi:10.1186/gb-2011-12-9-r86 

Background 

The exomes of four mice with four different phenotypes have been sequenced. The phenotypes are:

  • L11Jus74 - phenotype: perinatal lethality (recessive, maintained in heterozygotes; ENU-induced)
  • Sofa - phenotype (dominant, spontaneous)
  • Frg - phenotype (recessive, spontaneous)
  • Stn - phenotype (recessive, spontaneous)

Prediction challenge

Using the provided exome sequencing data, please predict the causative mutation(s) for each of the four mouse phenotypes listed above. Predictions for phenotypes 1-2 (L11Jus74 and Sofa) will be compared with the mutations mapped using traditional crosses. For phenotypes 3-4 (Frg and Stn), no variants were identified in the mapped area, so predictions for these phenotypes will only be compared with each other. The dataset providers are keen to pursue variants discovered by CAGI participants.

Download dataset: The dataset file is available to registered users only. Please register on the site to access it.

This packaged file contains the indel and SNP files in VCF format for each of the four mice. The phenotype of the mouse is indicated in the filename, and the package should contain a total of 8 files.
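As a starting point for working with these files, the sketch below reads the fixed leading columns of VCF data lines (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) using only the Python standard library. The names `VcfRecord` and `parse_vcf_lines` are hypothetical, and the sample record in the usage note is made up for illustration; it is not taken from the dataset.

```python
from typing import Iterable, Iterator, List, NamedTuple


class VcfRecord(NamedTuple):
    """Minimal view of one VCF data line (hypothetical helper type)."""
    chrom: str
    pos: int
    ref: str
    alt: List[str]
    qual: str
    filter: str


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[VcfRecord]:
    """Yield one record per data line, skipping meta and header lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Lines starting with '#' are VCF meta-information or the column header.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 columns, found {len(fields)}")
        chrom, pos, _id, ref, alt, qual, filt = fields[:7]
        # ALT may list several alleles separated by commas.
        yield VcfRecord(chrom, int(pos), ref, alt.split(","), qual, filt)
```

With a file from the package, this could be used as `with open(path) as handle: records = list(parse_vcf_lines(handle))`, where `path` is whichever of the 8 files you are inspecting.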

Prediction submission format

The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

In the submitted file, each row should include the following columns:

  • Phenotype - This column should include a number between 1-4. The phenotypes are numbered as above: 1: L11Jus74, 2: Sofa, 3: Frg, and 4: Stn. For each phenotype, a maximum of 100 variants may be submitted.
  • Variant - Please submit the variant in HGVS format.
  • Prediction - This column should include a value between 0-1. For each phenotype, the values in column 3 ("Prediction") should sum up to 1. If you submit 15 variants for phenotype "1", then the sum of the probabilities for these 15 variants should be 1, e.g. you could have 5 variants with p=0.1 and 10 variants with p=0.05.
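One way to satisfy the sum-to-1 rule is to rank variants by whatever raw score your method produces and then normalize. The sketch below assumes non-negative scores; the function name `to_submission_rows` and the variant labels in the usage note are hypothetical placeholders (real submissions need HGVS strings), and this is not part of the organizers' tooling.

```python
from typing import Dict, List, Tuple


def to_submission_rows(
    phenotype: int, scores: Dict[str, float], max_variants: int = 100
) -> List[Tuple[int, str, float]]:
    """Turn raw per-variant scores into (phenotype, variant, probability) rows."""
    if phenotype not in (1, 2, 3, 4):
        raise ValueError("phenotype must be 1, 2, 3, or 4")
    if any(score < 0 for score in scores.values()):
        raise ValueError("scores must be non-negative")
    # Keep only the highest-scoring variants, up to the per-phenotype limit.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:max_variants]
    total = sum(score for _, score in ranked)
    if total <= 0:
        raise ValueError("at least one variant needs a positive score")
    # Dividing by the total makes the kept probabilities sum to 1.
    return [(phenotype, variant, score / total) for variant, score in ranked]


def format_rows(rows: List[Tuple[int, str, float]]) -> str:
    """Render rows as tab-delimited text, one row per line."""
    return "\n".join(f"{ph}\t{variant}\t{prob:.6f}" for ph, variant, prob in rows)
```

For instance, 5 placeholder variants with score 2.0 and 10 with score 1.0 normalize to 5 variants with p=0.1 and 10 with p=0.05, matching the example in the Prediction column description.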

In the template file, cells to be filled in are marked with an "*". Submit your predictions by replacing each "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for a specific phenotype, leave the "*" in those cells. Please follow the submission guidelines strictly.
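The rules above can be checked locally before running the organizers' validation script. The sketch below is not that script; it is a minimal illustration that assumes three tab-separated columns per row, no header row, and that an unfilled row keeps "*" in both the Variant and Prediction cells. The function name `check_submission` and the tolerance `tol` are hypothetical choices.

```python
from collections import defaultdict
from typing import List


def check_submission(text: str, tol: float = 1e-6) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors: List[str] = []
    totals = defaultdict(float)  # sum of predictions per phenotype
    counts = defaultdict(int)    # number of submitted variants per phenotype
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        cells = [cell.strip() for cell in line.split("\t")]
        if len(cells) != 3:
            errors.append(f"line {n}: expected 3 columns, found {len(cells)}")
            continue
        phenotype, variant, prediction = cells
        if "" in cells:
            errors.append(f"line {n}: empty cells are not allowed")
            continue
        if phenotype not in {"1", "2", "3", "4"}:
            errors.append(f"line {n}: phenotype must be 1-4, got {phenotype!r}")
            continue
        if variant == "*" and prediction == "*":
            continue  # placeholder row: no prediction for this phenotype
        if "*" in (variant, prediction):
            errors.append(f"line {n}: row is only partly filled in")
            continue
        try:
            p = float(prediction)
        except ValueError:
            errors.append(f"line {n}: prediction {prediction!r} is not a number")
            continue
        if not 0 <= p <= 1:
            errors.append(f"line {n}: prediction {p} is outside 0-1")
            continue
        counts[phenotype] += 1
        totals[phenotype] += p
    for phenotype in sorted(counts):
        if counts[phenotype] > 100:
            errors.append(f"phenotype {phenotype}: more than 100 variants")
        if abs(totals[phenotype] - 1.0) > tol:
            errors.append(f"phenotype {phenotype}: predictions sum to {totals[phenotype]:.6f}")
    return errors
```

A check like this only catches formatting problems; it does not verify that the Variant column holds valid HGVS strings.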

In addition, your submission should include a detailed description of the method used to make the predictions (similar in style to the Methods section of a scientific article). This description should be submitted as a separate file. Submit your predictions through the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Dataset provided by

Jay Shendure, University of Washington

Monica Justice, Baylor College of Medicine, and Timothy Cox, Seattle Children's Hospital