Personal genome project (PGP) - Predict individuals' phenotypes

Dataset description: public

Prediction datasets: public

The CAGI submission deadline for the PGP challenge has passed as of October 31, 2011, 3PM EDT. We welcome the upload of predictions after the deadline for archival and comparison purposes, but these post-deadline submissions are not part of the CAGI experiment.


The first ten genomes of participants in the PGP project are publicly available at genomes. The public profiles of these individuals are available on the same page; follow the link under "Get report" for each participant (PGP1-PGP10). The old profiles, which are no longer updated, are located at:

Prediction challenge

Submit predictions of the phenotypes for the 10 PGP individuals (PGP1-PGP10). For binary traits, submit the probability of a person having the phenotype. For the numerical traits, submit the numerical mean and standard deviation. We also welcome predictors to name additional phenotypes that they might discover from the genomes, including rare diseases. We will ask the PGP participants to report their phenotypes for these traits when possible.

Additional challenges (added 3 September 2011): To give CAGI participants a different way of analyzing the PGP10 data, the organizers decided to release the following three new challenges:

  • One of the PGP10 individuals has irritable bowel syndrome. Predict who that individual is.
  • One of the PGP10 individuals is color-blind. Predict who he is.
  • One of the PGP10 individuals is not color-blind but she has a color-blind father and an affected son. Predict who that individual is.

The prediction submission for these additional challenges will be the probability of a PGP individual having the specific trait. If submitting probabilities for more than one individual per trait, then the probabilities should sum up to 1.
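As a minimal sketch of the sum-to-1 rule above, assume a predictor starts from non-negative raw scores for the individuals they want to rank. The function name `normalize_scores` and the example scores are hypothetical and not part of the CAGI templates. The scores could then be rescaled in Python like this:

```python
def normalize_scores(scores):
    """Rescale non-negative raw scores so that they sum to 1.

    `scores` maps a PGP ID (e.g. "PGP3") to a raw score for one trait.
    This helper is illustrative only; CAGI does not prescribe it.
    """
    if any(value < 0 for value in scores.values()):
        raise ValueError("raw scores must be non-negative")
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one raw score must be positive")
    return {pgp_id: value / total for pgp_id, value in scores.items()}


# Hypothetical raw scores for one trait (not real predictions).
raw = {"PGP1": 2.0, "PGP4": 1.0, "PGP7": 1.0}
probs = normalize_scores(raw)
```

Dividing by the total keeps the ranking of individuals unchanged while meeting the requirement that the submitted probabilities for a trait sum to 1.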


Genomic data for PGP1-PGP10

The list of phenotypes for the PGP Challenge

Statement on selecting the phenotypes for the PGP challenge

The list of phenotypes was built to include both largely Mendelian diseases and more difficult “complex” disorders. Several of these have significant environmental components, and many phenotypes are difficult or impossible to predict with current scientific knowledge. This was a conscious decision, made with the aim of provoking discussion on how the phenotypes should be selected in further rounds of CAGI. Together, the phenotypes provide a measure for assessing the current ability to predict a broad range of phenotypes. They also provide a baseline for long-term progress in predictive ability.

We are asking participants to predict the probability of each of the PGP10 individuals having any of the binary phenotypes on our list, and to predict a numeric value for the continuous characters (e.g., LDL) along with a confidence interval. We also invite predictions on any additional genetic phenotypes.

Most of these phenotypes have not yet been made public by the PGP10 on the project website. As predictors make their predictions, the PGP10 participants will be asked to provide these phenotypes so that assessors can use them to determine the accuracy of the predictions. If a phenotype is already listed in the public profile of an individual (e.g., migraine), the prediction of that phenotype for that individual will not be assessed. However, we will assess predictions of such a phenotype for other individuals, as long as the phenotype is not listed in their public profiles.

At the CAGI workshop, we will present the assessment of how well the phenotypes were predicted, based on the answers from PGP10. Most importantly, however, we will encourage a discussion on what kind of phenotypes should be included in the next challenge. For example, some of the phenotypes in the CAGI list are periodic, and currently we do not specify whether these should be measured at a single time point or over the whole lifetime (in which case they may not have manifested yet). Another question is whether the signal in the PGP10 sample of 10 genomes is large enough to predict any of the rarer diseases.

The PGP10 challenge was introduced in CAGI 2010, where we received one prediction submission for this dataset. At the CAGI 2010 workshop, it was decided to re-open the PGP10 challenge for CAGI 2011, and therefore, the phenotypes that we obtained from PGP10 participants during autumn 2010 remain unpublished.

We look forward to your ideas and comments on how to select predictable and assessable phenotypes for the next CAGI challenge.

Prediction submission format

The prediction submission is a tab-delimited text file. For each of the two PGP challenges, the organizers provide a separate template file, which should be used for submission. In addition, validation scripts are provided, and predictors should check that the format is correct before submitting their predictions.

Instructions for the original PGP challenge submission:

Download file template for original PGP challenge

Download PGP Challenge validation script (not available).

In the submitted file for the original PGP challenge, each row should include the following columns:

  • PGP ID - This should be the ID of the individual (PGP1-PGP10).
  • Prediction-Trait1 - For phenotypes 1-32, provide the probability, which should be within the range 0-1; for phenotypes 33-40, provide the numerical mean.
  • Standard deviation - This defines the confidence of the prediction in column 2. High SD means low confidence, while small SD means that the predictor is confident about the submitted prediction.

      ... Repeat columns 2-3 for all 40 phenotypes. The order of the phenotypes is from the CAGI phenotype list.

NOTE: Predictions for phenotypes outside of the CAGI list should be submitted after the listed phenotypes for each PGP individual (from column 82 onwards). Name each additional phenotype in the column header, then submit the prediction for each PGP individual in the Prediction column and the SD in the following column. No empty cells are allowed.
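The column layout described above (one ID column followed by 40 prediction/SD pairs, 81 columns in total) can be checked with a short script. The following Python sketch is not the official CAGI validation script. The function name `check_original_row`, the treatment of "*" as an allowed placeholder, and the requirement that SD values be non-negative are assumptions made for illustration.

```python
import math

N_PHENOTYPES = 40   # phenotypes on the CAGI list
N_BINARY = 32       # phenotypes 1-32 are probabilities in the range 0-1
VALID_IDS = {f"PGP{i}" for i in range(1, 11)}
PLACEHOLDER = "*"   # assumed to mark "no prediction", as in the template


def check_original_row(fields):
    """Return a list of problems found in one data row (empty list = OK)."""
    expected = 1 + 2 * N_PHENOTYPES  # ID column + 40 prediction/SD pairs = 81
    if len(fields) < expected:
        return [f"expected at least {expected} columns, found {len(fields)}"]
    problems = []
    if fields[0] not in VALID_IDS:
        problems.append(f"unknown PGP ID: {fields[0]!r}")
    for trait in range(1, N_PHENOTYPES + 1):
        # Trait t: prediction at index 2t-1, SD at index 2t (0-based).
        for label, col in (("prediction", 2 * trait - 1), ("SD", 2 * trait)):
            cell = fields[col]
            if cell == PLACEHOLDER:
                continue
            try:
                value = float(cell)  # fails for empty or non-numeric cells
            except ValueError:
                problems.append(f"trait {trait} {label}: not a number: {cell!r}")
                continue
            if not math.isfinite(value):
                problems.append(f"trait {trait} {label}: not finite")
            elif label == "prediction" and trait <= N_BINARY and not 0.0 <= value <= 1.0:
                problems.append(f"trait {trait}: probability outside 0-1")
            elif label == "SD" and value < 0:
                problems.append(f"trait {trait}: negative SD")
    return problems


# Hypothetical row: 32 probabilities and 8 numerical means, each with an SD.
example_row = ["PGP1"] + ["0.5", "0.1"] * 32 + ["120", "15"] * 8
example_problems = check_original_row(example_row)
```

A row would typically be obtained by splitting a line of the submission file on tab characters. Columns from 82 onwards (additional phenotypes) are not inspected by this sketch.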

Instructions for the PGP Additional Challenges submission:

Download file template for PGP Additional Challenges

Download PGP Additional Challenges validation script (not available)

In the submitted file for the PGP Additional challenges, each row should include the following columns:

  • PGP ID - This should be the ID of the individual (PGP1-PGP10).
  • Prediction-Trait1 - Provide the probability of the PGP individual having the trait "irritable bowel syndrome". The probability should be a number within the range 0-1, and the sum of probabilities in column 2 should be "1".
  • Prediction-Trait2 - Provide the probability of the PGP individual having the trait "color-blind". The probability should be a number within the range 0-1, and the sum of probabilities in column 3 should be "1".
  • Prediction-Trait3 - Provide the probability of the PGP individual having the trait "not color-blind but has a color-blind father and an affected son". The probability should be a number within the range 0-1, and the sum of probabilities in column 4 should be "1".

In the template files, cells in columns 2 onwards are marked with an "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for a phenotype, leave the sign "*" in these cells. Please make sure you follow the submission guidelines strictly.
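The rules for the Additional Challenges file (probabilities within 0-1, each trait column summing to 1, "*" left where no prediction is made) can be sketched as a small check in Python. This is not the official validation script. The function name `check_additional`, the assumption that the first row of the file is a header, and the tolerance used when comparing the sum to 1 are all illustrative choices.

```python
import csv
import io
import math


def check_additional(text, tol=1e-6):
    """Return a list of problems in an Additional Challenges submission."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    data = rows[1:]  # assumption: the first row is a header row
    problems = []
    for n, row in enumerate(data, start=2):
        if len(row) != 4:  # PGP ID + three trait columns
            problems.append(f"line {n}: expected 4 columns, found {len(row)}")
    if problems:
        return problems
    for col in (1, 2, 3):  # 0-based indices of the three trait columns
        values = []
        for n, row in enumerate(data, start=2):
            cell = row[col]
            if cell == "*":  # placeholder: no prediction for this cell
                continue
            try:
                value = float(cell)
            except ValueError:
                problems.append(f"line {n}, column {col + 1}: not a number: {cell!r}")
                continue
            if not 0.0 <= value <= 1.0:
                problems.append(f"line {n}, column {col + 1}: outside 0-1")
            values.append(value)
        # Only check the sum when at least one prediction was submitted.
        if values and not math.isclose(sum(values), 1.0, abs_tol=tol):
            problems.append(
                f"column {col + 1}: probabilities sum to {sum(values):g}, expected 1"
            )
    return problems


# Hypothetical two-row file; the header names are placeholders.
sample = "PGP ID\tT1\tT2\tT3\nPGP1\t0.5\t1\t*\nPGP2\t0.5\t0\t*\n"
sample_problems = check_additional(sample)
```

A column left entirely as "*" is skipped, which matches the instruction to keep the placeholder when no prediction is submitted for a phenotype.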

In addition, your submission should include a detailed description of the method used to make the predictions (similar to the style of the Methods section in a scientific article). This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions through the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Dataset provided by

George Church, Harvard Medical School.

Phenotypes proposed by CAGI organizers in consultation with George Church.