Personal genome project (PGP) - Predict individuals' phenotypes
Dataset description: public
Prediction dataset: public
This challenge closed on 24 April 2013.
Slides from the CAGI conference: registered users only, limited by CAGI Data Use Agreement
Madeleine Ball: Data Provider Talk (2.8 MB, remixable ppt)
Sean Mooney: Assessment (2.8 MB, remixable ppt)
Rachel Karchin: Predictor Talk (4.6 MB, remixable ppt)
Mary Wahl: Predictor Talk (11 MB, remixable ppt)
The challenge is to submit predictions that match genomes to phenotypic profiles (trait lists).
The phenotypic profiles are lists of self-reported traits from a questionnaire completed by PGP participants. This list has 239 binary traits, which include both Mendelian traits and more difficult “complex” traits. Many of these have significant environmental components, and many individual traits may be difficult or impossible to predict with current scientific knowledge. In addition to the binary traits from the questionnaire, some PGP participants have provided additional phenotypic and genetic information.
Updated 12 Jan 2013: There are 77 genomes currently available, each of which matches a phenotypic profile. In addition, as “decoys,” the challenge includes 214 additional phenotypic profiles that do not match any genome. The challenge is to match each genome to a profile. (When released on 26 Dec 2012 there were 53 genomes, 157 decoys.) There will be no further genomes added to the challenge.
Optional Supplement: An optional, “interest only” supplement will be to predict individual traits from genomes.
In previous CAGI experiments, the PGP challenge was to predict individual phenotypic traits based on a genome; future CAGI experiments may be similar. This year, however, phenotypic profiles of all PGP participants in CAGI are public, and therefore the challenge is to match entire profiles to genomes. The availability of the complete profiles makes it impossible to have a valid assessment of individual trait predictions. Inference of a single readily-predictable trait might identify a matching profile with high confidence; that profile could then be used to “backfill” other traits that would not otherwise have been readily predicted. As such, this year we cannot have an individual trait assessment challenge which meets the CAGI standard of blind prediction from genetic information.
We nevertheless offer this supplement, because predictions of individual traits from genetic information have several areas of potential utility for PGP and CAGI. First and foremost, because phenotypic profiling is imperfect, these predictions may be a service to the PGP participants who donated their genomes. Individuals may be at risk from an undiagnosed disease, or at high risk of developing a particular disease in the future, or have unrecognized traits. Predictors are asked to assign traits on this basis, even when the assignment may differ from current profile information (for example, assigning a high risk to type II diabetes, even though the individual does not currently have the disease). The PGP participants look forward to seeing predictions from their genomes, and learning from them.
In addition, this year’s individual trait predictions may have value as an imperfect baseline for comparison with future CAGI experiments. If the ability to predict some trait is significantly better in the future than this year, it is likely due to improved methods. (If future prediction ability is worse, the comparison would be uninformative, because we cannot tell for certain whether the difference is due to the full-profile availability this year.) Finally, it may be useful to discuss methods for predicting individual traits in the context of these data, even if the results cannot be effectively assessed.
In order to make this supplement as valuable as possible, predictors should not use the complete trait information in making their predictions. We emphasize that because this trait profile information is available to all predictors, this supplement is not a CAGI challenge, and cannot be used for comparison between different groups or to reliably measure the current state-of-the-art in predicting individual traits.
Download genomic challenge data (Updated 20 Jan 2013):
Genomic challenge data (Compressed tar file, Complete Genomics TSV format) (17.3 GB)
Genomic challenge data (Compressed tar file, GFF format) (8.5 GB)
Genomic training data (Compressed tar file, TSV format) (4.8 GB)
Genomic training data (Compressed tar file, GFF format) (2.3 GB)
The above tar files contain all the released challenge and training genomic data. Here is a list of the genomic files contained in the challenge data tar files including the date that the files were released. Here is a list (updated 17 Apr to include a single omitted genome) of the genomic files contained in the training data tar files including the date that the files were released.
The genomic data was obtained by sequencing DNA purified from saliva or blood. Sequencing was performed by Complete Genomics using the 2.0 Standard pipeline.
The first file is a Complete Genomics "var" format file from 2.0 Standard pipeline data, as described on page 20 of the "DataFileFormats Standard Pipeline 2.0." Please note: PGP has posted the complete whole sequence data as provided by the sequencing provider. Current WGS technology typically does not provide specific tandem repeat data.
The second file is a non-standard GFF format file produced by the PGP's GET-Evidence system, provided as-is to assist interpreters. The GET-Evidence interpreter may have flaws in its interpretation of the Complete Genomics file -- the Complete Genomics file is considered correct should the two disagree. GET-Evidence's GFF format is described here.
Download phenotypic profile data (updated 12 Jan 2013):
The above Challenge profile list is larger than and supersedes the originally released list. There are now 291 profiles; 77 match a genome and 214 are decoys (do not match a genome). The originally released list of 210 profiles (26 Dec 2012) had some defects due to misplaced delimiters, which have since been corrected.
Challenge profiles: The list of profiles which the genomes will be matched against is provided in the CSV format spreadsheet given above (Challenge profile data). The spreadsheet includes an identifier, the phenotypic data, and whether there is additional information. Specifically, the first two columns contain the huID and PGP page URL for the profile. There are many more profiles ("huIDs") than genomes; i.e., most huIDs do not match any available genome. (Every genome matches exactly one huID.) Please note: All traits are self-reported, including gender/sex. In addition to the self-reported traits from questionnaires, some individuals have made available 1) additional phenotypic data (such as blood type, blood lipid analysis, etc.), and 2) genotypic array data (such as from 23andMe). The presence of these additional data is indicated in the "Additional phenotypic data" and "Genotypic array data" columns to the right of the profile URL column, with Y/N indicating the presence/absence of additional data. The additional phenotypic data is listed in tables within the individual profiles and is grouped by categories: Demographics (such as height and weight, blood type), Conditions (medical diagnoses), Allergies, etc. In some cases biometric data and images (such as MRI) are available. For most profiles the genotypic array data column states “N”, indicating no available genotypic array data; if one set/type is available, it is named, and if more than one is available, they are listed separated by semicolons. The additional data are currently available via the individual profile URLs and will also be made available soon in downloadable tar files below. In cases where genotypic array data are available, these data may be used in the prediction process. Indeed, it is expected the array data will make those profile-genome matches straightforward; these matches could be used by predictors as additional training data (see also below).
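As an illustration of the column layout described above, the following is a minimal Python sketch that lists the profiles with genotypic array data. The header labels "huID" and "Profile URL", the sample huIDs, and the example URLs are assumptions made for illustration; the header row of the actual spreadsheet may differ and should be checked before use.

```python
import csv
import io

# Hypothetical excerpt; real header labels, huIDs and URLs may differ.
SAMPLE = (
    "huID,Profile URL,Additional phenotypic data,Genotypic array data\n"
    "huAAAAAA,https://example.org/huAAAAAA,Y,N\n"
    "huBBBBBB,https://example.org/huBBBBBB,N,23andMe;Other array\n"
)


def array_datasets(cell):
    """Return the array data sets named in one cell ("N" means none)."""
    cell = cell.strip()
    if not cell or cell == "N":
        return []
    # Multiple sets/types are separated by semicolons.
    return [name.strip() for name in cell.split(";") if name.strip()]


def profiles_with_arrays(handle):
    """Map huID -> list of array data sets, for profiles that have any."""
    result = {}
    for row in csv.DictReader(handle):
        names = array_datasets(row["Genotypic array data"])
        if names:
            result[row["huID"]] = names
    return result


print(profiles_with_arrays(io.StringIO(SAMPLE)))
```

Profiles returned by such a filter are the ones for which a profile-genome match is expected to be straightforward.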
Training profiles: A second spreadsheet given above (Training profile data) contains 12 profiles for which a genome has already been matched. The format of this spreadsheet is the same as the "Challenge profiles" spreadsheet except that there are two extra columns, “Genomic data” (Complete Genomics and GFF formats), containing the file names for the genomic data for each profile. The profiles on the Training data spreadsheet are not part of the challenge.
Additional phenotypic data (see “Challenge profiles” above)
Additional genotypic array data (see “Challenge profiles” above)
Note that additional genomes were added on 12 January 2013. No additional genomes will be added. It is anticipated, however, that some genomes will become publicly matched to a profile before the close of the challenge. Such cases will immediately become "training data." A list of prematurely revealed matches will be updated regularly.
Challenge submission template: For each genome, give the probability of a match to each of the profiles. In the tab-delimited submission file provided, the first column lists the genome and the first row lists the tab-delimited profile IDs followed by P (for probability) or SD (for standard deviation). Each "*" should be replaced by a probability P of a match (real number, 0 to 1), and a standard deviation SD that indicates the confidence of that prediction. Genomes (rows) for which no predictions are made should be left all "*". These will be treated as probability zero assignments in the assessment. As before, note that every genome matches exactly one huID, but many huID profiles do not match any genome.
Matching challenge submission template format
Genome huID 1-P huID 1-SD huID 2-P huID 2-SD ...
genome 1 * * * * *
genome 2 * * * * *
... * * * * *
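The template layout above could be filled programmatically along the following lines. This is a sketch only: the function names, the genome and huID labels, and the exact header text ("huID-P", "huID-SD") are assumptions, and the provided template file remains the authoritative layout. Leaving individual unpredicted cells as "*" within a partly predicted row is also an assumption; the text above only specifies that wholly unpredicted rows stay all "*".

```python
def build_rows(genomes, profiles, predictions):
    """Build the table as lists of strings.

    predictions maps genome -> {huID: (P, SD)}. Genomes absent from
    predictions keep "*" in every cell, as the template requires.
    """
    header = ["Genome"]
    for hu_id in profiles:
        header += [f"{hu_id}-P", f"{hu_id}-SD"]
    rows = [header]
    for genome in genomes:
        row = [genome]
        per_genome = predictions.get(genome, {})
        for hu_id in profiles:
            if hu_id not in per_genome:
                row += ["*", "*"]  # assumption: unpredicted cells stay "*"
                continue
            p, sd = per_genome[hu_id]
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"P out of range for {genome}/{hu_id}: {p}")
            if sd < 0.0:
                raise ValueError(f"negative SD for {genome}/{hu_id}: {sd}")
            row += [f"{p:g}", f"{sd:g}"]
        rows.append(row)
    return rows


def render(rows):
    """Join cells with tabs, one row per line."""
    return "\n".join("\t".join(row) for row in rows) + "\n"


example = build_rows(
    ["genome 1", "genome 2"],
    ["huAAAAAA", "huBBBBBB"],
    {"genome 1": {"huAAAAAA": (0.9, 0.05), "huBBBBBB": (0.1, 0.05)}},
)
print(render(example))
```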
Optional supplement submission template: In the tab-delimited submission file provided, the first column lists the genome. Each subsequent trio of columns, corresponding to a particular trait, should contain the probability (P, 0 to 1) that the individual has the trait; a standard deviation (SD) indicating the confidence of the probability; and integer(s) indicating the method(s) (M), as defined in the predictor's method write-up. Submit your predictions by replacing the "*" with your value. Multiple methods should be indicated by comma-separated integers. Whitespace other than tabs is optional. There is no constraint on the probabilities (other than the range 0 to 1); the null prediction is "*".
Optional trait challenge submission template format (example)
Genome trait 1-P trait 1-SD trait 1-M trait 2-P trait 2-SD trait 2-M
genome 1 0.8 0.1 4,7 0.1 0.7 3
genome 2 0.5 0.4 7 0.5 0.2 3
No empty cells are allowed in the submission; if you cannot submit predictions for a phenotype, leave the "*" placeholder in these cells. Please make sure you follow the submission guidelines strictly. Validation scripts will be provided so that predictors can check the correctness of the format before submitting their predictions.
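Until a validation script is available, the format rules for the optional supplement can be checked roughly as follows. This is not the official validator; it is a minimal sketch of the rules stated above (no empty cells, P between 0 and 1, a numeric SD, comma-separated integer method codes, "*" as the null value), and the function names are hypothetical. Requiring SD to be non-negative and finite is an added assumption.

```python
import math


def _is_number(text, low=None, high=None):
    """True if text parses as a finite float within the optional bounds."""
    try:
        value = float(text)
    except ValueError:
        return False
    if not math.isfinite(value):
        return False
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


def check_row(line, n_traits):
    """Return a list of problems found in one tab-delimited data row."""
    # Whitespace other than tabs is optional, so strip it from each cell.
    cells = [cell.strip() for cell in line.rstrip("\n").split("\t")]
    expected = 1 + 3 * n_traits
    if len(cells) != expected:
        return [f"expected {expected} cells, found {len(cells)}"]
    problems = []
    if not cells[0]:
        problems.append("empty genome cell")
    for i in range(n_traits):
        p, sd, m = cells[1 + 3 * i: 4 + 3 * i]
        label = f"trait {i + 1}"
        if p != "*" and not _is_number(p, 0.0, 1.0):
            problems.append(f"{label}: P must be '*' or a number in [0, 1]")
        if sd != "*" and not _is_number(sd, 0.0):
            problems.append(f"{label}: SD must be '*' or a non-negative number")
        if m != "*" and not all(x.strip().isdigit() for x in m.split(",")):
            problems.append(f"{label}: M must be '*' or comma-separated integers")
    return problems


print(check_row("genome 1\t0.8\t0.1\t4,7\t0.1\t0.7\t3", n_traits=2))
```

An empty list means the row passed these checks; each string otherwise names the trait and the field that failed.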
Methods: In addition to the template-derived submission files, your submission must include a detailed description of the methods used to make the predictions in a separate file (similar to the style of the Methods section in a scientific article). If you provide the optional supplement, please include the integer designation of each method.
Download PGP challenge validation script (not available)
Download PGP optional challenge validation script (not available)
Updates were made to the template files on 17 Apr 2013 to add an omitted genome (fb74440463ffaaedecdac54b38ba2db7915965a0) and remove a training genome (474576e34ad0f39488d9c9b75946f7a7a4248427). The files are also now properly sorted by genome ID.
To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link: "All submission forms" from the front page of your group. For more details, please read the FAQ page.
Data provided by
George Church, Madeleine Price Ball, Jason Bobe, Ward Vandewege, Tom Clegg, Alexander Wait Zaranek, the rest of the PGP staff (Harvard Medical School), and all of the PGP participants who lent their genomes to the CAGI 2012 experiment.
This challenge is being assessed by Sean Mooney, Buck Institute, California.