
The Personal genome project (PGP): predict individuals' phenotypes

Challenge: PGP

Dataset description: public

Genome sequence and phenotype data: public

Last updated: 21 April 2016

This challenge closed at 9:00 PM PST (Pacific Standard Time) on 7 December 2015.

Download answer key, predictions, and assessment: The answer key, predictions, and assessment files are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the files.

Presentations from the CAGI 4 conference: Presentations are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the files.

Summary 

The PGP provides hundreds of publicly available genomes of consenting individuals. The genomes are associated with self-reported phenotype profiles of 239 traits, including diseases and other traits such as color blindness. As in CAGI 2013, newly sequenced genomes are being “lent” by PGP participants to CAGI for a genome-to-phenotype matching challenge. All genomes and phenotype profiles are publicly available for this challenge, but the linkage (which genome matches which phenotype profile) will not be revealed until the close of the CAGI challenge.

Prediction challenge

The challenge is to submit predictions that match genomes to phenotype profiles. The phenotype profiles are lists of self-reported traits from a questionnaire completed by PGP participants. This list has 239 binary traits, which include both Mendelian traits and “complex” traits. Many of the latter have significant environmental components, and many individual traits may be difficult or impossible to predict with current scientific knowledge. In addition to the binary traits from the questionnaire, some PGP participants have provided additional phenotypic and genetic information.

There are 23 genomes in the prediction challenge, each of which matches a phenotype profile. In addition, as "decoys", the challenge includes 78 additional phenotype profiles that do not match any genome. The challenge is to match each genome to a profile. With this final update to the challenge dataset, we have increased the number of genomes from 13 to 23, and we have decreased the number of phenotype profiles from 228 to 101 to match the ratio of profiles to genomes in the PGP challenge in CAGI 2013.
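As a point of reference for the numbers above, the following short Python sketch works out what uniform random guessing would achieve with 23 genomes and 101 candidate profiles. The function name and the random-guessing baseline are illustrative assumptions and are not part of the official CAGI assessment.

```python
from fractions import Fraction

N_GENOMES = 23                      # genomes in the challenge (from the text)
N_DECOYS = 78                       # decoy profiles (from the text)
N_PROFILES = N_GENOMES + N_DECOYS   # total candidate profiles


def uniform_baseline(n_genomes, n_profiles):
    """Return (chance of a correct match per genome, expected number of
    correct matches) when each genome is given a profile uniformly at random."""
    p_match = Fraction(1, n_profiles)
    # By linearity of expectation, the expected count is the sum of the
    # per-genome chances.
    expected_correct = n_genomes * p_match
    return p_match, expected_correct


p_match, expected_correct = uniform_baseline(N_GENOMES, N_PROFILES)
print(f"candidate profiles: {N_PROFILES}")
print(f"chance per genome: {float(p_match):.4f}")
print(f"expected correct matches: {float(expected_correct):.3f}")
```

Under this assumption a random predictor would be expected to match fewer than one genome correctly, which gives a rough floor against which a method can be compared.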

The PGP challenge in CAGI 2013 was very similar in design, and those data may be used for training. All matched PGP genomes and profiles may also be used for training: the data are available at http://blog.personalgenomes.org/2014/05/30/pgp-harvard-data-in-google-cloud-storage/. The organizers provide 145 matched genomes and profiles below, comprising all of the data in the 2013 challenge and all other matched PGP genomes and profiles for which complete survey data are available.

Supplemental challenge

An optional, “interest only” supplement will be to predict individual traits from genomes.

Because the phenotype profiles of all PGP participants in CAGI are already public, and the challenge is to match entire profiles to genomes, it is very difficult to have a valid assessment of individual trait predictions. Inference of a single readily-predictable trait might identify a matching profile with high confidence; that profile could then be used to “backfill” other traits that would not otherwise have been readily predicted. As such, this year we cannot have an individual trait assessment challenge which meets the CAGI standard of blind prediction from genetic information.

We nevertheless offer this supplement, because predictions of individual traits from genetic information have several areas of potential utility for PGP and CAGI. First and foremost, these predictions may be a service to the PGP participants who donated their genomes because phenotype profiling is imperfect. Individuals may be at risk from an undiagnosed disease, or at high risk of developing a particular disease in future, or have unrecognized traits. Predictors are asked to assign traits on this basis, even when the assignment may differ from current profile information (for example, assigning a high risk to type II diabetes, even though the individual does not currently have the disease). The PGP participants look forward to seeing predictions from their genomes, and learning from them.

In addition, this year’s individual trait predictions may have value as an imperfect baseline for comparison with future CAGI experiments. If the ability to predict some trait is significantly better in the future than this year, it is likely due to improved methods. (If future prediction ability is worse, we cannot tell for certain whether that may be due to the full-profile availability this year, and thus would be uninformative.) Finally, it may be useful to discuss methods for predicting individual traits in the context of these data, even if the results cannot be effectively assessed.

In order to make this supplement as valuable as possible, predictors should not use the complete trait information in making their predictions. We emphasize that because this trait profile information is available to all predictors, this supplement is not a CAGI challenge, and cannot be used for comparison between different groups or to reliably measure the current state-of-the-art in predicting individual traits.

Download genomic sequence dataset: 

4-PGP_genomes_dataset_var_v2.tar.bz2 (6.5 GB) 

4-PGP_genomes_dataset_vcf_v2.tar.bz2 (4.4 GB) 

Download genomic sequence training data:

4-PGP_genomes_training_var.tar.bz2 (33 GB) 

4-PGP_genomes_training_vcf.tar.bz2 (23 GB) 

The genomic data were obtained by sequencing DNA purified from blood. Sequencing was performed by Complete Genomics. The data files are provided in the Complete Genomics "var" format, described here. In addition, the data have been converted from var format to vcf format using the mkvcf command from cgatools. These vcf files are also provided, but participants should be aware that mkvcf is a beta release, and the organizers cannot guarantee that the vcf files contain all of the information present in the var files. Please note: PGP has posted the complete sequence data as provided by the sequencing provider. Current WGS technology typically does not provide specific tandem repeat data.

Download phenotype profile data:

4-PGP_profile_dataset_v2.txt (42 KB) 

4-PGP_profiles_dataset_json_v2.tar.bz2 (22 KB) 

Download phenotype profile training data:

4-PGP_profiles_training.txt (58 KB) 

4-PGP_profiles_training_json.tar.bz2 (67 KB) 

The list of profiles to which the genomes will be matched is provided in the TSV format spreadsheet given above. The first three columns contain the identifier (huID) and PGP web page URLs for the profile data in HTML and JSON format. The remaining columns contain the binary phenotype profile data. There are many more profiles than genomes; i.e., most huIDs do not match any available genome. (Every genome matches exactly one huID.) Please note: all traits are self-reported, including gender/sex. In addition to the self-reported traits from questionnaires, some individuals may have made available additional phenotypic data (such as blood type, blood lipid analysis, etc.). These additional phenotypic data are listed in tables within the individual profiles and are grouped by categories: Demographics (such as height and weight, blood type), Conditions (medical diagnoses), Allergies, etc. In some cases biometric data and images (such as MRI) may be available. The additional phenotypic data are available via the individual profile URLs and can be searched effectively using the JSON format. The complete set of extended profiles in JSON format is provided above.
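The column layout described above (huID, two URL columns, then the binary traits) can be read with a few lines of Python. This is a minimal sketch: it assumes the file has a header row naming the traits, and the sample rows, URLs, and trait coding ("1"/"0") below are made up for illustration rather than taken from the real file.

```python
import csv
import io


def load_profiles(handle):
    """Read a profile TSV whose first three columns are the huID and two
    URLs, and whose remaining columns are binary traits.

    Returns {huID: {trait_name: raw_cell_value}}.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)            # assumed: first row holds column names
    trait_names = header[3:]
    profiles = {}
    for row in reader:
        if not row:                  # skip blank lines
            continue
        profiles[row[0]] = dict(zip(trait_names, row[3:]))
    return profiles


# Hypothetical two-profile sample standing in for 4-PGP_profile_dataset_v2.txt.
SAMPLE = (
    "huID\thtml_url\tjson_url\tAsthma\tColor blindness\n"
    "huAAAAAA\thttp://example.org/a\thttp://example.org/a.json\t1\t0\n"
    "huBBBBBB\thttp://example.org/b\thttp://example.org/b.json\t0\t0\n"
)

profiles = load_profiles(io.StringIO(SAMPLE))
for hu_id, traits in profiles.items():
    print(hu_id, traits)
```

With the real file, the same function could be called on an open file handle, for example `with open("4-PGP_profile_dataset_v2.txt", newline="") as fh: profiles = load_profiles(fh)`, after checking that the actual header and value coding match these assumptions.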

Challenge submission format 

For each genome, give the probability of a match to each of the profiles. In the tab-delimited submission file provided, the first column lists the genome and the first row lists the profile IDs followed by P (for probability) or SD (for standard deviation). Each "*" should be replaced by a probability P of a match (real number, 0 to 1) and a standard deviation SD indicating the confidence of that prediction. Genomes (rows) for which no predictions are made should be left all "*". These will be treated as probability zero assignments in the assessment. As described above, note that every genome matches exactly one profile (huID), but many profiles do not match any genome. For each genome, optional comments on the basis of the predictions may be given; otherwise, leave the "*" in these cells.
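To illustrate the row layout described above, the following Python sketch turns per-profile scores for one genome into a tab-delimited row of P and SD cells. Several points are assumptions made for the sketch and should be checked against the provided template: that the P and SD cells alternate per profile, that the comment cell comes last, and that scores are non-negative and normalized so that a genome's probabilities sum to 1. The function name and sample identifiers are hypothetical.

```python
def submission_row(genome_id, scores, sds, profile_ids, comment="*"):
    """Build one tab-delimited row: genome ID, then a P cell and an SD cell
    for each profile, then a comment cell.

    scores: {profile_id: non-negative score}; missing profiles count as 0.
    sds:    {profile_id: standard deviation} for every profile in profile_ids.
    A genome with no positive score is left as all "*" (no prediction).
    """
    total = sum(scores.get(pid, 0.0) for pid in profile_ids)
    cells = [genome_id]
    for pid in profile_ids:
        if total <= 0:
            cells += ["*", "*"]      # no prediction made for this genome
        else:
            cells.append(f"{scores.get(pid, 0.0) / total:.4f}")
            cells.append(f"{sds[pid]:.4f}")
    cells.append(comment)
    return "\t".join(cells)


# Hypothetical identifiers and values, for illustration only.
profile_ids = ["huAAAAAA", "huBBBBBB", "huCCCCCC"]
scores = {"huAAAAAA": 3.0, "huBBBBBB": 1.0}
sds = {"huAAAAAA": 0.1, "huBBBBBB": 0.05, "huCCCCCC": 0.0}

predicted = submission_row("genome_01", scores, sds, profile_ids)
unpredicted = submission_row("genome_02", {}, {}, profile_ids)
print(predicted)
print(unpredicted)
```

Normalizing each row reflects the fact that every genome matches exactly one profile, but the text does not state that rows must sum to 1, so this step can be dropped if a method produces calibrated probabilities directly.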

Optional supplement submission format 

In the tab-delimited submission file provided, the first column lists the genome. Each subsequent trio of columns, corresponding to a particular trait, should contain the probability (P, 0 to 1) that the individual has the trait; a standard deviation (SD) indicating the confidence of the probability; and integer(s) indicating the method(s) (M), as defined in the predictor's method write-up. Submit your predictions by replacing the "*" with your value. Multiple methods should be indicated by comma-separated integers. Whitespace other than tabs is optional. There is no constraint on the probabilities (other than the range 0 to 1); the null prediction is "*".
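The trio of cells per trait (P, SD, M) can be produced by a small helper such as the Python sketch below. The function name is hypothetical, and requiring all three values for any non-null prediction is a simplifying assumption of this sketch, not a rule stated in the text.

```python
def trait_cells(p=None, sd=None, methods=()):
    """Return the three cells [P, SD, M] for one trait of one genome.

    With no probability, the null prediction ["*", "*", "*"] is returned.
    Otherwise p must lie in [0, 1], sd must be given, and methods must be a
    non-empty sequence of integers, written as comma-separated values.
    """
    if p is None:
        return ["*", "*", "*"]
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {p}")
    if sd is None or not methods:
        raise ValueError("a prediction needs an SD and at least one method")
    method_cell = ",".join(str(int(m)) for m in methods)
    return [f"{p:.3f}", f"{sd:.3f}", method_cell]


# One genome, two traits: a prediction made with methods 1 and 3, then a null.
row = ["genome_01"] + trait_cells(0.8, 0.1, [1, 3]) + trait_cells()
print("\t".join(row))
```

The method integers only have meaning through the predictor's method write-up, so the same numbering should be used in both places.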

No empty cells are allowed in the submission; if you cannot submit predictions for a phenotype, leave the sign "*" in these cells. Please make sure you follow the submission guidelines strictly. Validation scripts are provided, and predictors should check the correctness of the format before submitting their predictions. In addition to the template-derived submission files, your submission must include a detailed description of the methods used to make the predictions in a separate file, similar to the style of the Methods section in a scientific article. If you provide the optional supplement, please include the integer designation of each method.
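Before running the official validation scripts, a quick local check for the "no empty cells" rule can catch simple mistakes. The Python sketch below is not the CAGI validation script and does not replace it; the function name, the sample header names (such as huAAAAAA_P), and the choice to skip the header row when looking for empty cells are assumptions made for illustration.

```python
def find_format_problems(text):
    """Return a list of human-readable problems found in a tab-delimited
    submission: rows whose cell count differs from the header, and empty
    cells in data rows."""
    problems = []
    lines = text.rstrip("\n").split("\n")
    n_cols = len(lines[0].split("\t"))
    for line_no, line in enumerate(lines, start=1):
        cells = line.split("\t")
        if len(cells) != n_cols:
            problems.append(
                f"line {line_no}: expected {n_cols} cells, found {len(cells)}"
            )
        if line_no == 1:
            continue                 # header row is not checked for empties
        for col_no, cell in enumerate(cells, start=1):
            if cell.strip() == "":
                problems.append(f"line {line_no}, column {col_no}: empty cell")
    return problems


# Hypothetical miniature submissions, for illustration only.
GOOD = (
    "genome\thuAAAAAA_P\thuAAAAAA_SD\n"
    "genome_01\t0.75\t0.10\n"
    "genome_02\t*\t*\n"
)
BAD = (
    "genome\thuAAAAAA_P\thuAAAAAA_SD\n"
    "genome_01\t0.75\t\n"
)

print(find_format_problems(GOOD))
print(find_format_problems(BAD))
```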


Download challenge submission template file: The dataset file is only available for registered users. Please log in to access the file.

Download challenge submission validation script: This submission validation script is available only to registered users. Please log in to access the file.

Download supplement submission template: The dataset file is only available for registered users. Please log in to access the file.

Download optional challenge validation script: This submission validation script is available only to registered users. Please log in to access the file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link: "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Data provided by 

George Church, Madeleine Price Ball, Abram Connelly, Preston Estep, Nancy Ouyang, Ward Vandewege, the rest of the PGP staff (Harvard Medical School), and all of the PGP participants who lent their genomes to the CAGI 4 experiment.

Updates

4 Sep 2015 (v01): initial release 

6 Oct 2015 (v02): re-release of 13 challenge genomes in var and vcf formats; release of challenge profiles in json format; release of submission template files 

12 Oct 2015 (v03): training data released 

26 Oct 2015 (v04): challenge genomes increased from 13 to 23; profiles decreased from 228 to 101 

28 Oct 2015 (v05): submission instructions and templates updated, validation scripts provided 

12 Nov 2015 (v06): improved validation script provided 

18 Dec 2015 (v07): answer key provided 

18 Mar 2016 (v08): predictions provided 

21 Apr 2016 (v09): presentations provided