
Single amino-acid changes in the human p53 core domain that can restore activity of inactive p53 found in human cancers

Dataset description: public

Prediction datasets: public

The CAGI submission deadline for the p53 challenge has passed as of October 5, 2011, 3PM EDT. We welcome the upload of predictions after the deadline for archival and comparison purposes, but these post-deadline submissions are not part of the CAGI experiment.

This challenge was offered in CAGI 2010. Because experimental setbacks delayed the results, the dataset has been re-opened as a challenge for CAGI 2011.

Background 

The transcription factor p53 is a central tumor suppressor protein that controls DNA repair, cell cycle arrest, and apoptosis (programmed cell death). About half of human cancers have p53 mutations that inactivate p53. Over 250,000 US deaths yearly are due to tumors that express full-length p53 that has been inactivated by a single point mutation. For the past several years, the group of Rick Lathrop at University of California, Irvine, has been engaged in a complete functional census of p53 second-site suppressor (“cancer rescue”) mutations. These cancer rescue mutations are additional amino acid changes, introduced alongside an otherwise cancer-causing p53 mutation, that have been found to rescue p53 tumor suppressor function, reactivating otherwise inactive p53. These intragenic rescue mutations reactivate cancer mutant p53 in yeast and human cell assays by providing structural changes that compensate for the cancer mutation.

Prediction challenge

Predictors are asked to submit predictions on the effect of the cancer rescue mutants on four p53 cancer mutations as measured with yeast and/or human cell lines. The prediction should be the probability of a mutant being active.

Dataset, general description: The dataset is composed of calculations performed on in silico models of the mutant p53 structures. Features per instance (mutant) represent various aspects of the changes induced by the mutations, such as:

  • Changes in properties such as polarity and size in the mutant.
  • Changes in electrostatic and surface-based features in the mutant.
  • Distance measures of the movement of the alpha carbon of each residue.
  • ddG predictions (stability metrics).

Training dataset: Overall, the training dataset contains 16,772 mutants, with each labeled as either ‘active’ or ‘inactive’. These class labels are determined by wet-lab experimental assays of p53 function in yeast and/or human cell lines.

The training dataset represents an exhaustive single-point mutagenesis of the entire core domain of p53 for the following p53 cancer mutations: r175h, r273h, and g245s.

The format here is x#y, where:

x is the single letter code for the wild type amino acid

# is the residue number

y is the single letter code for the mutant amino acid.
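
As an illustration of the x#y notation, the following Python sketch splits a label such as r175h into its three parts. The names Mutation and parse_mutation are hypothetical helpers introduced here, not part of the dataset distribution, and the sketch assumes that labels use the 20 standard one-letter amino acid codes in either upper or lower case.

```python
import re
from typing import NamedTuple

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# One letter, one or more digits, one letter (for example "r175h").
_PATTERN = re.compile(r"^([A-Za-z])(\d+)([A-Za-z])$")


class Mutation(NamedTuple):
    """Hypothetical container for one parsed x#y label."""
    wild_type: str   # x: wild-type amino acid
    position: int    # #: residue number
    mutant: str      # y: mutant amino acid


def parse_mutation(label: str) -> Mutation:
    """Parse a label in x#y form; raise ValueError if it does not fit."""
    match = _PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"not in x#y form: {label!r}")
    wild_type = match.group(1).upper()
    position = int(match.group(2))
    mutant = match.group(3).upper()
    if wild_type not in AMINO_ACIDS or mutant not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code in {label!r}")
    return Mutation(wild_type, position, mutant)


if __name__ == "__main__":
    for label in ("r175h", "r273h", "g245s"):
        print(label, "->", parse_mutation(label))
```

For example, parse_mutation("r175h") yields wild type R (arginine), residue number 175, and mutant H (histidine).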

Additionally, regional saturation mutagenesis of the following p53 cancer mutations is included: h179r, p151s, r280t, p278l, r248l, r273l, r249s, p152l, and r158l. While these mutations comprise most of the dataset, several hundred examples for other p53 cancer mutants are included.

Download training datasets:

Training dataset in ARFF format

More information on the ARFF format can be found at http://www.cs.waikato.ac.nz/~ml/weka/arff.html

Training dataset in CSV format

Training dataset header file

Training dataset mutant list

Prediction dataset: The amino acid mutations are provided for each mutant in case predictors wish to derive predictions not based upon the additional structural data supplied. We welcome creative ideas and alternative predictive strategies.

The basis for our structural models is the core domain of wild-type p53, which can be found in the PDB under the structure ID 1TSR, specifically chain B: http://www.rcsb.org/pdb/explore/explore.do?structureId=1tsr

The prediction dataset represents an exhaustive single-point mutagenesis of the entire core domain of p53 for the following p53 cancer mutations: m237i, r248q, r282w, and y220c. These cancer mutations were specifically chosen for CAGI to contain a diverse set of structural perturbations of p53 structure and function. The test dataset contains 14,668 mutants.

Residue numbering is as in the 1TSR PDB file, where residue number 96 begins the core domain of p53 and residue number 289 terminates the core domain.
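
As a rough sanity check on the size of the prediction dataset, the Python sketch below counts the second-site mutants that an exhaustive scan of residues 96 to 289 would produce. It rests on two assumptions that are not stated above: each position is changed to each of the 19 other amino acids, and the position that already carries the cancer mutation is skipped. The function name count_second_site_mutants is hypothetical.

```python
# Core domain boundaries, using the 1TSR residue numbering given above.
CORE_START = 96
CORE_END = 289

# Assumption: each position is substituted with the 19 other amino acids.
N_SUBSTITUTIONS = 19

# The four cancer mutations in the prediction dataset.
CANCER_MUTATIONS = ["m237i", "r248q", "r282w", "y220c"]


def count_second_site_mutants(n_cancer_mutations: int,
                              exclude_cancer_site: bool = True) -> int:
    """Count double mutants from an exhaustive single-point scan.

    exclude_cancer_site is an assumption: if True, the residue that
    already carries the cancer mutation is not mutated a second time.
    """
    core_length = CORE_END - CORE_START + 1          # 194 residues
    positions = core_length - (1 if exclude_cancer_site else 0)
    return n_cancer_mutations * positions * N_SUBSTITUTIONS


if __name__ == "__main__":
    print(count_second_site_mutants(len(CANCER_MUTATIONS)))
    print(count_second_site_mutants(len(CANCER_MUTATIONS),
                                    exclude_cancer_site=False))
```

Under these assumptions the count is 4 × 193 × 19 = 14,668, which agrees with the size of the test dataset; including the cancer mutation site would give 4 × 194 × 19 = 14,744. The agreement is suggestive only, and the mutant list file remains the authoritative source for which mutants are included.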

Download prediction datasets:

Prediction dataset in ARFF format

Prediction dataset in CSV format

Prediction dataset header file. Please note, the header files for training and prediction sets are identical.

Prediction dataset mutant list

References

For a reference to the primary biological strategies, please see: Baronio R, Danziger SA, Hall LV, Salmon K, Hatfield GW, Lathrop RH, Kaiser P. (2010) All-codon scanning identifies p53 cancer rescue mutations, Nucleic Acids Res. Nov 1;38(20):7079-88. doi: 10.1093/nar/gkq571

Additional references on the computational and biological strategies:

Danziger, S.A., Baronio, R., Ho, L., Hall, L., Salmon, K., Hatfield, G.W., Kaiser, P., and Lathrop, R.H. (2009) Predicting Positive p53 Cancer Rescue Regions Using Most Informative Positive (MIP) Active Learning, PLOS Computational Biology, 5(9), e1000498. doi: 10.1371/journal.pcbi.1000498

Danziger, S.A., Zeng, J., Wang, Y., Brachmann, R.K. and Lathrop, R.H. (2007) Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants, Bioinformatics, 23(13), i104-i114. doi: 10.1093/bioinformatics/btm166

Danziger, S.A., Swamidass, S.J., Zeng, J., Dearth, L.R., Lu, Q., Chen, J.H., Cheng, J., Hoang, V.P., Saigo, H., Luo, R., Baldi, P., Brachmann, R.K. and Lathrop, R.H. (2006) Functional census of mutation sequence spaces: the example of p53 cancer rescue mutants, IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, 3, 114-125. doi: 10.1109/TCBB.2006.22

Prediction Submission Format

The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

Download p53 submission template

NOTE: The validation script that was provided earlier was erroneous. 

Download the corrected (2 Oct 2011) p53 submission validation script (not available)

In the submitted file, each row should include the following columns:

  • Mutants - The double mutant as listed in the prediction dataset file. Use the order provided in the template file
  • Prediction - The probability of the double mutant being active (range: 0-1)
  • Standard deviation - Defines the confidence of the prediction in column 2

In the template file, cells in columns 2-3 are marked with an "*". Submit your predictions by replacing each "*" with your value. No empty cells are allowed in the submission; if you cannot submit a prediction for a double mutant, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.
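
The column rules above can be sketched as a small Python check. This is not the validation script supplied by the organizers; it is a minimal illustration that assumes one data row per line with exactly three tab-separated cells, no header row, and a non-negative standard deviation. The function name check_submission_line and the mutant label used in the usage example are hypothetical placeholders.

```python
import math
from typing import List


def check_submission_line(line: str) -> List[str]:
    """Return the problems found in one tab-delimited row (empty if none)."""
    cells = line.rstrip("\r\n").split("\t")
    if len(cells) != 3:
        return [f"expected 3 tab-separated columns, found {len(cells)}"]

    mutant, prediction, std_dev = cells
    problems = []
    if not mutant.strip():
        problems.append("mutant cell is empty")

    for name, cell in (("prediction", prediction),
                       ("standard deviation", std_dev)):
        if cell == "*":
            continue  # "*" marks a value that is deliberately not submitted
        if not cell.strip():
            problems.append(f"{name} cell is empty; use '*' instead")
            continue
        try:
            value = float(cell)
        except ValueError:
            problems.append(f"{name} is not a number: {cell!r}")
            continue
        if not math.isfinite(value):
            problems.append(f"{name} is not finite: {cell!r}")
        elif name == "prediction" and not 0.0 <= value <= 1.0:
            problems.append(f"prediction outside the range 0-1: {cell!r}")
        elif name == "standard deviation" and value < 0.0:
            problems.append(f"standard deviation is negative: {cell!r}")
    return problems


if __name__ == "__main__":
    # "mutantA" is a placeholder, not a real label from the template.
    for row in ("mutantA\t0.75\t0.10", "mutantA\t*\t*", "mutantA\t1.5\t0.10"):
        print(repr(row), "->", check_submission_line(row) or "ok")
```

A check of this kind is only a convenience for catching formatting slips early; the template file and the organizers' validation script define the accepted format.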

In addition, your submission should include a detailed description of the method used to make the predictions (similar to the style of the Methods section in a scientific article). This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions through the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Dataset provided by

Rick Lathrop and the p53 "cancer rescue" team, University of California Irvine