Breast cancer cell line pharmacogenomics dataset

The experimental results for this challenge have been published: Heiser LM et al., Subtype and pathway specific responses to anticancer compounds in breast cancer.

Dataset description: public
SNP data (including expression and CNV data): public
Cell line exomes: registered users only, limited by CAGI Data use agreement

The CAGI submission deadline for the cancer cell lines challenge has passed as of September 30, 2011, 3PM EDT. We welcome the upload of predictions after the deadline for archival and comparison purposes, but these post-deadline submissions are not part of the CAGI experiment.

Background: Cancer tissues differ in their responsiveness to different drugs. For this experiment, predictors are asked to predict the response of each of 54 breast cancer cell lines to a panel of drugs. The data on the cell lines include transcriptional profiling, SNP data, and copy number profiles, all measured for cells grown in the absence of any treatment. In addition, the exomes of the cell lines are available upon request to the organizers, and we are looking into providing VCF files for each of the cell lines.

Prediction challenge: Participants are asked to predict the response of each of the 54 breast cancer cell lines described in the example dataset to each of the 54 drugs. Each prediction should be a GI50 value with its standard deviation.

Added information 23 September, 2011: The organizers have learned that there are small discrepancies in the names of the cancer cell lines in the data files below. Please read this document for clarification of, and updates to, the cell line names.

Download datasets:

SNP6.0 CEL data for 53 cell lines (NOTE: file size 1.6 GB)
Copy number data
Expression data

Example dataset
The example dataset consists of drug response data on 54 breast cancer cell lines. The three drugs of the example dataset are: BIBW2992, AKT1-2 inhibitor and Erlotinib. For each drug, the GI50 value as measured in each cell line is given.

The prediction dataset, including a list of the 54 drugs and their biological targets

Exome data: The exomes of the cell lines are available upon request to the organizers.

Additional information: Each cell line underwent transcriptional profiling to assess the expression of ~20,000 genes. These represent baseline expression profiles: cells were grown in the absence of any treatment. This data may be downloaded here, or above under the link "Expression data". In the file, each column represents a single breast cancer cell line, and each row represents a single gene. Data are in log2 coordinates.
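
The layout described above (one row per gene, one column per cell line, log2 values) can be read with a few lines of Python. This is only a sketch: the exact header and delimiter of the real file are assumptions, and the gene names, cell line names, and values in the excerpt are hypothetical placeholders, not data from the challenge.

```python
import csv
import io

# Hypothetical excerpt: first column is the gene, remaining columns are cell lines.
# The real file's header and delimiter may differ (assumed tab-delimited here).
EXAMPLE = (
    "gene\tLINE_A\tLINE_B\n"
    "GENE_1\t8.0\t13.0\n"
    "GENE_2\t12.0\t11.0\n"
)

def read_expression(handle):
    """Return {cell_line: {gene: log2_value}} from a genes-by-cell-lines table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cell_lines = header[1:]
    profiles = {name: {} for name in cell_lines}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        for name, value in zip(cell_lines, values):
            profiles[name][gene] = float(value)
    return profiles

def log2_difference(profiles, gene, line_a, line_b):
    """Difference of log2 values, i.e. the log2 fold change of line_a over line_b."""
    return profiles[line_a][gene] - profiles[line_b][gene]

profiles = read_expression(io.StringIO(EXAMPLE))
diff = log2_difference(profiles, "GENE_1", "LINE_B", "LINE_A")
print(diff)       # log2 fold change
print(2 ** diff)  # the same change on the linear scale
```

Because the values are on a log2 scale, subtracting two values gives a log2 fold change, and raising 2 to that difference recovers the linear-scale ratio.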

Copy number profiles represent measures of changes at the DNA level. Normal cells contain two complete copies of each chromosome. Cancer cells are frequently genomically unstable -- regions of the genome can become amplified or deleted, leading to either multiple or no copies of a particular gene. The data represent segmented copy number profiles (processed from the SNP6.0 CEL files): after the number of copies is assessed at the probe level, the data are smoothed using circular binary segmentation. Each segment represents a region of the genome with similar copy number. Aberrant regions can be either focal or large. A PDF figure explaining the copy number data can be downloaded here.
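
As an illustration of how a segmented profile might be used, the sketch below looks up the segment covering a genomic position and labels it as a gain, a loss, or neutral. Everything specific here is an assumption: the field names, the interpretation of the segment value as a log2 ratio relative to two copies, the thresholds of 0.3 and -0.3, and the example coordinates are hypothetical and are not taken from the challenge files.

```python
from typing import NamedTuple, Optional, Sequence

class Segment(NamedTuple):
    chrom: str
    start: int    # first base covered (inclusive)
    end: int      # last base covered (inclusive)
    value: float  # assumed: log2 ratio relative to two copies

    @property
    def length(self) -> int:
        return self.end - self.start + 1

def segment_at(segments: Sequence[Segment], chrom: str, position: int) -> Optional[Segment]:
    """Return the segment covering the position, or None if no segment does."""
    for seg in segments:
        if seg.chrom == chrom and seg.start <= position <= seg.end:
            return seg
    return None

def classify(seg: Segment, gain: float = 0.3, loss: float = -0.3) -> str:
    """Label a segment using hypothetical thresholds on its value."""
    if seg.value >= gain:
        return "gain"
    if seg.value <= loss:
        return "loss"
    return "neutral"

# Hypothetical profile: a short (focal) gain flanked by a neutral and a large lost region.
SEGMENTS = [
    Segment("chr1", 1, 1_000_000, 0.0),
    Segment("chr1", 1_000_001, 1_200_000, 1.5),
    Segment("chr1", 1_200_001, 50_000_000, -0.8),
]

focal = segment_at(SEGMENTS, "chr1", 1_100_000)
large = segment_at(SEGMENTS, "chr1", 2_000_000)
print(classify(focal), focal.length)
print(classify(large), large.length)
```

The segment length gives a simple way to tell focal from large aberrations; where to draw that line is a modelling choice left to the predictor.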

Background information on the breast cancer cell line panel used in this study may be found in:
Neve RM, Chin K, Fridlyand J, Yeh J, Baehner FL, Fevr T, Clark L, Bayani N, Coppe JP, Tong F, Speed T, Spellman PT, DeVries S, Lapuk A, Wang NJ, Kuo WL, Stilwell JL, Pinkel D, Albertson DG, Waldman FM, McCormick F, Dickson RB, Johnson MD, Lippman M, Ethier S, Gazdar A, Gray JW. (2006) A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell. Dec;10(6):515-27. doi:10.1016/j.ccr.2006.10.008

Prediction submission format: The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

Download submission file template
Download validation script

In the submitted file, each row should include the following columns:

  1. Cell line - The cell line as listed in the example dataset file; use the order provided in the submission template file.
  2. Drug name, e.g. "17-AAG" - The predicted GI50 value for that drug. Use the order of drugs as listed in the submission template file header.
  3. Standard deviation - The standard deviation of the prediction in column 2, indicating its confidence.

    ... Repeat columns 2 and 3 for all 54 drugs.

NOTE: Based on information from the data provider, we have learned that for one of the cell lines, "HCC1500", we do not have experimentally measured GI50 values. Therefore, prediction submissions for this cell line cannot be assessed. When submitting predictions, predictors are asked to leave "*" in the "Prediction" and "SD" columns of the "HCC1500" cell line.

In the template file, cells in columns 2-3 are marked with an "*". Submit your predictions by replacing each "*" with your value. No empty cells are allowed in the submission; if you cannot submit a prediction for a cell line or drug, leave the "*" in those cells. Please make sure you follow the submission guidelines strictly.
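
The row layout described above can be sketched in Python as follows. This is not the organizers' validation script and does not replace it; the function names, the number formatting, the placeholder cell line "LINE_A", and the prediction values are all hypothetical, and the template's own header row should be kept exactly as provided.

```python
import csv
import io

MISSING = "*"  # placeholder for cells without a prediction

def build_row(cell_line, drugs, predictions):
    """Build one row: cell line, then (GI50, SD) for each drug in template order.

    predictions maps drug name -> (gi50, sd); drugs without an entry get "*".
    """
    row = [cell_line]
    for drug in drugs:
        if drug in predictions:
            gi50, sd = predictions[drug]
            row += [f"{gi50:.3f}", f"{sd:.3f}"]
        else:
            row += [MISSING, MISSING]
    return row

def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def check_row(row, n_drugs):
    """Basic format check: right width, no empty cells, values numeric or "*"."""
    if len(row) != 1 + 2 * n_drugs:
        return False
    if any(cell.strip() == "" for cell in row):
        return False
    return all(cell == MISSING or is_number(cell) for cell in row[1:])

# Two drugs only, for brevity; the real template lists all 54 in its header.
drugs = ["17-AAG", "Erlotinib"]
rows = [
    build_row("LINE_A", drugs, {"17-AAG": (6.5, 0.4)}),  # hypothetical values
    build_row("HCC1500", drugs, {}),                     # left as "*" throughout
]

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
print(out.getvalue(), end="")
```

Running a check like `check_row` before the official validation script can catch empty cells or a wrong column count early, but only the provided script is authoritative for the accepted format.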

In addition, your submission should include a detailed description of the method used to make the predictions (similar in style to the Methods section of a scientific article). This information should be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by following the link "All submission forms" on the front page of your group. For more details, please read the FAQ page.

Dataset provided by
Joe W. Gray, Lawrence Berkeley National Laboratory.