
Breast cancer cell line pharmacogenomics dataset

The experimental results for this challenge have been published: Heiser LM et al., Subtype and pathway specific responses to anticancer compounds in breast cancer.

Dataset description: public

SNP data (including expression and CNV data): public

Cell line exomes: registered users only, limited by CAGI Data Use Agreement

The CAGI submission deadline for the cancer cell lines challenge passed on September 30, 2011, at 3 PM EDT. We welcome the upload of predictions after the deadline for archival and comparison purposes, but these post-deadline submissions are not part of the CAGI experiment.


Cancer tissues respond differently to different drugs. For this experiment, predictors are asked to predict the response of each of 54 breast cancer cell lines to a panel of drugs. Data about the cell lines include transcriptional profiling, SNP data and copy number profiles measured for cells grown in the absence of any treatment. In addition, the exomes of the cell lines are available upon request by email to the organizers, and we are looking into providing VCF files for each of the cell lines.

Prediction challenge

Participants are asked to predict the response of each of the 54 breast cancer cell lines described in the example dataset to each of the 54 drugs. Each prediction should be a GI50 value with its standard deviation.
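One possible way to obtain a GI50 value together with a standard deviation is to summarize several repeated predictions (for example, from resampled models) for the same cell line and drug. The following is a minimal sketch only; the helper name `summarize_predictions` and the numbers are hypothetical and are not part of the challenge materials.

```python
from statistics import mean, stdev


def summarize_predictions(values):
    """Return (mean, sample standard deviation) of repeated GI50 predictions.

    `values` holds predictions for one cell line/drug pair, e.g. from
    bootstrap replicates. Hypothetical helper, not provided by CAGI.
    """
    if len(values) < 2:
        raise ValueError("need at least two predictions to estimate an SD")
    return mean(values), stdev(values)


# Made-up replicate predictions for a single cell line and drug.
replicates = [5.0, 5.5, 6.0]
gi50, sd = summarize_predictions(replicates)
print(f"{gi50:.2f}\t{sd:.2f}")
```

The two printed values correspond to the "Prediction" and "SD" entries for one drug in the submission file.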

Added information 23 September, 2011: The organizers have learned that in the data files below, there are small discrepancies in the names of the cancer cell lines. Please read this document for clarifications and updates on the cell line names.

Download datasets:

SNP6.0 CEL data for 53 cell lines (note: file size 1.6 GB)

Copy number data

Expression data

Example dataset

The example dataset consists of drug response data on 54 breast cancer cell lines. The three drugs of the example dataset are: BIBW2992, AKT1-2 inhibitor and Erlotinib. For each drug, the GI50 value as measured in each cell line is given.

The prediction dataset, including a list of the 54 drugs and their biological targets

Exome data: The exomes of the cell lines are available upon request by email to the organizers.

Additional information

Each cell line underwent transcriptional profiling to assess the expression of ~20,000 genes. These are baseline expression profiles: cells were grown in the absence of any treatment. The data may be downloaded above under the link "Expression data". In the file, each column represents a single breast cancer cell line, and each row represents a single gene. Values are on a log2 scale.
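As an illustration of working with a genes-by-cell-lines matrix in log2 units, the sketch below parses a small table and converts a log2 difference into a fold change. It assumes a tab-delimited layout with a header row of cell line names and the gene identifier in the first column; the actual file layout may differ, and the names `read_expression`, `LINE_A`, `LINE_B` and `GENE1` are made up for the example.

```python
import csv
import io


def read_expression(handle):
    """Parse a genes-by-cell-lines matrix into {cell_line: {gene: log2_value}}.

    Assumes a tab-delimited layout with a header row of cell line names and
    the gene identifier in the first column; the real file may differ.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cell_lines = header[1:]
    profiles = {name: {} for name in cell_lines}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        for name, value in zip(cell_lines, values):
            profiles[name][gene] = float(value)
    return profiles


# Tiny made-up example (values are not real measurements).
example = "gene\tLINE_A\tLINE_B\nGENE1\t8.0\t10.0\nGENE2\t6.5\t6.0\n"
profiles = read_expression(io.StringIO(example))

# A log2 difference of 2 corresponds to a 2**2 = 4-fold difference.
fold = 2 ** (profiles["LINE_B"]["GENE1"] - profiles["LINE_A"]["GENE1"])
print(fold)  # 4.0
```

Because the values are log2-transformed, differences between cell lines translate into fold changes by exponentiation, as in the last step.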

Copy number profiles measure changes at the DNA level. Normal cells contain two complete copies of each chromosome. Cancer cells are frequently genomically unstable: regions of the genome can become amplified or deleted, leading to either multiple or no copies of a particular gene. The data are segmented copy number profiles (processed from the SNP6.0 CEL files): after the number of copies is assessed at the probe level, the data are smoothed using circular binary segmentation. Each segment represents a region of the genome with similar copy number. Aberrant regions can be either focal or large. A PDF figure explaining the copy number data is available for download.
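Since each segment covers a genomic region with similar copy number, a per-gene value can be derived by looking up the segments that overlap the gene. The sketch below shows one simple way to do this, an overlap-weighted mean; it does not perform circular binary segmentation itself. The `Segment` fields, the coordinate convention and the example numbers are assumptions for illustration, and the units of the segment values depend on the actual data file.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int    # inclusive (assumed convention)
    end: int      # exclusive (assumed convention)
    value: float  # segment mean; units depend on the data file


def gene_copy_value(segments, chrom, gene_start, gene_end):
    """Overlap-weighted mean segment value across a gene interval.

    Returns None if no segment overlaps the gene.
    """
    total = 0.0
    covered = 0
    for seg in segments:
        if seg.chrom != chrom:
            continue
        overlap = min(seg.end, gene_end) - max(seg.start, gene_start)
        if overlap > 0:
            total += seg.value * overlap
            covered += overlap
    return total / covered if covered else None


# Made-up segments: a gene spanning a breakpoint gets a blended value.
segments = [
    Segment("chr1", 0, 1000, 0.0),
    Segment("chr1", 1000, 3000, 1.0),
]
print(gene_copy_value(segments, "chr1", 500, 1500))  # 0.5
```

A gene that lies entirely within one segment simply takes that segment's value; only genes spanning a breakpoint are averaged.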

Background information on the breast cancer cell line panel used in this study may be found in:

Neve RM, Chin K, Fridlyand J, Yeh J, Baehner FL, Fevr T, Clark L, Bayani N, Coppe JP, Tong F, Speed T, Spellman PT, DeVries S, Lapuk A, Wang NJ, Kuo WL, Stilwell JL, Pinkel D, Albertson DG, Waldman FM, McCormick F, Dickson RB, Johnson MD, Lippman M, Ethier S, Gazdar A, Gray JW. (2006) A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell. Dec;10(6):515-27. doi:10.1016/j.ccr.2006.10.008

Prediction submission format

The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

Download submission file template

Download validation script (not available).

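In the submitted file, each row corresponds to one cell line and should include the following columns:

Column 1: cell line name
Column 2: "Prediction", the predicted GI50 value for the first drug
Column 3: "SD", the standard deviation of that prediction
... Repeat columns 2 and 3 for all 54 drugs.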

NOTE: Based on information from the data provider, we have learned that for one of the cell lines, "HCC1500", we do not have the experimentally measured GI50 values. Therefore, prediction submissions for this cell line cannot be assessed. When submitting predictions, predictors are asked to leave "*" in the "Prediction" and "SD" columns of the "HCC1500" cell line.

In the template file, cells in columns 2-3 are marked with an "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for a cell line or drug, leave the sign "*" in these cells. Please make sure you follow the submission guidelines strictly.
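The rules above can be checked locally before uploading. The following is a rough sketch of such a check, not the organizers' validation script. It assumes that the file has one header row, that column 1 holds the cell line name, and that it is followed by a "Prediction" and an "SD" column for each drug; the function names `check_row` and `check_submission` are hypothetical.

```python
def check_row(fields, n_drugs=54):
    """Return a list of problems found in one data row (empty if none)."""
    problems = []
    expected = 1 + 2 * n_drugs  # cell line name + (Prediction, SD) per drug
    if len(fields) != expected:
        problems.append(f"expected {expected} columns, found {len(fields)}")
        return problems
    cell_line, values = fields[0], fields[1:]
    for i, cell in enumerate(values, start=2):
        if cell == "":
            problems.append(f"column {i} is empty; use '*' instead")
        elif cell != "*":
            try:
                float(cell)
            except ValueError:
                problems.append(f"column {i} is not a number or '*': {cell!r}")
            else:
                # HCC1500 has no measured GI50 values, so it must keep "*".
                if cell_line == "HCC1500":
                    problems.append(f"column {i}: HCC1500 must keep '*'")
    return problems


def check_submission(text, n_drugs=54, has_header=True):
    """Check tab-delimited submission text; returns {line_number: problems}."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if has_header and number == 1:
            continue  # assumed header row is not checked
        problems = check_row(line.split("\t"), n_drugs)
        if problems:
            report[number] = problems
    return report


# Made-up example with a single drug to keep the rows short.
sample = (
    "CellLine\tPrediction\tSD\n"
    "LINE_A\t5.2\t0.3\n"
    "HCC1500\t*\t*\n"
    "LINE_B\t\t0.1\n"
)
print(check_submission(sample, n_drugs=1))
```

In this example only the last row is reported, because its "Prediction" cell is empty rather than filled with a value or "*".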

In addition, your submission should include a detailed description of the method used to make the predictions (similar in style to the Methods section of a scientific article). This description should be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions through the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Dataset provided by

Joe W. Gray, Lawrence Berkeley National Laboratory.