Critical Assessment of Genome Interpretation

The fitness of tagged transposon insertions in Shewanella oneidensis strain MR-1 under stress conditions

Dataset description: public

Prediction dataset: public

The CAGI submission deadline for the MR-1 challenge has passed as of October 4, 2011, 3PM EDT. We welcome the upload of predictions after the deadline for archival and comparison purposes, but these post-deadline submissions are not part of the CAGI experiment.

Background

Shewanella oneidensis strain MR-1 (formerly known as S. putrefaciens) is a model organism for studying metal reduction, as MR-1 can utilize a wide range of metal ions and solid metals as electron acceptors and also grows aerobically. MR-1 is in the same division of bacteria as E. coli (the γ-Proteobacteria), but they are not closely related. Of the ~4,500 proteins in MR-1, only about a third have orthologs in E. coli. The MR-1 genome sequence was published in 2002 (Heidelberg et al., 2002; doi:10.1038/nbt749), and the annotation has been curated since. A few hundred papers have been published on MR-1, and hundreds of gene expression experiments are publicly available.

The Adam Arkin Lab at UC Berkeley has created a large number of S. oneidensis MR-1 transposon insertions with known location and with a known tag or barcode. These insertions are pooled together into two pools, and the pools are grown under a given (stress) condition for ~6-8 generations. Typically, the stress experiments are performed in LB media with the stressor in well-shaken (aerobic) flasks, and a concentration of the stressor that reduces the growth rate about 2-fold is used.

The abundance of each tagged strain is measured with a microarray at the beginning and at the end of the experiment. The fitness of the strain is the log2 ratio of these abundances. (This is not the same scale as fitness in population genetics.) The data are normalized so that the median strain has a fitness of 0. The fitness value of a gene is computed as the average of the values for the insertions in that gene. In this experiment it is assumed that an insertion in a given gene inactivates that gene.
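The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Arkin Lab pipeline: the function names, strain labels, gene labels, and abundances are all made up, and the real normalization may involve steps that are not described here.

```python
import math
from collections import defaultdict
from statistics import mean, median


def strain_fitness(start_abundance, end_abundance):
    """Return log2(end / start) per strain, shifted so the median strain is 0."""
    raw = {
        strain: math.log2(end_abundance[strain] / start_abundance[strain])
        for strain in start_abundance
    }
    center = median(raw.values())
    return {strain: value - center for strain, value in raw.items()}


def gene_fitness(fitness_by_strain, gene_of_strain):
    """Average the strain values over all insertions in the same gene."""
    per_gene = defaultdict(list)
    for strain, value in fitness_by_strain.items():
        per_gene[gene_of_strain[strain]].append(value)
    return {gene: mean(values) for gene, values in per_gene.items()}


# Hypothetical abundances for five strains (not real data).
start = {"s1": 100.0, "s2": 100.0, "s3": 100.0, "s4": 100.0, "s5": 100.0}
end = {"s1": 25.0, "s2": 50.0, "s3": 100.0, "s4": 100.0, "s5": 200.0}
genes = {"s1": "geneA", "s2": "geneA", "s3": "geneB", "s4": "geneC", "s5": "geneD"}

strains = strain_fitness(start, end)
per_gene = gene_fitness(strains, genes)
```

In this toy example the two insertions in geneA have fitness -2 and -1, so the gene-level value is their average.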

The reliability of these per-gene fitness values is estimated by looking at consistency across different insertions in the same gene and at consistency across the two pools. In a typical experiment, some strains are very sick (fitness < -2 implies little or no growth), some strains are moderately but significantly sick (fitness ~ -1), most strains have fitness near 0 (are neutral), and a handful of strains are advantaged (fitness ~ 1).
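As a rough reading aid, the categories above can be expressed as a small function. Only the -2 threshold comes from the text; the width of the "neutral" band (0.5 here) is an assumption chosen for illustration, and the sketch ignores statistical significance entirely.

```python
def describe_fitness(fitness, neutral_band=0.5):
    """Map a log2 fitness value to a coarse label.

    neutral_band is a hypothetical cutoff, not a value from the challenge.
    """
    if fitness < -2:
        return "very sick"
    if fitness < -neutral_band:
        return "moderately sick"
    if fitness > neutral_band:
        return "advantaged"
    return "neutral"


labels = [describe_fitness(value) for value in (-2.5, -1.0, 0.1, 1.0)]
```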

A study of MR-1 gene-phenotype relationships for 121 conditions has already been published (Deutschbauer et al., 2011). The 2012 challenge is to predict results under eight more conditions.

Prediction challenge

Predictors are asked to predict how insertions in a given MR-1 gene affect the fitness value of that gene in a given condition (stressor). Each submitted prediction should be a numeric value with a standard deviation. The predictions will be assessed against the numeric fitness values (log2 ratios) actually measured for each gene in each condition.

Download: list of genes in MR-1

Tab-delimited format, column descriptions:

scaffoldId: 139 for the main chromosome; 140 for the megaplasmid.
locusId: locus identifier in MicrobesOnline.org, also referred to as the VIMSS id, or as "gene" in the header of some tables.
ec and ecName: the b-number identifier and the gene name for the E. coli ortholog, if any.

Descriptions of 8 challenge experiments (the conditions)

Tab-delimited format, column descriptions:

Chip: the identifier for the experiment; used as the column name in the data tables.
Group and info: the type of experiment and a description of the condition.

Unless otherwise specified, a defined medium with lactate, ammonia, sulfate, and phosphate is used, in air (well shaken).
Anaerobic experiments list an alternate electron acceptor and often list an alternate carbon source (otherwise lactate is the carbon source and electron donor).
Stress experiments are in rich media (LB). Some of these are plain LB experiments (no stress added). Generally enough stress to cut the growth rate in half is added.

Prediction Dataset

The CAGI challenge consists of 8 fitness experiments over 4632 MR-1 genes. Of these, we will reveal the fitness scores of 1732 genes in 4 conditions. Thus the prediction dataset consists of the remaining 2900 genes in those 4 conditions, and the whole gene set (4632 genes) in the other 4 conditions.
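The split can be checked with simple arithmetic. The sketch below assumes that every gene has one value in every condition, which the description does not state, so the totals are upper bounds; the variable names are illustrative.

```python
total_genes = 4632
revealed_genes = 1732
total_conditions = 8
partly_revealed_conditions = 4

# Genes whose values stay hidden in the partly revealed conditions.
hidden_genes = total_genes - revealed_genes
# Conditions in which nothing is revealed.
fully_hidden_conditions = total_conditions - partly_revealed_conditions

revealed_values = revealed_genes * partly_revealed_conditions
to_predict = (
    hidden_genes * partly_revealed_conditions
    + total_genes * fully_hidden_conditions
)

print(hidden_genes, revealed_values, to_predict)
```

Under that assumption there are 2900 hidden genes in each partly revealed condition, 6928 revealed values, and 30128 values to predict, which together account for all 4632 x 8 = 37056 gene-condition pairs.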

Fitness data for half the genes in half the challenge experiments

Tab-delimited format, column descriptions:

Good: TRUE if there are insertion(s) within the central 5-80% of the gene.
Numeric column names are chip numbers.
The values are log2 ratios of the abundance at the end of the experiment versus the beginning, which is usually 6-8 generations of growth. These "fitness values" are normalized so that the median is zero.
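A table in this layout could be read with the Python standard library as sketched below. Several details are assumptions: the identifier column is taken to be named "locusId", missing values are taken to be any non-numeric text such as "NA", and the locus ids and values in the sample are invented. Skipping rows where Good is not TRUE is one possible choice, not a requirement of the challenge.

```python
import csv
import io

# Invented sample in the described layout (not real data).
sample = (
    "locusId\tGood\t300\t302\n"
    "199205\tTRUE\t-0.12\t0.05\n"
    "199206\tFALSE\t-2.31\t-1.90\n"
    "199207\tTRUE\t0.40\tNA\n"
)


def load_fitness(handle, id_column="locusId"):
    """Return (chip column names, {locus id: {chip: fitness}})."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Numeric column names are chip numbers.
    chips = [name for name in reader.fieldnames if name.isdigit()]
    table = {}
    for row in reader:
        if row["Good"] != "TRUE":
            continue  # keep only genes flagged as having good insertions
        values = {}
        for chip in chips:
            try:
                values[chip] = float(row[chip])
            except ValueError:
                pass  # non-numeric cell, treated as missing
        table[row[id_column]] = values
    return chips, table


chips, table = load_fitness(io.StringIO(sample))
```

To read a downloaded file instead of the inline sample, pass an open file handle, for example `open(path, newline="")`, where `path` is wherever the file was saved.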

Example Results

The dataset providing example results contains 195 fitness experiments on MR-1.

Descriptions of 195 fitness experiments

For file format definition, please see the description for the Prediction dataset conditions above.

Fitness data for 195 experiments

For file format definition, please see the description for the Prediction dataset conditions above.

Additional data files on the 195 fitness experiments:

Z scores for the fitness values for 195 experiments

The up-tag and down-tag pools
offset is where the insertion is relative to the start of the gene (i.e. 0 if at the start codon of a protein-coding gene)

per-strain and per-pool fitness data for the 195 experiments

Viewing the data in MeV:

Download MR1_fitness.mev
Run MeV, use the File / Load Data command, use Browse to select MR1_fitness.mev, select two-color array, uncheck load annotation, and select the upper/left-most expression value (under Fe(III)).
Check that the color scheme limits are reasonable (i.e., -3 to 0 to 3) using Display / Set Color Scale Limits.
Use Display / Gene Row Labels to set "comb" as the gene row label (this will show the VIMSS/MicrobesOnline id, the SO number, the gene's name if any, and the gene's description).
You might also want to use Display / Set Element Size and Analysis / Clustering / HCL.

Other resources:

Browse the fitness data at MicrobesOnline
A universal TagModule collection for parallel genetic analysis of microorganisms (Oh et al. 2010)

This describes the Arkin Lab methods, but some changes have been made since the paper was written: the lab now uses many more strains, two pools rather than one, and a single time point for most experiments.

Fitness data for E. coli

Note that the E. coli data were obtained with a rather different measurement technology (a camera measures the size of colonies on 1,000-well plates). Also, the E. coli values are significance scores, not fitness values.

Prediction submission format

The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

Download MR-1 submission template

Download MR-1 submission validation script (not available).

In the submitted file, each row should include the following columns:

The first column should contain the Gene/LocusID information.
Columns 2-9 should contain the predictions of the fitness value for conditions 300, 302, 321, 322, 328, 332, 333 and 336, as defined in Descriptions of 8 experimental conditions for the challenge.

In the template file, some cells in columns 2-9 have pre-filled numbers. These are derived from the revealed experiments as defined in Fitness data for half the genes in half the challenge experiments. The rest of the cells in columns 2-9 are marked with an "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission; if you cannot submit a prediction for a gene, leave the "*" in that cell. Please make sure you follow the submission guidelines strictly. Each prediction cell should contain the prediction and its standard deviation in the following format: Prediction Value (Standard Deviation).
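Since the cell format matters, a small helper for writing and checking cells may be useful. The sketch below is not the organizers' validation script. It assumes a single space before the parenthesis and three decimal places, neither of which is specified above, and it does not handle the template's pre-filled cells, which should be left unchanged.

```python
import re

_NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# Assumed layout: "value (sd)" with exactly one space before the parenthesis.
_PREDICTION = re.compile(rf"^({_NUMBER}) \(({_NUMBER})\)$")


def format_cell(prediction=None, sd=None):
    """Return 'value (sd)', or '*' when no prediction is made."""
    if prediction is None:
        return "*"
    if sd is None or sd < 0:
        raise ValueError("a non-negative standard deviation is required")
    return f"{prediction:.3f} ({sd:.3f})"


def parse_cell(cell):
    """Return (value, sd) for a prediction cell, or None for '*'."""
    if cell == "*":
        return None
    match = _PREDICTION.match(cell)
    if match is None:
        raise ValueError(f"not in 'value (sd)' format: {cell!r}")
    value, sd = float(match.group(1)), float(match.group(2))
    if sd < 0:
        raise ValueError("standard deviation must not be negative")
    return value, sd


def build_row(locus_id, cells, n_conditions=8):
    """Join one gene's cells into a tab-delimited row with no empty cells."""
    if len(cells) != n_conditions or any(cell == "" for cell in cells):
        raise ValueError("expected one non-empty cell per condition")
    return "\t".join([str(locus_id), *cells])


# Hypothetical locus id with one prediction and seven skipped conditions.
row = build_row("199205", [format_cell(-1.25, 0.4)] + ["*"] * 7)
```

Cells for which no prediction is made keep the "*" marker, so every row still has nine non-empty columns.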

In addition, your submission should include a detailed description of the method used to make the predictions (similar to the style of the Methods section in a scientific article). This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link: "All submission forms" from the front page of your group. For more details, please read the FAQ page.

References

The Arkin Lab method for making and assaying pools of transposon-tagged mutants in bacteria: Oh J, Fung E, Price MN, Dehal PS, Davis RW, Giaever G, Nislow C, Arkin AP, Deutschbauer A. A universal TagModule collection for parallel genetic analysis of microorganisms. Nucleic Acids Res. 2010 Aug;38(14):e146. doi: 10.1093/nar/gkq419

Fitness data in yeast using a similar approach (although these were tagged clean deletions, not transposon insertions): Giaever G et al., Functional profiling of the Saccharomyces cerevisiae genome. Nature 418, 387-391 (2002). doi:10.1038/nature00935

Hillenmeyer ME, Fung E, Wildenhain J, Pierce SE, Hoon S, Lee W, Proctor M, St Onge RP, Tyers M, Koller D, Altman RB, Davis RW, Nislow C, Giaever G. The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science. 2008 Apr 18;320(5874):362-5. doi: 10.1126/science.1150021

Fitness data in E. coli (using 1000-well plates and cameras, not pools and microarrays): Nichols RJ, Sen S, Choo YJ, Beltrao P, Zietek M, Chaba R, Lee S, Kazmierczak KM, Lee KJ, Wong A, Shales M, Lovett S, Winkler ME, Krogan NJ, Typas A, Gross CA. Phenotypic landscape of a bacterial cell. Cell. 2011 Jan 7;144(1):143-56. doi:10.1016/j.cell.2010.11.052

Data provided by

Adam M. Deutschbauer, Morgan N. Price, Kelly Wetmore, Wenjun Shao, Jason Baumohl, and Adam P. Arkin from UC Berkeley, and Michelle Nguyen, Raquel Tamse, and Ronald W. Davis from Stanford University.

Center for Critical Assessment of Genome Interpretation
