The fitness of tagged transposon insertions in Shewanella oneidensis strain MR-1 under stress conditions

Dataset description: public
Prediction datasets: public

The CAGI submission deadline for the MR-1 challenge has passed as of October 4, 2011, 3PM EDT. We welcome the upload of predictions after the deadline for archival and comparison purposes, but these post-deadline submissions are not part of the CAGI experiment.

Background: Shewanella oneidensis strain MR-1 (formerly known as S. putrefaciens) is a model organism for studying metal reduction, as MR-1 can utilize a wide range of metal ions and solid metals as electron acceptors and also grows aerobically. MR-1 is in the same division of bacteria as E. coli (the γ-Proteobacteria), but they are not closely related. Of the ~4,500 proteins in MR-1, only about a third have orthologs in E. coli. The MR-1 genome sequence was published in 2002 (Heidelberg et al., 2002; doi:10.1038/nbt749) and the annotation has been curated since. A few hundred papers have been published on MR-1, and hundreds of gene expression experiments are publicly available.

The Adam Arkin Lab at UC Berkeley has created a large number of S. oneidensis MR-1 transposon insertions with known location and with a known tag or barcode. These insertions are pooled together into two pools, and the pools are grown under a given (stress) condition for ~6-8 generations. Typically, the stress experiments are performed in LB media with the stressor in well-shaken (aerobic) flasks, and a concentration of the stressor that reduces the growth rate about 2-fold is used.

The abundance of each tagged strain is measured with a microarray at the beginning and at the end of the experiment. The fitness of the strain is the log2 ratio of these abundances (end versus beginning). (This is not the same scale as fitness in population genetics.) The data is normalized so that the median strain has a fitness of 0. The fitness value of a gene is computed as the average of the values for the insertions in that gene. In this experiment it is assumed that the insertions in a given gene deactivate that gene.
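
The calculation described above can be sketched in a few lines of Python. This is only an illustration of the three steps (log2 ratio, median normalization, per-gene average), not the Arkin Lab pipeline: the strain names, gene names, and abundances are made up, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, median


def strain_log_ratios(start, end):
    """Return log2(end / start) for every strain; abundances are assumed positive."""
    return {strain: math.log2(end[strain] / start[strain]) for strain in start}


def center_on_median(values):
    """Shift all values so that the median strain has fitness 0."""
    mid = median(values.values())
    return {strain: value - mid for strain, value in values.items()}


def gene_fitness(strain_fitness, strain_to_gene):
    """Average the strain fitness values over the insertions in each gene."""
    grouped = defaultdict(list)
    for strain, value in strain_fitness.items():
        grouped[strain_to_gene[strain]].append(value)
    return {gene: mean(vals) for gene, vals in grouped.items()}


# Made-up abundances for five hypothetical strains (not real data).
start = {"s1": 100.0, "s2": 100.0, "s3": 100.0, "s4": 100.0, "s5": 100.0}
end = {"s1": 50.0, "s2": 200.0, "s3": 200.0, "s4": 400.0, "s5": 200.0}
strain_to_gene = {"s1": "geneA", "s2": "geneA", "s3": "geneB",
                  "s4": "geneC", "s5": "geneC"}

per_strain = center_on_median(strain_log_ratios(start, end))
per_gene = gene_fitness(per_strain, strain_to_gene)
print(per_gene)
```

In this toy example the raw log2 ratios are -1, 1, 1, 2, 1, so the median of 1 is subtracted from every strain; geneA then averages to -1, geneB to 0, and geneC to 0.5.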

The reliability of these per-gene fitness values is estimated by looking at consistency across different insertions in the same gene and at consistency across the two pools. In a typical experiment, some strains are very sick (fitness < -2 implies little or no growth), some strains are moderately but significantly sick (fitness ~ -1), most strains have fitness near 0 (are neutral), and a handful of strains are advantaged (fitness ~ 1).

Prediction challenge: Predictors are asked to submit predictions on how insertions in the given gene of MR-1 affect the fitness of that gene in a given condition (stressor). The submitted prediction should be a numeric value with a standard deviation. The predictions will be assessed against the numeric fitness values (log2 ratios) actually measured for each gene in each condition.

Download:

list of genes in MR-1

    Tab-delimited format, column descriptions:
  • scaffoldId: 139 for the main chromosome; 140 for the megaplasmid.
  • locusId: locus identifier in MicrobesOnline.org, also referred to as VIMSS id, or as "gene" in the header of some tables.
  • ec and ecName: the b-number identifier and the gene name for the E. coli ortholog, if any.

Descriptions of 8 challenge experiments (the conditions)

    Tab-delimited format, column descriptions:
  • Chip: the identifier for the experiment; used as the column name in the data tables.
  • Group and info: the type of experiment and a description of the condition.
    • Unless otherwise specified, a defined medium with lactate, ammonia, sulfate, and phosphate is used, in air (well shaken).
    • Anaerobic experiments list an alternate electron acceptor and often list an alternate carbon source (otherwise lactate is the carbon source and electron donor).
    • Stress experiments are in rich media (LB). Some of these are plain LB experiments (no stress added). Generally enough stress to cut the growth rate in half is added.

Prediction dataset: The CAGI challenge consists of 8 fitness experiments over 4632 MR-1 genes. Of these, we will reveal the fitness scores of 1732 genes over 4 conditions. Thus the prediction dataset will be the remaining 2900 genes in those 4 conditions, and the whole gene set (4632 genes) in the remaining 4 conditions.
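
As a quick check of the numbers above, the size of the prediction task can be worked out directly from the gene and condition counts. The variable names are illustrative only.

```python
TOTAL_GENES = 4632          # genes in the challenge
REVEALED_GENES = 1732       # genes whose fitness is revealed in 4 conditions
PARTIAL_CONDITIONS = 4      # conditions with revealed values for some genes
HIDDEN_CONDITIONS = 4       # conditions with no revealed values

hidden_genes = TOTAL_GENES - REVEALED_GENES               # genes to predict in the partial conditions
revealed_cells = REVEALED_GENES * PARTIAL_CONDITIONS      # gene/condition pairs given to predictors
to_predict = (hidden_genes * PARTIAL_CONDITIONS
              + TOTAL_GENES * HIDDEN_CONDITIONS)          # gene/condition pairs to predict

print(hidden_genes, revealed_cells, to_predict)
```

This gives 2900 hidden genes in the partially revealed conditions, 6928 revealed gene/condition values, and 30128 gene/condition values to predict, which together cover all 4632 x 8 = 37056 cells.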

Fitness data for half the genes in half the challenge experiments

    Tab-delimited format, column descriptions:
  • Good: TRUE if there are insertion(s) within the central 5-80% of the gene.
  • Numeric column names are chip numbers
  • The values are log2 ratios of the abundance at the end of the experiment versus the beginning, which is usually 6-8 generations of growth. These "fitness values" are normalized so that the median is zero.
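
A minimal sketch of reading a table in this format with the Python standard library is shown below. It assumes the locus identifier column is headed "gene" (the gene list description notes that some tables use this header) and that the Good column holds the literal strings TRUE or FALSE. The locus identifiers and fitness values in the sample are invented for illustration and are not taken from the real file.

```python
import csv
import io

# Invented sample in the described layout: an identifier column, a Good flag,
# and one numeric-named column per chip.
SAMPLE = (
    "gene\tGood\t300\t302\n"
    "200001\tTRUE\t-0.12\t0.05\n"
    "200002\tFALSE\t-2.40\t-1.10\n"
    "200003\tTRUE\t0.90\t-0.03\n"
)


def read_fitness(handle, good_only=True):
    """Return (chips, table) where table maps gene id -> {chip: fitness}."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Columns whose names are all digits are treated as chip numbers.
    chips = [name for name in reader.fieldnames if name.isdigit()]
    table = {}
    for row in reader:
        if good_only and row["Good"] != "TRUE":
            continue  # skip genes flagged Good = FALSE
        table[row["gene"]] = {chip: float(row[chip]) for chip in chips}
    return chips, table


chips, table = read_fitness(io.StringIO(SAMPLE))
print(chips)
print(sorted(table))
```

For a downloaded file, the same function can be given an open file handle instead of the `io.StringIO` wrapper. Whether to drop genes with Good = FALSE is a choice left to the predictor; the flag is kept as an option here.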

Example results: The dataset providing example results contains 195 fitness experiments on MR-1.

Descriptions of the 195 fitness experiments
For file format definition, please see the description for the Prediction dataset conditions above.

Fitness data for 195 experiments
For file format definition, please see the description for the Prediction dataset fitness values above.

Additional data files on the 195 fitness experiments:

Viewing the data in MeV:

  1. Download MR1_fitness.mev.
  2. Run MeV, use the File / Load Data command, use Browse to select MR1_fitness.mev, select two-color array, uncheck load annotation, and select the upper/left-most expression value (under Fe(III)).
  3. Check that the color scheme limits are reasonable (i.e., -3 to 0 to 3) using Display / Set Color Scale Limits.
  4. Use Display / Gene Row Labels to set "comb" as the gene row label (this will show the VIMSS/MicrobesOnline id, the SO number, the gene's name if any, and the gene's description).
  5. You might also want to use Display / Set Element Size and Analysis / Clustering / HCL.

Other resources:

  • Browse the fitness data at MicrobesOnline
  • A universal TagModule collection for parallel genetic analysis of microorganisms (Oh et al. 2010)
    • This describes the Arkin Lab methods, but some changes have been made since the paper was written: currently using many more strains; using two pools rather than one; and using a single time point for most experiments.
  • Fitness data for E. coli
    • Note, the E. coli data uses a rather different measurement technology (camera to measure the size of colonies on 1,000-well plates). Also, the E. coli values are significance scores, not fitness values.

References:
The Arkin Lab method for making and assaying pools of transposon-tagged mutants in bacteria:
Oh J, Fung E, Price MN, Dehal PS, Davis RW, Giaever G, Nislow C, Arkin AP, Deutschbauer A. A universal TagModule collection for parallel genetic analysis of microorganisms. Nucleic Acids Res. 2010 Aug;38(14):e146. doi: 10.1093/nar/gkq419

Fitness data in yeast using a similar approach (although these were tagged clean deletions, not transposon insertions):
Giaever G et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418, 387-391 (2002). doi: 10.1038/nature00935

Hillenmeyer ME, Fung E, Wildenhain J, Pierce SE, Hoon S, Lee W, Proctor M, St Onge RP, Tyers M, Koller D, Altman RB, Davis RW, Nislow C, Giaever G. The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science. 2008 Apr 18;320(5874):362-5. doi: 10.1126/science.1150021

Fitness data in E. coli (using 1000-well plates and cameras, not pools and microarrays):
Nichols RJ, Sen S, Choo YJ, Beltrao P, Zietek M, Chaba R, Lee S, Kazmierczak KM, Lee KJ, Wong A, Shales M, Lovett S, Winkler ME, Krogan NJ, Typas A, Gross CA. Phenotypic landscape of a bacterial cell. Cell. 2011 Jan 7;144(1):143-56. doi: 10.1016/j.cell.2010.11.052

Prediction submission format: The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

Download MR-1 submission template
Download MR-1 submission validation script

In the submitted file, each row should include the following columns:

  1. Gene/LocusID - The gene in question; please keep the genes in the order provided in the template file
  2. 300 - Prediction of the fitness value (log2 ratio) for condition 300, which is defined in Descriptions of 8 challenge experiments (the conditions)
  3. 302 - Prediction of the fitness value (log2 ratio) for condition 302, which is defined in Descriptions of 8 challenge experiments (the conditions)

    … for all 8 conditions as defined in Descriptions of 8 challenge experiments (the conditions).

In the template file, some cells in columns 2-9 have pre-filled numbers. These are derived from the revealed experiments as defined in Fitness data for half the genes in half the challenge experiments. The rest of the cells in columns 2-9 are marked with an "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for a gene, leave the sign "*" in these cells. Please make sure you follow the submission guidelines strictly.
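
The fill-in rule above can be illustrated with a short sketch. This is not the official validation script, which should still be run before submitting. The function names, the three-decimal number format, and the toy template (two condition columns instead of the real eight, with invented locus identifiers) are assumptions made for the example.

```python
def fill_template(rows, predictions):
    """Replace "*" cells with predictions; leave pre-filled and unpredicted cells alone.

    rows: list of lists, with the header row first.
    predictions: dict mapping (gene, condition) -> predicted fitness.
    """
    header = rows[0]
    filled = [list(header)]
    for row in rows[1:]:
        gene = row[0]
        new_row = [gene]
        for condition, cell in zip(header[1:], row[1:]):
            if cell == "*" and (gene, condition) in predictions:
                new_row.append(f"{predictions[(gene, condition)]:.3f}")
            else:
                new_row.append(cell)  # pre-filled number, or "*" kept as is
        filled.append(new_row)
    return filled


def check_rows(rows, n_columns):
    """Return a list of problems: wrong column counts, empty or non-numeric cells."""
    problems = []
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != n_columns:
            problems.append(f"line {line_no}: expected {n_columns} columns")
        for cell in row[1:]:
            if cell == "*":
                continue  # "*" is allowed for genes without a prediction
            try:
                float(cell)
            except ValueError:
                problems.append(f"line {line_no}: bad cell {cell!r}")
    return problems


# Toy template with two condition columns (the real template has eight).
template = [
    ["Gene/LocusID", "300", "302"],
    ["200001", "-0.12", "*"],
    ["200002", "*", "*"],
]
filled = fill_template(template, {("200001", "302"): 0.5})
print(filled)
print(check_rows(filled, n_columns=3))
```

Writing the result back out as tab-delimited text can be done with `csv.writer(handle, delimiter="\t")`.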

In addition, your submission should include a detailed description of the method used to make the predictions (similar to the style of the Methods section in a scientific article). This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by following the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Dataset provided by
Adam M. Deutschbauer, Morgan N. Price, Kelly Wetmore, Wenjun Shao, Jason Baumohl, and Adam P. Arkin from UC Berkeley, and Michelle Nyguyen, Raquel Tamse, and Ronald W. Davis from Stanford University.