Predict missense variant effects on hydroxymethylbilane synthase as measured by a yeast complementation assay
Variant data: public
Last updated: 12 October 2021
This challenge is closed.
Make sure you understand our Data Use Agreement and Anonymity Policy
Hydroxymethylbilane synthase (HMBS), also known as porphobilinogen deaminase (PBGD) or uroporphyrinogen I synthase, is an enzyme involved in heme production. In humans, variants that affect HMBS function result in acute intermittent porphyria (AIP), an autosomal dominant genetic disorder caused by a build-up of porphobilinogen in the cytoplasm. A large library of HMBS missense variants was assessed with respect to their effects on protein function using a high-throughput yeast complementation assay. The challenge is to predict the functional effects of these variants.
HMBS is a 40-42 kDa (344-361 aa) protein involved in the third step of heme biosynthesis. There are two isoforms, one ubiquitous in all tissues, and one restricted to erythrocytes. The ubiquitous isoform is generated by alternative splicing of exon 1 and contains an additional 17 amino acid residues at the N terminus. The HMBS structure comprises three domains and a single catalytic site, with interdomain flexibility contributing to elongation of the polypyrrole product (Louie et al., 1992). Enzyme deficiency caused by mutations in the HMBS gene in combination with environmental factors can trigger acute intermittent porphyria (AIP), a condition characterized by tachycardia, arrhythmias and hypertension, seizures, and damage to nerves and muscles (peripheral neuropathy) which can lead to paralysis. While the exact underlying mechanisms involved in the development of symptomatic AIP are unknown, the measurement of low HMBS activity in erythrocytes facilitates detection of AIP during latent periods and acute episodes. However, the detection of normal activity does not exclude the ‘non-erythroid variant’ of AIP (Linenberger & Fertrin, 2020). Mostly for this reason, molecular genetic diagnosis is the most sensitive and preferred method for the diagnosis of classical and variant AIP. Moreover, it permits identification of latent carriers within a single family, allowing these individuals to be aware of factors that may precipitate acute attacks.
A team in Fritz Roth’s Lab at the Donnelly Centre (University of Toronto) and Lunenfeld-Tanenbaum Research Institute (Sinai Health Systems), led by Warren van Loggerenberg, has assessed a large library of HMBS variants using a high-throughput yeast complementation assay. This assay reveals the overall impact of each variant on the ability of the protein to function in the cell.
A diverse library of plasmids expressing different human HMBS (UniProtKB: P08397) variants was generated by a random codon replacement method called POPCode (Weile et al., 2017). The Roth lab had developed a yeast-based functional complementation assay that is amenable to variant effect mapping of HMBS via two steps: (1) confirming a complementation relationship, in which HMBS can rescue the phenotypic defect of a loss-of-function mutation in the orthologous essential yeast gene HEM3; and (2) assessing the loss of rescue for a test set of likely damaging and likely neutral variants. The yeast-based functional complementation assay was validated for the human HMBS gene by measuring the impact of four variants of which two (50% recall) were detected at a stringency yielding 100% precision (all non-pathogenic variants complemented), thus offering performance on par with previous human disease gene complementation assays (Sun et al., 2016). HEM3 catalyzes head-to-tail condensation of four units of porphobilinogen (PBG) to form the linear tetrapyrrole hydroxymethylbilane (HMB) (Bogorad, 1958; Anderson & Desnick, 1980). The holoenzyme is formed by the apoprotein spontaneously assembling the unique cofactor dipyrromethane (DPM) from two molecules of PBG. DPM initiates the polymerization reaction, and release of the unstable tetrapyrrole product by hydrolysis restores the holoenzyme with covalently bound DPM, which continues to act as a primer for HMB formation (Jordan et al., 1988). HEM3 (UniProtKB: P28789) is an essential gene, and at restrictive temperature, HEM3 temperature-sensitive mutants do not grow. Pooled libraries of ubiquitous and erythroid-specific HMBS variants were transformed into the temperature-sensitive S. cerevisiae HEM3 strain. For each isoform, two samples were taken from the pooled transformants as pre-selection technical replicates. 
Two further aliquots were used to start parallel cultures which were grown to saturation at the selective temperature of 36°C, from which two post-selection technical replicate samples were taken. Meanwhile, the same selection was performed on the temperature-sensitive S. cerevisiae HEM3 strain expressing either wildtype HMBS isoform, and two samples were taken as wildtype control replicates. Plasmid DNA was extracted from the six samples followed by TileSEQ, a sequencing method based on the amplification of small tiles across the gene that are short enough to allow paired-end sequencing to read both strands on each cluster on an Illumina flowcell. When reads from both strands agree on the presence of a variant, it is counted.
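The strand-agreement rule described above can be illustrated with a minimal sketch. It assumes a hypothetical input format in which each sequenced cluster has already been reduced to two sets of variant calls, one per read; the function name and the variant labels are placeholders, and this is not the TileSEQ pipeline itself.

```python
from collections import Counter

def count_concordant_variants(read_pairs):
    """Count variants that are called on both reads of a pair.

    read_pairs: iterable of (forward_calls, reverse_calls), where each
    element is a collection of variant labels called on that read
    (hypothetical input format, assumed for illustration).
    """
    counts = Counter()
    for forward_calls, reverse_calls in read_pairs:
        # A variant is counted only when both strands of the cluster agree.
        for variant in set(forward_calls) & set(reverse_calls):
            counts[variant] += 1
    return counts

# Placeholder variant labels, for illustration only.
pairs = [
    ({"varA"}, {"varA"}),   # both strands agree: counted
    ({"varA"}, set()),      # seen on one strand only: ignored
    ({"varB"}, {"varB"}),   # both strands agree: counted
]
counts = count_concordant_variants(pairs)
```

Requiring agreement between the two reads means that an error introduced on only one strand does not contribute to the variant count.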
The yeast-based functional assays were established and validated, and maps for both HMBS isoforms have been validated by their ability to separate pathogenic from non-pathogenic variants (van Loggerenberg et al., in preparation). The methods are described in Weile et al. (2017). Briefly, read counts in the pre-selection, post-selection and wildtype-control conditions for each variant were normalized to sequencing depth and then used to calculate allele frequency enrichment. First, the wildtype control counts were subtracted from the pre- and post-selection counts (as they are assumed to represent position-dependent sequencing errors). Then, the log ratio between the post- and pre-selection counts was calculated. Finally, the log ratio distributions of synonymous and nonsense variants (which, for simplicity, are assumed to emulate wildtype- and null-like behavior) were used to rescale all other variant log ratios, such that 1 represents full function and 0 represents complete loss of function. The two replicates for each measurement were used to estimate measurement errors, and these were regularized using an established procedure (Baldi & Long, 2001). For each variant, a weighted average score across the two isoform maps was calculated, together with its error, where the weights corresponded to inverse-squared standard error estimates. While variant effect maps for either HMBS isoform reliably discriminate pathogenic from benign alleles, our weighted average map achieves the best performance. The resulting quantities in the combined HMBS map are referred to as fitness scores below.
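The scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: it assumes depth-normalized frequencies as input, a base-10 logarithm, a small floor to avoid taking the log of zero, and the medians of the synonymous and nonsense distributions as the rescaling anchors. All function names are hypothetical.

```python
import math
from statistics import median

def log_ratio(pre, post, wt, floor=1e-9):
    """Log enrichment after subtracting the wildtype-control frequency.

    pre, post, wt are depth-normalized frequencies. The base-10 log and
    the floor value are assumptions made for this sketch.
    """
    pre_corrected = max(pre - wt, floor)
    post_corrected = max(post - wt, floor)
    return math.log10(post_corrected / pre_corrected)

def rescale(lr, synonymous_lrs, nonsense_lrs):
    """Map a log ratio so that nonsense-like is 0 and synonymous-like is 1.

    Using the medians of the two distributions is an assumption.
    """
    syn = median(synonymous_lrs)
    stop = median(nonsense_lrs)
    return (lr - stop) / (syn - stop)

def weighted_average(scores, standard_errors):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    mean = sum(w * s for w, s in zip(weights, scores)) / total
    return mean, math.sqrt(1.0 / total)

# Toy numbers, for illustration only.
score = rescale(-0.5, [0.0, 0.1, -0.1], [-1.0, -1.1, -0.9])
combined, combined_se = weighted_average([1.0, 0.0], [0.1, 0.2])
```

With inverse-squared standard errors as weights, the measurement with the smaller error dominates the combined score, which is why the toy example lands closer to the first value than to the second.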
Participants are asked to predict the fitness score, as measured by competitive growth on a log scale, for each of 6,894 variants. The submitted predictions should be numeric values greater than or equal to 0. A score of 0 corresponds to no growth at the restrictive temperature, 1 to wildtype-like growth fitness, and >1 to improved fitness. Please note: the experimental scores are a measure of fitness in a competitive growth assay and have not been calibrated to correspond to percent of wildtype protein function. Predictors should also bear in mind that this experiment assays the effect of human protein variants in a yeast system. To help participants calibrate their numeric values appropriately, we provide the experimental distribution of numeric growth fitness scores.
The predictions will be assessed against the numeric values calculated for each mutant clone in the competitive growth assay. Each predicted value must include a standard error. Predictions will also be assessed on the standard errors. A brief comment for the prediction may also be given.
Prediction submission format
The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions.
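Each data row in the submitted file must include the following columns: (1) the variant, in HGVS nomenclature; (2) the predicted fitness score; (3) the standard deviation of the prediction; and (4) an optional brief comment.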
In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. For a given subset, you must submit predictions and standard deviations for all or none of the variants; if you are not confident in a prediction for a variant, enter an appropriately large standard deviation for the prediction. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in those cells. Please make sure you follow the submission guidelines strictly.
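A minimal sketch of writing one data row under these rules is shown below. It assumes a four-column layout (variant, score, standard deviation, comment) and a fixed four-decimal number format, neither of which is confirmed here; the function name and the variant label are hypothetical, and the sketch does not replace the provided validation script.

```python
def format_row(variant, score, sd, comment=None):
    """Build one tab-delimited submission row with no empty cells.

    The column order and the four-decimal format are assumptions.
    """
    if score < 0:
        raise ValueError("predicted score must be >= 0")
    if sd < 0:
        raise ValueError("standard deviation must be >= 0")
    # Leave the template's "*" in place when no comment is given.
    text = (comment or "").replace("\t", " ").strip()
    comment_cell = text if text else "*"
    return "\t".join([variant, f"{score:.4f}", f"{sd:.4f}", comment_cell])

# Placeholder variant label, for illustration only.
row = format_row("p.Ala2Val", 0.85, 0.2)
row_with_comment = format_row("p.Ala2Val", 0.85, 0.2, "conservation-based")
```

Replacing tabs inside the comment keeps the number of cells per row fixed, so a free-text comment cannot shift the columns.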
In addition, your submission should include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.
CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; the last submission before the deadline will be evaluated for each model.
Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
To include a description of your method, use the following filename: <teamname>_desc.*
Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
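The naming rule can be checked before upload with a short sketch like the one below. The set of characters allowed in a team name is not specified, so restricting it to letters and digits is an assumption, as is the helper name.

```python
import re

# Team names are assumed to consist of letters and digits only.
MODEL_FILENAME = re.compile(r"^[A-Za-z0-9]+_model_[1-6]\.(tsv|txt)$")

def is_valid_model_filename(name):
    """Return True if name matches <teamname>_model_(1-6).(tsv|txt)."""
    return MODEL_FILENAME.match(name) is not None

ok = is_valid_model_filename("bestincagi_model_3.txt")
bad_model = is_valid_model_filename("bestincagi_model_7.txt")
bad_ext = is_valid_model_filename("bestincagi_model_3.csv")
```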
The variants are provided using the HGVS variant nomenclature.
Download dataset: hmbsvariantlist.txt
Download experimental distribution: hmbsdistribution.xls
Download submission template: hmbsvarianttemplate.tsv
Download submission validation script: hmbsvalidation.py
No training data is provided. The participants may wish to use known variants from ClinVar, gnomAD, HGMD, and UniProtKB to calibrate their models.
Predictions will be assessed by an independent assessor. We anticipate a range of evaluation scenarios including the R-square, correlation, and rank correlation between predictions and experimental observations. The independent assessor is expected to emphasize certain performance measures over others as well as to employ other evaluation approaches. We also anticipate separate assessments on single-nucleotide accessible amino acid substitutions and amino acid substitutions requiring more than one nucleotide change to the codon. Finally, the assessors may use the submissions to match the score distributions of their predictions to the observations.
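The three measures named above can be sketched in plain Python as follows. The assessor's exact definitions are not given, so this sketch makes its own assumptions: R-square is computed as the coefficient of determination (one minus the residual sum of squares over the total sum of squares), and tied values receive average ranks in the rank correlation. The numbers are toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition of R-square)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Toy values, for illustration only.
observed = [0.0, 0.5, 1.0, 1.2]
predicted = [0.1, 0.4, 0.9, 1.0]
rho = spearman(observed, predicted)
r2 = r_squared(observed, predicted)
```

Rank correlation rewards getting the ordering of variants right even when predictions are on a different scale, whereas this form of R-square also penalizes scale and offset errors.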
Anderson P, Desnick R. Purification and properties of uroporphyrinogen I synthase from human erythrocytes. Identification of stable enzyme-substrate intermediates. J Biol Chem (1980) 255(5):1993-1999.
Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics (2001) 17(6):509-519.
Bogorad L. The enzymatic synthesis of porphyrins from porphobilinogen. I. Uroporphyrin I. J Biol Chem (1958) 233(2):501-509.
Jordan PM, et al. Purification, crystallization and properties of porphobilinogen deaminase from a recombinant strain of Escherichia coli K12. Biochem J (1988) 254(2):427-435.
Linenberger M, Fertrin KY. Updates on the diagnosis and management of the most common hereditary porphyrias: AIP and EPP. Hematology Am Soc Hematol Educ Program (2020) 2020(1):400-410.
Louie GV, et al. Structure of porphobilinogen deaminase reveals a flexible multidomain polymerase with a single catalytic site. Nature (1992) 359:33-39.
Pierach CA, et al. Red blood cell porphobilinogen deaminase in the evaluation of acute intermittent porphyria. JAMA (1987) 257(1):60-61.
Sun S, et al. An extended set of yeast-based functional assays accurately identifies human disease mutations. Genome Res (2016) 26(5):670-680.
Weile J, et al. A framework for exhaustively mapping functional missense variants. Mol Syst Biol (2017) 13(12):957.
Dataset provided by
Warren van Loggerenberg, Jochen Weile, Song Sun, and Fritz Roth, University of Toronto
03 May 2021: initial release
01 June 2021: file format added, minor fixes in challenge description
08 June 2021: validation script updated, the scores expected in predictions updated to allow scores >1
14 August 2021: submission deadline extended to September 30
30 September 2021: submission deadline extended to October 11
11 October 2021: challenge closed