
Predict effects of missense variants and their A222V dependence for methylenetetrahydrofolate reductase

Challenge: MTHFR

Variant data: public

Last updated: 01 July 2021

This challenge is closed.

How to participate in CAGI6?

Download data & submit predictions on Synapse

Make sure you understand the Data Use Agreement and Anonymity Policy.


Methylenetetrahydrofolate reductase (MTHFR) catalyzes the production of 5-methyltetrahydrofolate, which is needed for conversion of homocysteine to methionine. Humans with variants affecting MTHFR function present with a wide range of phenotypes, including homocystinuria, homocysteinemia, developmental delay, severe mental retardation, psychiatric disturbances, and late-onset neurodegenerative disorders. A further complication to interpretation of variants in this gene is a common variant, Ala222Val, carried by a large fraction of the human population. A large library of MTHFR missense variants was assessed with respect to their effects on protein function using a high-throughput yeast complementation assay. The challenge is to predict the functional effects of these variants in two different settings: (1) for the wildtype protein, and (2) for the protein with the common Ala222Val variant.


MTHFR is a 75 kDa (656 aa) protein involved in the tetrahydrofolate interconversion pathway and is the rate-limiting enzyme in the one-carbon (methyl) cycle. In humans, the Ala222Val polymorphism (global allele frequency ~30%), which encodes a thermolabile form of MTHFR with decreased enzyme activity, has been identified as a risk factor in a broad range of multifactorial disorders; see Liew and Gupta (2015) for a review. The reduced activity of the Ala222Val variant results from increased dissociation of the enzyme’s co-factor flavin adenine dinucleotide (FAD). This dissociation is linked to changes in the protein’s quaternary structure (Yamada et al., 2001). Ala222Val is associated with a mild increase of plasma homocysteine levels and, in homozygous pregnant women, with an increased risk of neural tube birth defects (Blom & Smulders, 2011). Other reported associations are controversial (Clarke et al., 2012; van Meurs et al., 2013; Smulders & Blom, 2011). However, around one third of rare or private variants in MTHFR occur in cis with Ala222Val (1000 Genomes Project Consortium et al., 2010; Froese et al., 2016), so Ala222Val may modulate the effects of other MTHFR variants and complicate MTHFR variant interpretation.

A team in Fritz Roth’s Lab at the Donnelly Centre (University of Toronto) and Lunenfeld-Tanenbaum Research Institute (Sinai Health Systems), led by Jochen Weile and Song Sun, has assessed a large library of MTHFR variants in both Ala222Val and wildtype backgrounds, using a high-throughput yeast complementation assay. This assay reveals the overall impact of each variant on the ability of the protein to function in the cell.


A diverse library of MTHFR missense variants was generated by a random codon replacement method called POPCode (Weile et al., 2017). The Roth lab implemented a previously described and validated yeast-based functional complementation assay (Marini et al., 2009) that is amenable to variant effect mapping of MTHFR via two steps: (1) confirming a complementation relationship, in which MTHFR can rescue the phenotypic defect of a knockout of the orthologous essential yeast gene MET13, with the yeast enzyme FOL3 additionally deleted to enable external control of intracellular folate, and (2) assessing the loss of rescue for a test set of likely damaging and likely neutral variants.

MET13 (UniProtKB: P53128) is an essential gene, and in the absence of methionine supplementation, met13Δ mutants do not grow. The pooled library of MTHFR variants was transformed into the S. cerevisiae met13Δ fol3Δ strain. Assays were performed in duplicate. Two samples were taken from the pooled transformants as pre-selection technical replicates. Two further aliquots were used to start parallel cultures for the two post-selection replicates, which were assayed by growing to saturation on minimal media lacking methionine. In parallel, the same selection was performed on the S. cerevisiae met13Δ fol3Δ strain expressing the wildtype allele of MTHFR and two samples were taken as wildtype control replicates. Plasmid DNA was extracted from the six samples followed by TileSeq, a method based on the amplification and sequencing of small tiles across the gene that are short enough to allow paired-end sequencing to read both strands on each cluster on an Illumina flowcell. When reads from both strands agree on the presence of a variant, it is counted.
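The strand-agreement rule described above can be illustrated with a minimal sketch. It is not the TileSeq software: the function name, the input layout (one set of variant calls per strand per cluster), and the variant labels are all hypothetical and chosen only to show the counting logic.

```python
from collections import Counter

def count_agreed_variants(read_pairs):
    """Count a variant once per cluster, only if both strands call it.

    read_pairs: iterable of (forward_calls, reverse_calls), where each
    element is a set of variant labels called from one strand.
    (Hypothetical input layout, for illustration only.)
    """
    counts = Counter()
    for forward_calls, reverse_calls in read_pairs:
        # The set intersection keeps only calls present on both strands.
        for variant in set(forward_calls) & set(reverse_calls):
            counts[variant] += 1
    return counts

# Made-up labels; these are not real MTHFR variants.
pairs = [
    ({"variant_a"}, {"variant_a"}),               # both strands agree: counted
    ({"variant_a"}, set()),                       # one strand only: ignored
    ({"variant_a", "variant_b"}, {"variant_b"}),  # only variant_b agrees
]
counts = count_agreed_variants(pairs)
print(dict(counts))
```

Requiring agreement between the two strands means that a sequencing error on a single read does not produce a variant count.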

These maps have been validated by their ability to separate pathogenic from non-pathogenic variants (Weile et al., 2021) with performance exceeding that of all of the several computational methods that were evaluated. The maps were generated using essentially the methods described by Weile et al. (2017) and Sun et al. (2020). Briefly, read counts in the pre-selection, post-selection and wildtype-control conditions for each variant were normalized to sequencing depth and then used to calculate allele frequency enrichment. First, the wildtype-control counts were subtracted from the pre- and post-selection counts, as they are assumed to represent position-dependent sequencing or PCR errors, which would otherwise artificially shrink the ratio of post- to pre-selection counts towards 1. Then, the log ratio between the post- and pre-selection counts was calculated. Finally, the log ratio distributions of synonymous and nonsense variants (which, for simplicity, are assumed to emulate wildtype- and null-like behavior) were used to rescale all other variant log ratios, such that 1 represents full function and 0 represents complete loss of function. The resulting quantities are referred to as fitness scores below. The two replicates for each measurement were used to estimate measurement errors, and these were regularized using an established procedure (Baldi & Long, 2001).
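The scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: the use of base-10 logarithms, the use of medians to summarize the synonymous and nonsense distributions, the handling of non-positive frequencies, and all function and parameter names are assumptions made here. Error estimation and regularization are omitted.

```python
import math
from statistics import median

def log_ratio(pre, post, wt, pre_depth, post_depth, wt_depth):
    """Log ratio of post- to pre-selection frequency for one variant.

    Counts are first normalized to sequencing depth, then the
    wildtype-control frequency is subtracted from both conditions.
    Returns None if either corrected frequency is not positive
    (assumed handling; the source does not specify this case).
    """
    wt_freq = wt / wt_depth
    pre_freq = pre / pre_depth - wt_freq
    post_freq = post / post_depth - wt_freq
    if pre_freq <= 0 or post_freq <= 0:
        return None
    return math.log10(post_freq / pre_freq)  # log base is an assumption

def fitness_score(lr, synonymous_lrs, nonsense_lrs):
    """Rescale a log ratio so nonsense-like = 0 and synonymous-like = 1.

    Medians are an assumed summary of the two reference distributions.
    """
    syn = median(synonymous_lrs)
    non = median(nonsense_lrs)
    return (lr - non) / (syn - non)

# Made-up numbers, for illustration only.
lr = log_ratio(pre=110, post=1010, wt=10,
               pre_depth=1_000_000, post_depth=1_000_000, wt_depth=1_000_000)
score = fitness_score(-1.0, [0.0, 0.1, -0.1], [-2.0, -1.9, -2.1])
print(lr, score)
```

Because the rescaling is linear, a variant whose log ratio lies above the synonymous reference receives a score greater than 1, which is consistent with the challenge allowing scores above 1.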

Prediction challenge

Participants are asked to submit predictions of the fitness score of each missense variant in the competitive growth assay, on a log scale, in two prediction modes: (1) wildtype and (2) in the presence of Ala222Val. Each submitted prediction, for each variant in each of the two modes, should be a score greater than or equal to 0: 0 = no growth at the restrictive temperature, 1 = wildtype-like growth fitness, >1 = improved growth fitness. The total number of variants is 17,390. Please note that the experimental scores are a measure of fitness in a competitive growth assay and have not been calibrated to correspond to percent of wildtype protein function. To help participants calibrate and scale their predictions, we provide the overall experimental distribution of fitness scores. Predictors should also bear in mind that this experiment assays the effect of human protein variants in a yeast system. Although human MTHFR (UniProtKB: P42898) is very similar to its yeast counterpart, predictors may wish to familiarize themselves with the literature regarding minor biochemical differences.

The predictions will be assessed against the numeric values calculated for each mutant clone in the competitive growth assay. Each predicted value must include a standard error and it is anticipated that these will be included in the assessment. A brief comment for each prediction may also be given.

Prediction submission format 

The prediction submissions (one for each of the wildtype and Ala222Val variant maps) are tab-delimited text files. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions.

Each data row in the submitted files must include the following columns:

In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. For a given subset, you must submit predictions and standard deviations for all or none of the variants; if you are not confident in a prediction for a variant, enter an appropriately large standard deviation for the prediction. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in those cells. Please make sure you follow the submission guidelines strictly. 
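As an informal pre-check before running the official validation script, the row-level rules above could be tested along these lines. This sketch does not replace the provided script. It assumes four tab-delimited columns in the order variant, predicted score, standard deviation, comment; that order and the function name `check_row` are assumptions, since only the "*" placeholders in columns 2-4 are described here. It checks one data row at a time and does not enforce the all-or-none rule for a subset.

```python
import math

def check_row(line):
    """Return a list of problems found in one tab-delimited data row.

    Assumed layout (hypothetical): variant, score, sd, comment.
    A "*" is accepted only in the comment column.
    """
    cells = line.rstrip("\n").split("\t")
    if len(cells) != 4:
        return [f"expected 4 columns, found {len(cells)}"]
    problems = []
    if any(cell.strip() == "" for cell in cells):
        problems.append("empty cell")
    _variant, score, sd, _comment = cells
    for name, value in (("score", score), ("sd", sd)):
        if value == "*":
            problems.append(f"{name} still contains the '*' placeholder")
            continue
        try:
            number = float(value)
        except ValueError:
            problems.append(f"{name} is not numeric: {value!r}")
            continue
        # Scores must be >= 0; a standard deviation cannot be negative.
        if not math.isfinite(number) or number < 0:
            problems.append(f"{name} must be a finite number >= 0")
    return problems

print(check_row("variant_a\t0.8\t0.1\t*"))   # comment left as "*"
print(check_row("variant_a\t*\t0.1\tnote"))  # score not filled in
```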

In addition, your submission should include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

File naming

CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; the last submission before the deadline will be evaluated for each model.

Use the following format for your submissions: <teamname>_<target>_model_(1|2|3|4|5|6).(tsv|txt)

target = cataWT (for catalytic region wildtype), cataAV (for catalytic region A222V), reguWT (for regulatory region wildtype) or reguAV (for regulatory region A222V).

To include a description of your method, use the following filename: <teamname>_desc.*

Examples: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3 for the catalytic region of MTHFR with the A222V variant, your filename should be bestincagi_cataAV_model_3.tsv. If your prediction concerns the wildtype molecule, the filename should be bestincagi_cataWT_model_3.tsv.
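The naming pattern above can be checked with a regular expression. The sketch below is only an illustration: the names `FILENAME_RE` and `parse_filename` are hypothetical, and it assumes that a team name may contain any characters, which the naming rule does not specify.

```python
import re

# <teamname>_<target>_model_(1|2|3|4|5|6).(tsv|txt)
FILENAME_RE = re.compile(
    r"^(?P<team>.+)_(?P<target>cataWT|cataAV|reguWT|reguAV)"
    r"_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)$"
)

def parse_filename(name):
    """Return the parts of a valid prediction filename, or None."""
    match = FILENAME_RE.match(name)
    return match.groupdict() if match else None

print(parse_filename("bestincagi_cataAV_model_3.tsv"))
print(parse_filename("bestincagi_cataAV_model_7.tsv"))  # model out of range
```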

Download data 

Variants are split for the catalytic and regulatory domains of the protein. Two files are available for each, one for the wildtype and another for the molecule with A222V.

Download dataset:

Download experimental distributions:

Download submission template: 

Download submission validation script: 

Training data 

No training data is provided. The participants may wish to use known variants from ClinVar, gnomAD, HGMD, and UniProtKB to calibrate their models. Variants from early- and late-onset patients are provided by Froese et al. (2016).


Predictions will be assessed by an independent assessor. We anticipate a range of evaluation scenarios traditionally used in CAGI (R-square, correlation, rank correlation between predictions and experimental observations). The independent assessor is expected to emphasize certain performance measures over others as well as to employ other evaluation approaches. We also anticipate separate assessments on single-nucleotide accessible amino acid substitutions and amino acid substitutions requiring more than one nucleotide change to the codon.
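For self-evaluation on calibration data, the three measures named above can be computed with the standard library alone. This is a sketch, not the assessor's procedure: in particular, R-square is taken here to be the coefficient of determination with predictions used directly as estimates, which is one of several possible definitions and is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Made-up scores, for illustration only.
observed = [0.0, 0.5, 1.0, 1.2]
predicted = [0.1, 0.4, 0.9, 1.5]
print(pearson(observed, predicted), spearman(observed, predicted),
      r_squared(observed, predicted))
```

Rank correlation depends only on the ordering of predictions, whereas R-square under this definition also penalizes predictions that are on the wrong scale.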


Dataset citation 

Weile J, et al. Shifting landscapes of human MTHFR missense variant effects. Am J Hum Genet (2021) 108(7):1283-1300.


Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics (2001) 17(6):509-519.

Liew SC, Gupta ED. Methylenetetrahydrofolate reductase (MTHFR) C677T polymorphism: epidemiology, metabolism and the associated diseases. Eur J Med Genet (2015) 58(1):1-10.

Sun S, et al. An extended set of yeast-based functional assays accurately identifies human disease mutations. Genome Res (2016) 26(5):670-680.

Weile J, et al. A framework for exhaustively mapping functional missense variants. Mol Syst Biol (2017) 13(12):957.

Yamada K, et al. Effects of common polymorphisms on the properties of recombinant human methylenetetrahydrofolate reductase. Proc Natl Acad Sci U S A (2001) 98(26):14853-14858.

Revision history 

03 May 2021: challenge released, challenge opens

20 May 2021: close date changed; submission template and validation script added, file format instructions added

08 June 2021: expected scores modified to allow scores > 1. Validation script modified.

15 June 2021: deadline extended to June 24, 2021

22 June 2021: deadline soft-extended to June 30, 2021 (registered participants notified by email)

30 June 2021: challenge closed

01 July 2021: challenge data published by Weile et al., Am J Hum Genet (2021), and made publicly available