Predicting the effect of missense mutations on protein stability in Arylsulfatase A
Challenge: ARSA
Variant data: public
Last updated: 27 June 2025
This challenge is open. The challenge closes on September 15, 2025.
Make sure you understand our Data Use Agreement and Anonymity Policy
Summary
Metachromatic Leukodystrophy (MLD) is an autosomal recessive lysosomal-storage disease caused by mutations in Arylsulfatase A (ARSA) and the resulting toxic accumulation of sulfatide substrate. Genome sequencing has revealed hundreds of protein-altering ARSA missense variants, but the functional effect of most variants remains unknown. ARSA protein stability was measured for a large set of variants using a high-throughput cellular degradation assay. The challenge is to predict the fractional protein stability of each of the 8,867 missense mutant proteins at 48 hours post-expression.
Background
Arylsulfatase A (ARSA, E.C. 3.1.6.8, ENST00000216124.5; also known as cerebroside sulfatase) is a lysosomal enzyme that hydrolyzes galactose-3-sulfate residues in a number of lipids, as well as ascorbate-2-sulfate and many phenol sulfates. ARSA is synthesized as a 507-amino-acid polypeptide that contains an N-terminal signal peptide. Mature arylsulfatase A is a 51.1 kDa protein with 489 amino acids and three N-glycosylation sites (Asn 158, 184, and 350), and it forms an octamer at lysosomal pH (Stütz & Wrodnigg, 2016). The first reported crystal structure of human ARSA (PDB 1AUK) shows a homooctamer composed of a tetramer of dimers (Lukatela et al., 1998). An unusual structural feature is a formylglycine residue in the hydrate form at position 69, which, in concert with an octahedrally coordinated Mg2+, is required for enzymatic activity.
Deficiency of ARSA causes Metachromatic Leukodystrophy (MLD, OMIM #250100), an autosomal recessive lysosomal-storage disorder in which sulfatide buildup in cells, particularly in the brain, spinal cord, and peripheral nerves, leads to progressive demyelination, resulting in a variety of neurological symptoms and ultimately death (Greene et al., 1967). Patients are generally categorized into four subtypes based on age of onset: late-infantile (0 to 2.5 years, most severe form), early-juvenile (2.5 to 7 years), late-juvenile (7 to 16 years), and adult (16 years and older).
Preliminary results suggest that early intervention in pre-symptomatic patients is imperative, underscoring the importance of newborn screening (NBS) for MLD. NBS is complicated by pseudodeficiency variants (Patil & Maegawa, 2013) that decrease ARSA activity to 10-15% of the wild-type range without causing MLD, implying that this level of ARSA activity is sufficient for physiological hydrolysis of sulfatides.
The incidence of MLD is estimated to range between 1:40,000 and 1:160,000 births according to the National Organization for Rare Disorders. The most severe form is infantile/late-infantile onset MLD, characterized by rapid psychomotor regression resulting in ataxia and weakness with areflexia. Some children show only signs of a progressive peripheral neuropathy for several months before central nervous system involvement becomes apparent. Death occurs within a few years after the onset of symptoms. The late-infantile form is genetically characterized by homozygosity or compound heterozygosity for alleles that result in complete or near-complete ablation of ARSA enzymatic activity, leading to rapid accumulation of sulfatides and rapid disease progression.
In the early-juvenile and late-juvenile forms of MLD, disease progression is slower than in the late-infantile form. However, once neurological signs become more evident, decline tends to be rapid, and patients eventually lose all acquired skills. Spasticity becomes prominent, and many patients also develop epilepsy. The end stage of the disease can last several years, with variable duration. Patients with early- or late-juvenile onset mostly carry one allele that allows for the expression of low levels of residual enzyme activity.
In the adult form of MLD, intellectual and behavioral changes, such as memory deficits or emotional instability, are typically the first presenting symptoms. Mild polyneuropathy often emerges at a later stage, and disease progression is generally slower than in the late-infantile, early-juvenile, or late-juvenile forms. Death usually occurs decades after disease onset. Patients with the adult form of MLD often carry one mild mutation, allowing for the expression of small amounts of functional enzyme, which delays the process of sulfatide accumulation and thus postpones symptom onset.
Experiment
Protein Stability Assay Methodology:
ARSA protein stability was assessed using a novel high-throughput cellular degradation assay. The experimental system uses a doxycycline-inducible expression system with ARSA-EGFP fusion proteins in mammalian cells.
Cell System Design:
Construct Design:
The ARSA cDNA construct contains the native N-terminus, including the signal peptide, because ARSA is a lysosomal enzyme that requires proper subcellular localization. The EGFP reporter is fused to the C-terminus of ARSA through a small linker positioned immediately after the native C-terminal amino acid of ARSA.
Assay Principle:
The assay measures protein degradation over time by monitoring the loss of green fluorescence (EGFP) relative to red fluorescence (mCherry internal control). Misfolded proteins are preferentially degraded by cellular quality control mechanisms, leading to decreased green/red fluorescence ratios.
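The ratio described above can be illustrated with a minimal sketch. The exact normalization used by the data provider is not described here, so the choice of dividing by the ratio at time zero, the function name, and the example readings are all assumptions made for illustration only.

```python
def fraction_remaining(green_t, red_t, green_0, red_0):
    """Green/red ratio at time t, expressed as a fraction of the ratio at time 0.

    Assumption: the red (mCherry) signal is a stable internal control, so a
    drop in green/red reflects loss of the ARSA-EGFP fusion protein.
    """
    if red_t <= 0 or red_0 <= 0 or green_0 <= 0:
        raise ValueError("reference fluorescence values must be positive")
    ratio_t = green_t / red_t
    ratio_0 = green_0 / red_0
    return ratio_t / ratio_0


# Hypothetical readings: the green signal halves while the red control is unchanged.
print(fraction_remaining(green_t=500.0, red_t=1000.0, green_0=1000.0, red_0=1000.0))
```

Dividing by the red channel first cancels cell-to-cell differences in expression level, which is the usual reason for carrying an internal control.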
Experimental Workflow:
Data Characteristics:
Clinical Correlation and Newborn Screening Context:
This assay was developed specifically to support clinical decision-making in newborn screening programs for MLD. Current newborn screening achieves no false positives or false negatives in over 500 million tests, but determining disease severity and treatment urgency requires understanding variant pathogenicity.
Clinical Interpretation Guidelines:
Prediction challenge
Participants are asked to submit predictions on the impact of each of the 8,867 missense variants on ARSA protein stability. The submitted protein stability prediction should be a numeric value representing the percentage of protein remaining at 48 hours post-expression. The scale is:
Each predicted protein stability must include a standard deviation that indicates the confidence in the provided prediction score; i.e., low spread suggests confident predictions and high spread suggests lack of confidence.
Optionally, a comment on the basis of the prediction may be given.
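One possible way to obtain both a score and its standard deviation is to run several models (or several seeds of one model) and summarize their outputs per variant. The sketch below assumes such an ensemble exists; the function name and the example scores are hypothetical, and the challenge does not prescribe how the standard deviation should be derived.

```python
import statistics


def summarize_ensemble(scores):
    """Collapse several model outputs for one variant into (mean, sample SD)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical outputs of three models for a single variant.
mean, sd = summarize_ensemble([60.0, 70.0, 80.0])
print(f"{mean:.1f}\t{sd:.1f}")
```

A wide disagreement between ensemble members then translates directly into a large reported standard deviation, which matches the intended meaning of low confidence.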
Submission format
The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions.
Each data row in the submitted file must include the following columns:
In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must enter a prediction and a standard deviation for every variant; if you are not confident in a prediction for a variant, enter a large standard deviation for it. Optionally, enter a brief comment on the basis of the prediction; otherwise, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.
In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information will be submitted as a separate file.
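Filling in the template can be scripted. The following is a minimal sketch, not a replacement for the provided validation script: it assumes the template has one header row, that column 1 holds the variant, and that columns 2, 3, and 4 hold the prediction, standard deviation, and comment respectively. The header names and the variant string in the example are hypothetical.

```python
import csv
import io


def fill_template(template_text, predictions):
    """Replace '*' in columns 2-3 with (score, SD); column 4 keeps its '*'.

    `predictions` maps the variant string in column 1 to a (score, sd) tuple.
    """
    reader = csv.reader(io.StringIO(template_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for i, row in enumerate(reader):
        if i == 0:
            writer.writerow(row)  # assumed header row, copied unchanged
            continue
        variant = row[0]
        score, sd = predictions[variant]  # KeyError if a variant has no prediction
        row[1], row[2] = f"{score:.2f}", f"{sd:.2f}"
        if any(cell.strip() == "" for cell in row):
            raise ValueError(f"empty cell in row for {variant}")
        writer.writerow(row)
    return out.getvalue()


# Hypothetical one-variant template and prediction.
template = "variant\tprediction\tsd\tcomment\np.Gly261Arg\t*\t*\t*\n"
filled = fill_template(template, {"p.Gly261Arg": (42.0, 15.0)})
print(filled, end="")
```

Raising on a missing variant or an empty cell mirrors the rule that every variant needs a prediction and a standard deviation; the official validation script remains the authoritative check.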
File naming
CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; the last submission before the deadline will be evaluated for each model. If you are submitting a single file with all predictions combined, please use the format below.
Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
To include a description of your method, use the following filename: <teamname>_desc.*
Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
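The naming pattern above can be checked before upload with a small regular expression. This is a sketch only: the set of characters allowed in a team name is not specified, so restricting it to letters, digits, and hyphens is an assumption, and `parse_model_filename` is a hypothetical helper name.

```python
import re

# Assumption: team names consist of letters, digits, and hyphens only.
MODEL_FILE = re.compile(
    r"(?P<team>[A-Za-z0-9-]+)_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)"
)


def parse_model_filename(name):
    """Return (team, model_number, extension), or None if the name does not match."""
    m = MODEL_FILE.fullmatch(name)
    if m is None:
        return None
    return m.group("team"), int(m.group("model")), m.group("ext")


print(parse_model_filename("bestincagi_model_3.txt"))
print(parse_model_filename("bestincagi_model_7.txt"))
```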
Sample data
We provide sample data for 348 variants in cagi7arsasample.tsv (see below), which includes protein stability scores from this challenge (first five columns) and enzymatic activity measurements from the previous CAGI6 ARSA challenge (remaining columns, as reported in Supplementary Table S3 of Trinidad et al., 2023). The file contains 349 rows because the missense variant G261R is represented by two different DNA-level SNPs, each with corresponding database information. Enzymatic activity does not always correlate with protein stability: some variants may be stable but catalytically inactive. While this sample file includes additional annotations, participants are only required to submit predictions for the columns specified in the submission format.
Participants may wish to use variants from ClinVar, gnomAD, HGMD, MaveDB, and UniProtKB to train and calibrate their models.
Download data
The variants are provided using the HGVS variant nomenclature for protein sequences.
Download the list of 8,867 missense variants: cagi7arsavariantlist.txt
Download the submission template for these 8,867 missense variants: cagi7arsasubmissiontemplate.tsv
Download the sample data of 348 missense variants: cagi7arsasample.tsv and the corresponding legend cagi7arsasamplelegend.txt
Download submission validation script: cagi7arsavalidation.py
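Many feature pipelines need the reference residue, position, and substituted residue as separate fields. The sketch below parses a protein-level HGVS missense string under the assumption that the variant list uses the three-letter form with a `p.` prefix (for example `p.Gly261Arg`); the actual files may use a different form, so check them before relying on this. `parse_missense` is a hypothetical helper name.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Assumed input form: "p." + three-letter residue + position + three-letter residue.
HGVS_MISSENSE = re.compile(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})")


def parse_missense(hgvs):
    """Split e.g. 'p.Gly261Arg' into ('G', 261, 'R')."""
    m = HGVS_MISSENSE.fullmatch(hgvs)
    if m is None:
        raise ValueError(f"not a three-letter missense string: {hgvs!r}")
    ref, pos, alt = m.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        raise ValueError(f"unknown amino acid code in {hgvs!r}")
    return THREE_TO_ONE[ref], int(pos), THREE_TO_ONE[alt]


print(parse_missense("p.Gly261Arg"))
```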
Assessment
This challenge follows the tradition of CAGI challenges that assess the predictions of biochemical effects for missense variants. The evaluation protocols and metrics will follow those described by The Critical Assessment of Genome Interpretation Consortium (2024). Predictions will be assessed by an independent assessor.
Potential Evaluation Approaches:
We hope to give participants a sense of how predictions might be assessed without restricting their approach to the problem or predetermining their methodological decisions. Hence, the final evaluation metrics will be released only after the challenge ends.
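For a local sanity check against the sample data, participants could compute simple agreement measures between predicted and measured stability. The sketch below uses the Pearson correlation and the root-mean-square error; these are common choices for continuous predictions and are assumptions here, not the official evaluation metrics, which have not been released. The example values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical predicted vs. measured stability values for four variants.
predicted = [10.0, 40.0, 70.0, 95.0]
measured = [15.0, 35.0, 80.0, 90.0]
print(round(pearson(predicted, measured), 3), round(rmse(predicted, measured), 3))
```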
Dataset provided by
Michael H. Gelb, University of Washington
Related challenges
References
Clark WT, et al. Assessment of predicted enzymatic activity of α-N-acetylglucosaminidase variants of unknown significance for CAGI 2016. Hum Mutat (2019) 40(9):1519-1529.
Greene H, et al. Arylsulfatase A in the urine and metachromatic leukodystrophy. J Pediatr (1967) 71(5):709-711.
Lukatela G, et al. Crystal structure of human arylsulfatase A: the aldehyde function and the metal ion at the active site suggest a novel mechanism for sulfate ester hydrolysis. Biochemistry (1998) 37(11):3654-3664.
Patil SA, Maegawa GH. Developing therapeutic approaches for metachromatic leukodystrophy. Drug Des Devel Ther (2013) 7:729-745.
Stütz AE, Wrodnigg TM. Carbohydrate-processing enzymes of the lysosome: diseases caused by misfolded mutants and sugar mimetics as correcting pharmacological chaperones. Adv Carbohydr Chem Biochem (2016) 73:225-302.
The Critical Assessment of Genome Interpretation Consortium. CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol (2024) 25(1):53.
Trinidad M, et al. Predicting disease severity in metachromatic leukodystrophy using protein activity and a patient phenotype matrix. Genome Biol (2023) 24(1):172.
Revision history
25 June 2025: challenge preview posted
27 June 2025: minor edits in the description, challenge open