
Predicting the effect of naturally occurring missense mutations on enzymatic activity in Arylsulfatase A

Challenge: ARSA

Variant data: public

Last updated: 16 November 2022

This challenge is closed.

How to participate in CAGI?

Download data & submit predictions through Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy


Metachromatic Leukodystrophy (MLD) is an autosomal recessive lysosomal-storage disease caused by mutations in Arylsulfatase A (ARSA) that lead to toxic accumulation of sulfatide substrate. Genome sequencing has revealed hundreds of protein-altering ARSA missense variants, but the functional effect of most of them remains unknown. ARSA enzyme activity was measured with a high-throughput cellular assay for a large set of variants of known significance and variants of unknown significance. The challenge is to predict the fractional enzyme activity of each mutant protein compared to the wildtype protein.


Arylsulfatase A (ARSA, E.C., ENST00000216124.5; also known as cerebroside sulfatase) is a lysosomal enzyme that hydrolyzes galactose-3-sulfate residues in a number of lipids, as well as ascorbate-2-sulfate and many phenol sulfates. ARSA is synthesized as a 507 amino acid polypeptide that contains an N-terminal signal peptide. Mature arylsulfatase A is a 51.1 kDa protein with 489 amino acids and three N-glycosylation sites (Asn 158, 184, and 350), and it forms an octamer at lysosomal pH (Stütz & Wrodnigg, 2016). The first reported crystal structure of human ARSA (PDB 1AUK) shows a homooctamer composed of a tetramer of dimers (Lukatela et al., 1998). An unusual structural feature is a formylglycine residue in the hydrate form at position 69, which, in concert with an octahedrally coordinated Mg2+, is required for enzymatic activity.

Deficiency of ARSA causes Metachromatic Leukodystrophy (MLD, OMIM #250100), an autosomal recessive lysosomal-storage disorder in which sulfatide buildup in cells, particularly in the brain, spinal cord and peripheral nerves, leads to progressive demyelination, resulting in a variety of neurological symptoms and ultimately death (Greene et al., 1967). Patients are generally categorized into three subtypes by age of onset: infantile/late-infantile (0 to 2.5 years), juvenile (2.5 to 16 years) and adult (16 years and older); see Gieselmann & Krägeloh-Mann (2010) for a review.

Preliminary results suggest that early intervention in pre-symptomatic patients is imperative, underscoring the importance of newborn screening (NBS) for MLD. NBS is complicated by pseudodeficiency variants (Patil & Maegawa, 2013) that decrease ARSA activity to 10-15% of the wildtype range without causing MLD, implying that this level of ARSA activity is sufficient for physiological hydrolysis of sulfatides.

The incidence of MLD is estimated to range between 1:40,000 and 1:160,000 births according to the National Organization for Rare Disorders. The most severe form is infantile/late-infantile onset MLD, characterized by rapid psychomotor regression resulting in ataxia and weakness with areflexia. Some children show only signs of a progressive peripheral neuropathy for several months before central nervous system involvement becomes apparent. Death occurs within a few years after the onset of symptoms. The late-infantile form is genetically characterized by homozygosity or compound heterozygosity for alleles that result in complete ablation of ARSA enzymatic activity, leading to rapid accumulation of sulfatides and rapid disease progression.

In the juvenile variant, disease progression is slower than in the infantile onset form, but once the neurological signs become more evident, the decline is rapid and patients eventually lose all skills. Spasticity becomes prominent, and many patients also develop epilepsy. The end stage of the disease can last several years, and its duration is variable. Patients suffering from the juvenile type mostly carry one allele that allows for expression of low amounts of residual enzyme activity. 

In the adult variant (onset after 16 years of age), intellectual and behavioral changes, such as memory deficits or emotional instability, are usually the first presenting symptoms. Mild polyneuropathy develops at a later stage, and disease progression is generally slower than in the infantile and juvenile forms. Death occurs within decades after disease onset. In the adult onset type, many patients carry one mild mutation, allowing for the expression of low amounts of functional enzyme, which delays the process of sulfatide accumulation and thereby the onset of the disease.


BioMarin Pharmaceutical is interested in understanding the genotype-phenotype relationship in rare genetic disease. The fundamental challenge with this approach is that there are a large number of variants of uncertain significance (VUS) whose contribution to disease is difficult to assess. Furthermore, as in the case of MLD, where there is a spectrum of disease severity, understanding the genotype-phenotype relationship can become a combinatorics problem. It is not sufficient to simply know whether a variant is pathogenic. One must know its impact on activity and which combinations of mutations lead to each disease phenotype. Such information might be obtained by the careful examination of patient phenotypes and genotypes (Gieselmann et al., 1991), but for uncommon variants, or those of unknown significance, this might not be possible due to insufficient data. Quantitation of the residual enzymatic activity associated with VUS can help provide the missing information.

BioMarin has functionally assessed the enzymatic activity of 277 ARSA missense variants. Site-specific mutagenesis of the ARSA consensus coding sequence (CCDS Database: 14100.2) in a pUC57 plasmid was performed by GENEWIZ LLC (South Plainfield, NJ, USA). ARSA was under the control of a cytomegalovirus (CMV) promoter, and the control coding sequence, BLA, was reverse-orientated and under the control of an elongation factor 1 (EF1)-alpha promoter. Mutations were also made in the same plasmid construct containing the genomic ARSA transcript (ENST00000216124.5). All constructs were verified via DNA sequencing. Transfections were performed in triplicate (three wells) into HEK293T ARSA-knockout cells, using a 96-well Shuttle Nucleofector (Lonza, Basel, Switzerland) with a Microlab STAR liquid-handling robot (Hamilton Robotics, Reno, NV, USA).

BLA activity was determined by measuring hydrolysis of d7-penicillin G (Toronto Research Chemicals, Toronto, ON, Canada) to d7-5R,6R-benzylpenicilloic acid. The BLA assay cocktail consisted of 200 µM d7-penicillin G substrate and 5 µM 5R,6R-benzylpenicilloic acid internal standard in BLA assay buffer (50 mM Tris–HCl, pH 7.5 ± 0.02). Cell lysate supernatant (10 µL) was combined with BLA assay cocktail (30 µL). The plate was sealed, centrifuged at 3,000 g for 1 minute, and agitated at 37°C for 1 hour. The reaction was quenched with 150 µL acetonitrile, and the plate was centrifuged at 3,000 g for 5 minutes. Supernatant (75 µL) was combined with 75 µL water, and the amount of enzyme product (d7-5R,6R-benzylpenicilloic acid) generated during the reaction was quantified using the internal standard, 5R,6R-benzylpenicilloic acid.

ARSA activity assays and activity calculations were performed as previously described (Hong et al., 2020). Normalized ARSA activity in cell lysate was calculated by dividing the ion ratio of ARSA product to ARSA internal standard (blank subtracted) by the ion ratio of BLA product to BLA internal standard (blank subtracted). The results from these determinations were averaged, and the standard deviation calculated. Data from activity assays were used to assign activity-based severity to variants.
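
The normalization described above can be illustrated with a minimal Python sketch. It is not the organizers' pipeline. It assumes that each well yields one ARSA ion ratio (product to internal standard) and one BLA ion ratio, that the blank is itself an ion ratio subtracted from each, and that the three wells of a triplicate are summarized by their mean and sample standard deviation. The function names and all numeric values are hypothetical.

```python
from statistics import mean, stdev


def normalized_activity(arsa_ratio, arsa_blank, bla_ratio, bla_blank):
    """Blank-subtracted ARSA ion ratio divided by blank-subtracted BLA ion ratio.

    Each *_ratio argument is assumed to be product signal / internal-standard
    signal for one well; each *_blank is the same ratio measured in a blank.
    """
    return (arsa_ratio - arsa_blank) / (bla_ratio - bla_blank)


def summarize(wells):
    """Mean and sample standard deviation of normalized activity over replicate wells."""
    values = [normalized_activity(*well) for well in wells]
    return mean(values), stdev(values)


# Hypothetical triplicate: (arsa_ratio, arsa_blank, bla_ratio, bla_blank) per well.
variant_wells = [
    (0.52, 0.02, 1.02, 0.02),
    (0.62, 0.02, 1.02, 0.02),
    (0.42, 0.02, 1.02, 0.02),
]

avg, sd = summarize(variant_wells)
print(f"normalized activity: {avg:.3f} +/- {sd:.3f}")
```

Dividing a variant's mean normalized activity by the wildtype construct's mean measured the same way would give the fractional activity that the challenge asks predictors to estimate. That last step is an assumption here, since the text does not spell it out.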

Prediction challenge

Participants are asked to submit predictions on the effect of the variants on ARSA enzymatic activity. The submitted prediction should be a numeric value ranging from 0 (no activity) to 1 (wildtype level of activity), or >1 if the predicted activity is greater than wildtype activity (e.g., 0.7 means 70% of wildtype and 1.3 means 130% of wildtype activity). Each predicted activity must include a standard deviation that indicates the confidence in the provided prediction score; i.e., a low spread suggests a confident prediction and a high spread suggests a lack of confidence. Optionally, a comment on the basis of the prediction may be given. The predictions will be assessed against the numeric values actually measured for each mutation in the enzyme assay. In previous challenges, it has been observed that predictions often cluster more closely with other predictions than with the experimental values (Clark et al., 2019). Assessment will include metrics that recognize prediction sets that differ substantially from results provided by standard methods such as PolyPhen-2 and SIFT.

Prediction submission format 

The prediction submission is a tab-delimited text file. Organizers provide a template file, which must be used for submission. In addition, a validation script is provided, and predictors must check the correctness of the format before submitting their predictions.

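Each data row in the submitted file must include the following columns:

1. Variant, as listed in the template file
2. Predicted enzymatic activity as a fraction of wildtype activity (0 = no activity, 1 = wildtype level, >1 = above wildtype)
3. Standard deviation of the prediction
4. Comment on the basis of the prediction (optional)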

In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must enter a prediction and a standard deviation for every variant; if you are not confident in a prediction for a variant, enter a large standard deviation for that prediction. Optionally, enter a brief comment on the basis of the prediction; otherwise, leave the "*" in the comment cell. Please make sure you follow the submission guidelines strictly.

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information will be submitted as a separate file.
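
Before running the official validation script, the format rules above can be pre-checked locally. The Python sketch below is not the organizers' script. It assumes a four-column layout (variant, prediction, standard deviation, comment), a single header row, and that predictions and standard deviations must be finite and non-negative. The function name, the header text and the variant strings are hypothetical placeholders.

```python
import math


def check_submission(lines, has_header=True):
    """Return a list of human-readable format errors (empty list if none found)."""
    errors = []
    rows = lines[1:] if has_header else lines
    first_line_number = 2 if has_header else 1
    for n, line in enumerate(rows, start=first_line_number):
        cells = line.rstrip("\n").split("\t")
        if len(cells) != 4:
            errors.append(f"line {n}: expected 4 columns, found {len(cells)}")
            continue
        if any(cell.strip() == "" for cell in cells):
            errors.append(f"line {n}: empty cell")
            continue
        _variant, prediction, sd, _comment = cells  # comment may stay "*"
        for name, value in (("prediction", prediction), ("standard deviation", sd)):
            try:
                number = float(value)  # a leftover "*" fails here
            except ValueError:
                errors.append(f"line {n}: {name} is not numeric: {value!r}")
                continue
            if not math.isfinite(number) or number < 0:
                errors.append(f"line {n}: {name} must be finite and >= 0: {value!r}")
    return errors


# Hypothetical file content; the real template defines the actual header and variants.
example = [
    "variant\tprediction\tsd\tcomment\n",
    "p.Pro426Leu\t0.7\t0.1\t*\n",
    "p.Pro426Leu\t*\t0.1\t*\n",
]
for message in check_submission(example):
    print(message)
```

A check like this only catches obvious formatting slips; the provided validation script remains the authority on what is accepted.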

File naming

CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; for each model, the last submission before the deadline will be evaluated.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

To include a description of your method, use the following filename: <teamname>_desc.*

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
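
The naming rules above translate directly into regular expressions. The following Python sketch checks a filename against them. Which characters are allowed in a team name is not stated, so the patterns accept any non-empty team name; that is an assumption. The function name `classify_filename` is hypothetical.

```python
import re

# <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
MODEL_FILE = re.compile(r"^(?P<team>.+)_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)$")
# <teamname>_desc.*
DESC_FILE = re.compile(r"^(?P<team>.+)_desc\..+$")


def classify_filename(name):
    """Return ("model", team, number), ("description", team, None), or None."""
    match = MODEL_FILE.match(name)
    if match:
        return ("model", match.group("team"), int(match.group("model")))
    match = DESC_FILE.match(name)
    if match:
        return ("description", match.group("team"), None)
    return None


print(classify_filename("bestincagi_model_3.txt"))
print(classify_filename("bestincagi_desc.pdf"))
print(classify_filename("bestincagi_model_7.txt"))
```

The last call returns `None` because only model numbers 1 through 6 are permitted.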

Download data 

The variants are provided using the HGVS variant nomenclature.
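
For predictors who want to map variants onto the protein sequence, a small parser can help. The Python sketch below assumes the variants are written as protein-level HGVS missense descriptions with three-letter amino acid codes (for example `p.Pro426Leu`, optionally prefixed by a reference sequence and a colon). The exact form used in the dataset file should be checked before relying on it, and the function name `parse_missense` is hypothetical.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Optional "REFSEQ:" prefix, then p.Xxx123Yyy, optionally wrapped in parentheses.
MISSENSE = re.compile(r"^(?:[^:]+:)?p\.\(?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)?$")


def parse_missense(hgvs):
    """Return (reference residue, position, variant residue) in one-letter codes."""
    match = MISSENSE.match(hgvs.strip())
    if match is None:
        raise ValueError(f"not a protein-level missense description: {hgvs!r}")
    ref, position, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        raise ValueError(f"unknown amino acid code in {hgvs!r}")
    return THREE_TO_ONE[ref], int(position), THREE_TO_ONE[alt]


print(parse_missense("p.Pro426Leu"))
```

Descriptions that are not simple substitutions between the 20 standard amino acids (for example a stop gain written with `Ter`) raise `ValueError` rather than being silently accepted.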

Download dataset: arsavariantlist.txt

Download submission template: arsasubmissiontemplate.tsv 

Download submission validation script: 

Training data 

No training data is provided. The participants may wish to use known variants from ClinVar, gnomAD, HGMD, and UniProtKB to train and calibrate their models.


This challenge follows the tradition of CAGI challenges that assess the predictions of biochemical effects for missense variants. The evaluation protocols and metrics will follow those described by The Critical Assessment of Genome Interpretation Consortium (2022). Predictions will be assessed by an independent assessor.

Related challenges


Clark WT, et al. Assessment of predicted enzymatic activity of α-N-acetylglucosaminidase variants of unknown significance for CAGI 2016. Hum Mutat (2019) 40(9):1519-1529.

Gieselmann V, et al. Molecular genetics of metachromatic leukodystrophy. Dev Neurosci (1991) 13(4-5):222-227.

Gieselmann V, Krägeloh-Mann I. Metachromatic leukodystrophy--an update. Neuropediatrics (2010) 41(1):1-6.

Greene H, et al. Arylsulfatase A in the urine and metachromatic leukodystrophy. J Pediatr (1967) 71(5):709-711.

Hong X, et al. Leukocyte and dried blood spot arylsulfatase A assay by tandem mass spectrometry. Anal Chem (2020) 92(9):6341-6348.

Lukatela G, et al. Crystal structure of human arylsulfatase A: the aldehyde function and the metal ion at the active site suggest a novel mechanism for sulfate ester hydrolysis. Biochemistry (1998) 37(11):3654-3664.

Patil SA, Maegawa GH. Developing therapeutic approaches for metachromatic leukodystrophy. Drug Des Devel Ther (2013) 7:729-745.

Stütz AE, Wrodnigg TM. Carbohydrate-processing enzymes of the lysosome: diseases caused by misfolded mutants and sugar mimetics as correcting pharmacological chaperones. Adv Carbohydr Chem Biochem (2016) 73:225-302.

The Critical Assessment of Genome Interpretation Consortium. CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. arXiv:2205.05897 (2022)

Dataset provided by 

Wyatt Clark, BioMarin Pharmaceutical, Inc., 105 Digital Drive, Novato, CA 94949

Revision history 

25 July 2022: initial release of the challenge, challenge open

3 September 2022: submission deadline extended from September 9 to October 1

28 September 2022: submission deadline extended from October 1 to November 15.

16 November 2022: the challenge is closed.