
N-acetyl-glucosaminidase (NAGLU): predict the effect of naturally occurring missense mutations on cellular enzymatic activity

Challenge: NAGLU

Dataset description: public

Variant data: public

Last updated: 7 April 2016

This challenge closed at 9:00 PM PST (Pacific Standard Time) on 8 December 2015.

Download answer key, predictions, and assessment: The answer key, predictions, and assessment files are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the files.

Presentations from the CAGI 4 conference: Presentations are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the files.


NAGLU is a lysosomal glycohydrolase. Deficiency of NAGLU causes the rare disorder Mucopolysaccharidosis IIIB, also known as Sanfilippo B disease. Naturally occurring NAGLU mutants have been assayed for enzymatic activity in transfected cell lysates. The challenge is to predict the fractional activity of each mutant protein compared to the wild-type enzyme.


N-acetyl-glucosaminidase (NAGLU, NP_000254.2) is a lysosomal enzyme that hydrolyzes N-acetyl-D-glucosamine from the nonreducing end of heparan sulfate (HS). The human protein is a homotrimer with 720 amino acids in each subunit (not including the 23-amino-acid signal peptide) (1-3). The predicted molecular weight of each subunit is 80,345 daltons. Each subunit has seven potential glycosylation sites. NAGLU is proteolytically processed in the lysosome to a mature form, but the precursor protein is enzymatically active. NAGLU is a member of the family 89 glycohydrolases (Carbohydrate Active Enzymes database (4)), for which some structural information exists, including a recent patent that describes a 2.9 Å resolution structure of NAGLU (5,6). Coordinates of the human protein structures are not in the PDB but are available on the USPTO web site.

Deficiency of NAGLU causes Mucopolysaccharidosis IIIB (MPS IIIB), or Sanfilippo B disease (7,8) (OMIM #252920), an autosomal recessive lysosomal storage disorder in which lysosomal HS accumulation causes neurodegenerative disease. The clinical presentation includes intellectual disability that progresses to dementia, behavioral disturbances, and death in the second or third decade (reviewed in (9)). BioMarin is currently developing an enzyme replacement therapy to treat MPS IIIB patients.

MPS IIIB is an orphan indication whose birth incidence varies substantially across European populations, with higher incidences found in southern European countries: 0.08/100,000 in France, 0.21/100,000 in the UK, 0.36/100,000 in Germany, 0.42/100,000 in the Netherlands, 0.78/100,000 in Greece, 2.6/100,000 in Germans of Turkish descent, and 0.78/100,000 in Portugal (10-13). One of the issues in working with such rare disorders is the difficulty of accurately assessing disease incidence. Given the large and ever-expanding amount of publicly available whole-exome sequencing data, it should be possible to extract incidence information from allele frequencies in such datasets, provided one knows which mutations are associated with the disease phenotype. There are 153 NAGLU mutations reported by HGMD (14) to be associated with MPS IIIB, and 90 of these are missense mutations. The ExAC dataset (release 0.3), which comprises ~60,000 individual sequenced exomes, contains 189 missense mutations in NAGLU, of which 24 are known to be disease associated. Thus, most of the known disease-associated alleles are absent from the ExAC dataset, and the remaining 165 missense mutations in ExAC have an unknown contribution to disease incidence.

This illustrates the fundamental problem with efforts to extract incidence information from allele frequency data: given a rare disease caused by a large number of ultra-rare mutations, there will be a large number of variants of unknown significance whose contribution to disease is difficult to assess.
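
As a rough illustration of how incidence could be estimated from allele frequencies, the sketch below assumes Hardy-Weinberg equilibrium and treats every counted allele as a fully penetrant recessive disease allele. Under those assumptions the expected birth incidence is the square of the summed allele frequency. The function name and the example allele counts are hypothetical and are not taken from ExAC.

```python
def recessive_birth_incidence(allele_counts, total_alleles):
    """Expected incidence of an autosomal recessive disorder.

    Assumes Hardy-Weinberg equilibrium: if q is the combined frequency of
    all disease alleles, affected individuals occur with frequency q**2.
    """
    q = sum(allele_counts) / total_alleles
    return q ** 2


# Hypothetical example: three disease alleles seen 12, 5, and 3 times
# among 120,000 sequenced chromosomes (about 60,000 individuals).
incidence = recessive_birth_incidence([12, 5, 3], 120_000)
print(f"{incidence * 100_000:.4f} per 100,000 births")
```

The estimate depends directly on which alleles are counted, so every variant of unknown significance that is wrongly included or excluded changes the result.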

BioMarin is functionally assessing the enzymatic activity of each of the 165 novel missense mutations in the ExAC dataset. Plasmids containing cDNAs encoding each of the mutant proteins are being transfected into HEK293 cells. After 72 hours, cells are lysed, and NAGLU activity in the lysate is assessed using the fluorogenic substrate 4-methylumbelliferyl N-acetyl-α-D-glucosaminide. Activity is expressed in pmol/min/µg protein. The background activity arising from endogenous NAGLU is subtracted using activity levels obtained from a mock transfection with empty vector. Background-subtracted enzyme activity for each mutant is normalized to the background-subtracted activity in a lysate from cells transfected with the wild-type cDNA and reported as percent wild-type NAGLU activity. Each mutant is being assayed in at least three independent transfection experiments; the results from these determinations will be averaged, and the standard deviation will be calculated.
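
The normalization described above can be sketched as follows. This is a minimal illustration, not the assay's actual analysis code: the function names and example readings are hypothetical, each replicate is assumed to carry its own wild-type and mock measurements, and the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def percent_wild_type(mutant, wild_type, mock):
    """Background-subtracted mutant activity as a percentage of wild-type."""
    return 100.0 * (mutant - mock) / (wild_type - mock)


def summarize(replicates):
    """Average and sample standard deviation over independent transfections.

    Each replicate is a (mutant, wild_type, mock) tuple of raw activities
    in the same units (e.g. pmol/min/ug protein).
    """
    values = [percent_wild_type(m, wt, mock) for m, wt, mock in replicates]
    return mean(values), stdev(values)


# Hypothetical raw activities from three transfection experiments.
average, sd = summarize([(6.0, 11.0, 1.0), (5.5, 10.0, 1.0), (7.0, 12.0, 1.2)])
print(f"{average:.1f}% of wild-type (SD {sd:.1f})")
```

Dividing the percentage by 100 gives a value on the 0 to 1 scale used for predictions.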

Prediction challenge

Participants are asked to submit predictions on the effect of the variants on NAGLU enzymatic activity. The submitted prediction should be a numeric value ranging from 0 (no activity) to 1 (wild-type level of activity) or >1 if the predicted activity is greater than wild-type activity (e.g. 0.7 means 70% of wild-type and 1.3 means 130% of wild-type activity). Each predicted activity must include a standard deviation. Optionally, a comment on the basis of the prediction may be given. The predictions will be assessed against the numeric values actually measured for each mutation in the enzyme assay.

Download dataset: 4-NAGLU_dataset.txt (4.2 KB)

Download submission template: This submission template file is available only to registered users. Please log in to access the file.

Download submission validation script: This submission validation script is available only to registered users. Please log in to access the file.

Prediction submission format 

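The prediction submission is a tab-delimited text file. Organizers provide a template file, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions. In the submitted file, each row includes the following columns:

1. Mutant identifier (supplied in the template)
2. Predicted enzymatic activity relative to wild-type (0 = no activity, 1 = wild-type level)
3. Standard deviation of the prediction
4. Comment on the basis of the prediction (optional)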

In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must enter a prediction and standard deviation for every mutant; if you are not confident in a prediction for a mutant, enter a large standard deviation for the prediction. Optionally, enter a brief comment on the basis of the prediction; otherwise, leave the "*" in the comment cell. Please make sure you follow the submission guidelines strictly.
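
The rules above can be checked with a few lines of code before running the official validation script. The sketch below is not that script: `check_submission` is a hypothetical helper, and it assumes four tab-separated columns per row (mutant, prediction, standard deviation, comment) with no header row. The example prediction values are made up.

```python
import math


def check_submission(text):
    """Return a list of format problems found in tab-delimited submission text."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        cells = line.split("\t")
        if len(cells) != 4:
            problems.append(f"line {line_no}: expected 4 columns, found {len(cells)}")
            continue
        if any(cell.strip() == "" for cell in cells):
            problems.append(f"line {line_no}: empty cell")
            continue
        # Columns 2 and 3 must be filled in; a leftover "*" is not a number.
        for name, cell in (("prediction", cells[1]), ("standard deviation", cells[2])):
            try:
                value = float(cell)
            except ValueError:
                problems.append(f"line {line_no}: {name} {cell!r} is not a number")
                continue
            if not math.isfinite(value) or value < 0:
                problems.append(f"line {line_no}: {name} must be a finite value >= 0")
    return problems


# Hypothetical rows: the second still has the template "*" as its prediction.
example = "R737C\t0.2\t0.1\t*\nR737C\t*\t0.1\t*\n"
for problem in check_submission(example):
    print(problem)
```

A "*" is accepted only in the comment column, matching the instruction to leave the placeholder there when no comment is given.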

In addition, your submission should include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.


Data provided by 

Jonathan H. LeBowitz, Wyatt T. Clark, G. Karen Yu 

BioMarin Pharmaceutical, Inc. 105 Digital Drive, Novato CA 94949

Dataset citation 

Clark WT, et al. Utilizing ExAC to assess the hidden contribution of variants of unknown significance to Sanfilippo Type B incidence. PLoS One (2018) 13(7):e0200008.


3 Aug 2015 (v01): initial release 

4 Aug 2015 (v02): revised description of enzyme assay 

5 Aug 2015 (v03): repaired link to dataset file 

4 Sep 2015 (v04): challenge close date added 

28 Oct 2015 (v05): submission instructions and template updated; validation script provided 

7 Nov 2015 (v06): submission deadline extended

12 Nov 2015 (v07): improved validation script provided 

23 Dec 2015 (v08): answer key provided 

25 Jan 2016 (v09): updated answer key provided with corrected value for R737C 

18 Mar 2016 (v10): predictions provided 

7 Apr 2016 (v11): conference presentations provided

12 June 2023: dataset citation added