Annotate all missense
Challenge: Annotate all missense
Variant data: public
Last updated: 12 October 2021
This challenge is closed.
Make sure you understand our Data Use Agreement and Anonymity Policy
dbNSFP currently describes 81,782,923 possible protein-altering variants in the human genome. The challenge is to predict the functional effect of every such variant. For the vast majority of these missense and nonsense variants, the functional impact is not currently known, but experimental and clinical evidence is accruing rapidly. Rather than drawing upon a single discrete dataset, as is typical with CAGI, predictions will be assessed on an ongoing basis by comparing them with experimental or clinical annotations made available after the prediction submission date. If predictors assent, predictions will also be incorporated into dbNSFP.
Currently, dozens of in silico methods for predicting the deleteriousness of a variant are available (Hu et al., 2019). In many cases, different methods may give opposite predictions for the same variant. dbNSFP is a database of human nonsynonymous single nucleotide variants (nsSNVs) and their functional predictions and annotations (Liu et al., 2011, 2013, 2016, 2020). It compiles 37 functional prediction scores and 9 conservation scores, as well as other related information, including allele frequencies observed in different large datasets, various gene IDs from different databases, functional descriptions of genes, and gene expression and gene interaction information.
A list of all possible nsSNVs based on the human reference sequence was created from dbNSFP v4 (Liu et al., 2020). Predictors are asked to predict the functional effect of each coding SNV. Since the vast majority of these nsSNVs do not have experimental information, this challenge will assess in silico predictions against new experimental or clinical annotations as they appear in the literature or in databases. We anticipate making regular evaluations at the time of each CAGI experiment.
Test file format
Prediction submission format
The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.
In the submitted file, each row must include the following tab-separated fields:
In the file, cells in columns 5-8 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must submit a prediction and a standard deviation for every variant; if you are not confident in a prediction for a variant, enter an appropriately large standard deviation for that prediction. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in those cells. Please make sure you follow the submission guidelines strictly.
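As a rough pre-check before running the organizers' validation script, the two rules stated above (tab-delimited rows, no empty cells) can be sketched as follows. This is not the official validator. The function name `check_rows` is hypothetical, and the minimum of 8 columns is an assumption inferred only from the mention of "columns 5-8"; the template defines the actual layout.

```python
def check_rows(lines, min_columns=8):
    """Return a list of (line_number, message) problems.

    Assumptions (not taken from the official validator):
    - every row is tab-delimited;
    - every row has at least `min_columns` fields, since the
      instructions refer to columns 5-8;
    - a cell that is empty or only whitespace is an error, while a
      literal "*" is accepted as a placeholder.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < min_columns:
            problems.append(
                (number, f"expected at least {min_columns} fields, found {len(fields)}")
            )
            continue
        for column, value in enumerate(fields, start=1):
            if value.strip() == "":
                problems.append((number, f"empty cell in column {column}"))
    return problems
```

To use it, pass an open file object, for example `check_rows(open("bestincagi_model_3.txt", encoding="utf-8"))`; an empty list means no problems were found under these assumptions. The provided validation script remains the authoritative check.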
In addition, your submission must include a detailed description of the method used to make the predictions, similar in style to the Methods section in a scientific article. This information will be submitted as a separate file.
If predictors assent, predictions will also be incorporated into dbNSFP. This must be explicitly specified in the document describing the method.
CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; for each model, the last submission before the deadline will be evaluated.
Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
To include a description of your method, use the following filename: <teamname>_desc.*
Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
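The naming rules above can be checked with a pair of regular expressions. The sketch below is illustrative only: the function name `classify_submission_filename` is hypothetical, and because the page does not say which characters a team name may contain, the patterns assume any non-empty name without path separators.

```python
import re

# <teamname>_model_(1|2|3|4|5|6).(tsv|txt)
# Assumption: a team name is any non-empty string without "/" or "\".
MODEL_RE = re.compile(r"^(?P<team>[^/\\]+)_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)$")

# <teamname>_desc.*  (any non-empty extension is accepted here)
DESC_RE = re.compile(r"^(?P<team>[^/\\]+)_desc\.(?P<ext>.+)$")


def classify_submission_filename(name):
    """Return ("model", team, model_number), ("description", team, None),
    or None if the name matches neither pattern."""
    match = MODEL_RE.match(name)
    if match:
        return ("model", match.group("team"), int(match.group("model")))
    match = DESC_RE.match(name)
    if match:
        return ("description", match.group("team"), None)
    return None
```

With the example from the text, `classify_submission_filename("bestincagi_model_3.txt")` returns `("model", "bestincagi", 3)`, while a name such as `bestincagi_model_7.txt` is rejected because only models 1 through 6 are allowed.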
Hu Z, et al. VIPdb, a genetic Variant Impact Predictor database. Hum Mutat (2019) 40:1202-1214.
Liu X, et al. dbNSFP: A lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat (2011) 32:894-899.
Liu X, et al. dbNSFP v2.0: a database of human non-synonymous SNVs and their functional predictions and annotations. Hum Mutat (2013) 34:E2393-2402.
Liu X, et al. dbNSFP v3.0: a one-stop database of functional predictions and annotations for human nonsynonymous and splice-site SNVs. Hum Mutat (2016) 37:235-241.
Liu X, et al. dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med (2020) 12:103.
Divided by chromosomes: dbNSFP4_nsSNV.zip (790MB).
Download submission template extract: annotateallmissensetemplate.zip
Download submission validation script: annotateallmissensevalidation.py
Dataset provided by
Xiaoming Liu, University of South Florida
03 May 2021: initial release
01 June 2021: challenge opens
08 June 2021: file naming instructions added
13 August 2021: submission deadline extended to September 30
30 September 2021: submission deadline extended to October 11
11 October 2021: challenge closed