Annotate everything

Challenge: Annotate everything Challenge
Dataset description: public
Variant data: public
Last updated: 30 Nov 2017

Summary
dbNSFP describes 81,084,849 possible protein-altering variants in the human genome. The challenge is to predict the functional effect of every such variant. For the vast majority of these missense and nonsense variants, the functional impact is not currently known, but experimental and clinical evidence is accruing rapidly. Rather than drawing upon a single discrete dataset, as is typical with CAGI, predictions will be assessed by comparing them with experimental or clinical annotations made available after the prediction submission date, on an ongoing basis. If predictors assent, predictions will also be incorporated into dbNSFP.

Background
Currently, dozens of in silico methods for predicting the deleteriousness of a variant are available. In many cases, different methods give opposite predictions for the same variant. dbNSFP is a database of human nonsynonymous single nucleotide variants (nsSNVs) and their functional predictions and annotations (Liu et al., 2011, 2013, 2016). It compiles 20 functional prediction scores and 6 conservation scores, as well as other related information, including allele frequencies observed in different large datasets, various gene IDs from different databases, functional descriptions of genes, and gene expression and gene interaction information.

Prediction challenge
A list of all possible nsSNVs based on the human reference sequence was created from dbNSFP (Liu et al., 2011, 2013, 2016). Predictors are asked to predict the functional effect of each coding SNV.

Since the vast majority of these nsSNVs do not have experimental information, this challenge will assess in silico predictions with new experimental or clinical annotations as they appear in literature or databases. We anticipate making evaluations at the time of each CAGI experiment, approximately once a year.

Test File Format

  1. chr: Chromosome number
  2. pos(1-based): Physical position on the chromosome as to hg38 (1-based coordinate). For mitochondrial SNV, this position refers to the rCRS (GenBank: NC_012920).
  3. Ref: Reference nucleotide allele (as on the + strand)
  4. Alt: Alternative nucleotide allele (as on the + strand)
  5. aaref: Reference amino acid. "X" if the variant is a stop-loss
  6. aaalt: Alternative amino acid. "X" if the variant is a stop-gain
  7. hg19_chr: Chromosome as to hg19, "." means missing
  8. hg19_pos(1-based): Physical position on the chromosome as to hg19 (1-based coordinate). For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
  9. hg18_chr: Chromosome as to hg18, "." means missing
  10. hg18_pos(1-based): Physical position on the chromosome as to hg18 (1-based coordinate). For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
  11. Genename: Gene name; if the nsSNV can be assigned to multiple genes, gene names are separated by ";"
  12. Cds_strand: Coding sequence (CDS) strand (+ or -)
  13. Refcodon: Reference codon
  14. Codonpos: Position on the codon (1, 2 or 3)
  15. Ensembl_geneid: Ensembl gene id
  16. Ensembl_transcriptid: Ensembl transcript ids (Multiple entries separated by ";")
  17. Ensembl_proteinid: Ensembl protein ids. Multiple entries separated by ";", corresponding to Ensembl_transcriptids
  18. AA: Amino acid position with respect to protein. Multiple entries separated by ";", corresponding to Ensembl_proteinid
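
The field list above can be read into a simple record structure. The following is a minimal sketch, not an official parser. It assumes that the test file is tab-delimited with one variant per line, that the 18 fields appear in the order listed, and that "." marks a missing value in any single-valued field (the description states this only for hg19_chr and hg18_chr). The names parse_variant_line, FIELDS and MULTI_VALUED are hypothetical and were chosen for illustration.

```python
# Field names in the order given in the test file format list.
FIELDS = [
    "chr", "pos", "Ref", "Alt", "aaref", "aaalt",
    "hg19_chr", "hg19_pos", "hg18_chr", "hg18_pos",
    "Genename", "Cds_strand", "Refcodon", "Codonpos",
    "Ensembl_geneid", "Ensembl_transcriptid", "Ensembl_proteinid", "AA",
]

# Fields described as holding several entries separated by ";".
MULTI_VALUED = {"Genename", "Ensembl_transcriptid", "Ensembl_proteinid", "AA"}


def parse_variant_line(line: str) -> dict:
    """Split one tab-delimited line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = {}
    for name, value in zip(FIELDS, values):
        if name in MULTI_VALUED:
            # Keep entries in order so the Ensembl transcript and protein
            # lists and the AA positions stay aligned with one another.
            record[name] = value.split(";")
        elif value == ".":
            # Assumption: "." means missing for every single-valued field.
            record[name] = None
        else:
            record[name] = value
    # The hg38 position is a 1-based integer coordinate.
    record["pos"] = int(values[1])
    return record
```

Because the multi-valued fields are kept as parallel lists, the i-th entry of AA can be paired with the i-th entry of Ensembl_proteinid.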

Prediction submission format
The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

In the submitted file, each row must include the following tab-separated fields:

  1. chr: chromosome number
  2. pos(1-based): physical position on the chromosome as to hg38 (1-based coordinate). For mitochondrial SNV, this position refers to the rCRS (GenBank: NC_012920).
  3. ref: reference nucleotide allele (as on the + strand)
  4. alt: alternative nucleotide allele (as on the + strand)
  5. Prediction Score: Annotation score for each SNV from 0 (benign) to 1 (deleterious)
  6. SD: standard deviation of the prediction in column 5 indicating confidence
  7. Pred: Based on the score in column 5, indicate whether the SNV is "D(amaging)", "T(olerated)" or "U(nknown)"
  8. Comments: optional brief comment on the basis of the prediction
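
As an illustration of the eight fields above, one submission row could be assembled as in the sketch below. It is not part of the organizers' tooling. The function name format_row is hypothetical, the four-decimal formatting is an arbitrary choice, and the use of the single letters D, T and U for the Pred column is an assumption based on the parenthesized labels. A missing comment is written as "*", which is the placeholder the template uses.

```python
from typing import Optional

# Assumption: Pred is encoded by the single letters D, T or U.
VALID_PRED = {"D", "T", "U"}


def format_row(chrom: str, pos: int, ref: str, alt: str,
               score: float, sd: float, pred: str,
               comment: Optional[str] = None) -> str:
    """Build one tab-separated submission row with the eight fields in order."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("Prediction Score must lie between 0 and 1")
    if sd < 0.0:
        raise ValueError("SD must be non-negative")
    if pred not in VALID_PRED:
        raise ValueError("Pred must be one of D, T or U")
    # Collapse tabs and newlines inside the comment so they cannot break
    # the tab-delimited layout; keep the "*" placeholder if no comment.
    text = " ".join(comment.split()) if comment else "*"
    return "\t".join([
        chrom, str(pos), ref, alt,
        f"{score:.4f}",  # assumption: four decimals are sufficient
        f"{sd:.4f}",
        pred, text,
    ])
```

For example, format_row("1", 12345, "A", "C", 0.8, 0.1, "D") yields a row whose last cell is "*".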

In the file, cells in columns 5-8 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must submit a prediction and a standard deviation for every variant; if you are not confident in a prediction for a variant, enter an appropriately large standard deviation for that prediction. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in that cell. Please follow the submission guidelines strictly.
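
Before running the organizers' validation script, the rules in the paragraph above can be checked locally with a quick pre-check. The sketch below is not that script and may be stricter or looser than it. The function name check_row is hypothetical, and the single-letter D, T, U encoding of the Pred column is an assumption.

```python
import math
from typing import List, Optional


def _to_float(text: str) -> Optional[float]:
    """Return a finite float parsed from text, or None if that is impossible."""
    try:
        value = float(text)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def check_row(line: str) -> List[str]:
    """Return a list of problems found in one submission row (empty if none)."""
    cells = line.rstrip("\n").split("\t")
    if len(cells) != 8:
        return [f"expected 8 tab-separated fields, found {len(cells)}"]
    problems = []
    if any(cell.strip() == "" for cell in cells):
        problems.append("empty cells are not allowed")
    # Column 5: a leftover "*" fails to parse as a number and is reported here.
    score = _to_float(cells[4])
    if score is None or not 0.0 <= score <= 1.0:
        problems.append("Prediction Score must be a number from 0 to 1")
    # Column 6: the standard deviation is required for every variant.
    sd = _to_float(cells[5])
    if sd is None or sd < 0.0:
        problems.append("SD must be a non-negative number")
    # Column 7: assumption that single letters are expected.
    if cells[6] not in {"D", "T", "U"}:
        problems.append("Pred must be D, T or U")
    # Column 8 may keep the "*" placeholder, so only emptiness is checked.
    return problems
```

A row with a filled score, standard deviation and label, and "*" left in the comment cell, returns an empty list.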

In addition, your submission must include a detailed description of the method used to make the predictions, similar to the style of the Methods section in a scientific article. This information must be submitted as a separate file.

If predictors assent, predictions will also be incorporated into dbNSFP. This must be explicitly specified in the document describing the method.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link "All submission forms" from the front page of your group. For more details, please read the FAQ page.

References

  1. Liu X, Jian X, Boerwinkle E. 2011. dbNSFP: A lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat 32:894-899. PMCID:PMC3145015. doi:10.1002/humu.21517
  2. Liu X, Jian X, Boerwinkle E. 2013. dbNSFP v2.0: A database of human non-synonymous SNVs and their functional predictions and annotations. Hum Mutat 34:E2393-2402. PMCID:PMC4109890. doi:10.1002/humu.22376
  3. Liu X, Wu C, Li C, Boerwinkle E. 2016. dbNSFP v3.0: A one-stop database of functional predictions and annotations for human nonsynonymous and splice-site SNVs. Hum Mutat 37:235-241. PMCID:PMC4752381. doi:10.1002/humu.22932

Download dataset
Divided by chromosomes: 5-dbNSFP3.4a_per_chr.zip (633.79 MB)

All chromosomes in one file: dbNSFP3.4a_all_variant_subset.zip (633.79 MB)

Download submission template extract
This submission template is available only to registered users. Please log in to access the file.

Download submission validation script
This submission validation script is available only to registered users. Please log in to access the file.

Dataset Providers:
Xiaoming Liu from the University of Texas School of Public Health

Revision history
30 Nov 2017 (v01): initial release