Clotting disease (DVT or PE) exomes

Dataset description: public
Exome sequence data: registered users only, limited by CAGI Data use agreement
Last updated: 12 Feb 2018
This challenge will tentatively be closed at 9:00 PM PST (Pacific Standard Time) on 16 Apr 2018.

Summary
African Americans have a higher incidence of venous thromboembolism (VTE), which includes deep vein thrombosis (DVT) and pulmonary embolism (PE), than people of European ancestry. Participants are provided with exome data and clinical covariates for a cohort of African Americans who have been prescribed warfarin either because they had experienced a VTE event or because they had been diagnosed with atrial fibrillation (which predisposes to clotting). The challenge is to distinguish between these conditions. At present, in contrast to the situation for people of European ancestry, there are no genetic methods for anticipating which African Americans are most at risk of a venous thromboembolism, and the results of this challenge may contribute to the development of such tools.

Background
Each year in the US, 300,000 to 900,000 people suffer a venous thromboembolism (VTE) (Beckman et al., 2010). There are two types of VTE: deep vein thrombosis (DVT) and pulmonary embolism (PE). African Americans have a 30–60% higher incidence of either or both than people of European ancestry (Roberts et al., 2009; Zakai & McClure, 2011). The risk factors for VTE are complex and include environmental risk factors (e.g., vessel injury and blood stasis) and genetic risk factors, including common and/or rare variants that predispose to hypercoagulation (Feero, 2004). Clinically tested genetic variants in individuals of European descent include F5 p.R506Q (5% prevalence) (MIM#: 612309), which confers a three- to fivefold increased risk of VTE in carriers, and F2 G20210A (0.7 to 4% prevalence) (MIM#: 176930), which confers a two- to threefold increased risk of VTE in carriers (Middeldorp & van Hylckama Vlieg, 2008; Rosendaal & Reitsma, 2009). However, the typical genetic risk factors in populations of European descent are nearly absent in African Americans, and population-specific genetic factors influencing the higher VTE rate are not well characterized (Dowling et al., 2003). A study published in Molecular Genetics and Genomic Medicine identified a population-specific genetic risk factor in African Americans, but much of the genetic risk is still undiscovered (Daneshjou et al., 2016).

Indication for warfarin prescription. Warfarin is an anticoagulant that works by inhibiting vitamin K epoxide reductase, which is essential for the production within the liver of clotting factors II, VII, IX and X and (to a lesser extent) protein C and protein S. Its primary indications are nonvalvular atrial fibrillation, inherited clotting disorders that predispose to VTE, and the treatment of DVT and PE.

Prediction challenge
Exome data and clinical covariates are provided for a cohort of African American individuals who have been prescribed long-term warfarin. Some individuals were prescribed warfarin because they had experienced a venous thromboembolism (VTE), either a deep vein thrombosis (DVT) or a pulmonary embolism (PE). The others had not experienced a VTE and were prescribed warfarin because of an atrial fibrillation diagnosis.

The prediction challenge is to identify which individuals had a VTE due to a clotting disorder and which had atrial fibrillation, based on the provided exome data. VTE (DVT or PE) may result from an environmental trigger, such as prolonged inactivity or immobility, recent surgery, or malignancy. However, most of the VTE individuals in this study were not subject to these factors and have ‘unprovoked’ VTE.

Dataset description
The data set contains the following components:

  1. A jointly called VCF file of genotypes for exomes of 103 African Americans on warfarin.
  2. A file of clinical covariates for the same individuals.

A description of how the genomic data were collected is available in the methods section of (Daneshjou et al., 2014). That reference also contains a large amount of other relevant information, including an analysis and prediction model developed by the dataset providers.
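
As a rough illustration of working with a jointly called VCF, the sketch below reads the sample IDs from the `#CHROM` header line and counts, for each individual, the sites whose genotype carries at least one non-reference allele. It uses only the Python standard library. The function names and the tiny in-memory example are hypothetical and are not part of the challenge materials; the sketch also assumes that GT is the first FORMAT subfield, as the VCF specification requires when GT is present.

```python
import io
import re


def read_sample_ids(lines):
    """Return the sample IDs from the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed VCF fields; samples start at column 10.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def count_non_reference(lines):
    """Count, per sample, the sites whose genotype has a non-reference allele."""
    samples = None
    counts = {}
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            counts = {s: 0 for s in samples}
            continue
        if samples is None:
            raise ValueError("data line found before the #CHROM header")
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[0]            # e.g. "0/1" or "1|1"
            alleles = re.split(r"[/|]", gt)
            # "0" is the reference allele and "." is a missing call.
            if any(a not in ("0", ".") for a in alleles):
                counts[sample] += 1
    return counts


# Hypothetical two-sample example, for illustration only.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:20\t0/0:18\n"
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:22\t./.:0\n"
)

if __name__ == "__main__":
    print(read_sample_ids(io.StringIO(EXAMPLE_VCF)))
    print(count_non_reference(io.StringIO(EXAMPLE_VCF)))
```

For a file on disk, the same functions accept an open text handle; a gzip-compressed VCF can be opened with `gzip.open(path, "rt")`.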

Download dataset
This dataset file is available only to registered users. Please log in to access the file.

Download clinical covariates file
This clinical covariates file is available only to registered users. Please log in to access the file.

Download submission template
This submission template file is available only to registered users. Please log in to access the file.

Download validation script
This submission validation script is available only to registered users. Please log in to access the file.

Prediction submission format
The prediction submission is a tab-delimited text file. Organizers provide a file template, which must be used for submission. In addition, a validation script will be provided, and predictors must check the correctness of the format before submitting their predictions.

In the submitted file, each row must include the following tab-separated fields:

  1. Individual - The ID number of the individual - pre-populated field
  2. Disease_status - The probability that this individual had a VTE and is on warfarin due to a clotting disorder. The probability should be a value between 0 and 1, with 0 meaning the individual had atrial fibrillation and 1 meaning the individual had a VTE.
  3. Standard_deviation - This defines the confidence of the prediction in column 2. A large SD means low confidence, while a small SD means that the predictor is confident about the submitted prediction.
  4. Comment - Optional brief comments on the basis of the prediction in column 2

In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. If you choose not to enter a prediction, leave the "*" in those cells. If you are not confident in a prediction for an individual, enter a large standard deviation for that prediction. Optionally, enter brief comments indicating the basis of the predictions; otherwise, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.
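
The rules above can be sketched in Python as follows. This is not the organizers' validation script, which should still be run before submitting. The function names are hypothetical, the range check assumes Disease_status values lie in the closed interval from 0 to 1, and `summarize` shows only one possible way to obtain a probability and a standard deviation, namely from the scores of several models for the same individual.

```python
import statistics


def summarize(scores):
    """Mean and population SD of several model scores for one individual."""
    return statistics.mean(scores), statistics.pstdev(scores)


def format_row(individual, prob=None, sd=None, comment=None):
    """Build one tab-separated submission row; missing values stay as '*'."""
    def cell(value):
        return "*" if value is None else str(value)
    return "\t".join([str(individual), cell(prob), cell(sd), cell(comment)])


def check_row(row):
    """Return a list of problems found in one row (empty list if none)."""
    fields = row.rstrip("\n").split("\t")
    if len(fields) != 4:
        return [f"expected 4 fields, found {len(fields)}"]
    problems = []
    _individual, prob, sd, _comment = fields
    if any(f.strip() == "" for f in fields):
        problems.append("empty cell (use '*' instead)")
    if prob != "*":
        try:
            if not 0.0 <= float(prob) <= 1.0:
                problems.append("Disease_status outside [0, 1]")
        except ValueError:
            problems.append("Disease_status is not a number")
    if sd != "*":
        try:
            if float(sd) < 0:
                problems.append("Standard_deviation is negative")
        except ValueError:
            problems.append("Standard_deviation is not a number")
    return problems


if __name__ == "__main__":
    # Hypothetical individual ID and model scores, for illustration only.
    prob, sd = summarize([0.5, 1.0])
    row = format_row("ID1", prob, sd)
    print(row)
    print(check_row(row))
```

Widely scattered model scores give a large standard deviation, which matches the guidance to report low confidence with a large SD.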

In addition, your submission must include a detailed description of the method used to make the predictions, similar in style to the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you must create or be part of a CAGI User group. Submit your predictions by accessing the link: "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Ethical consideration
The data in this challenge are derived from a cohort of individuals who are described as being of African ancestry, and more specifically, they are presented as being “African American,” a social category, in the sense that being a member of this group is determined by social relationships rather than by biological characteristics. Yet the prevalence of many gene variants differs in populations of different geographic ancestry and these differences may sometimes result in differences in disease risk. As a result, the social category of “African-American” cannot be fully or easily separated from the genetic implications of having recent ancestors from Africa. Population genetic differences have implications for genomic research; inclusion of all human populations is needed to ensure a complete understanding of the genetic contributors to health. But the goal of inclusion has not yet been achieved: under-representation of individuals of non-European ancestry persists. The lack of diversity in population genomic datasets presents a challenge to the meaningful interpretation of genomic findings in diverse populations, as well as a challenge in the just translation of genomic results into clinical care.

Attending to this challenge necessitates building and maintaining trusting relationships between researchers and participants, as well as addressing systemic racial inequities and historical power imbalances. This CAGI challenge addresses the apparent increased incidence of VTE in African-Americans compared to individuals of European ancestry, and seeks to inform a better understanding of genome variants that may contribute to risk of VTE through study of an African-American population.

However, it is critical for researchers to attend to the potential negative consequences of associating a particular disease variant with a specific population. Such associations must be drawn with care. Although prevalence of genomic variants may differ, few genomic variants associated with disease are found exclusively in one ancestral population. Also, it is critical to elucidate both genetic and environmental correlates of a common disease. Because the category of “African American” is primarily social, factors such as differential exposure to poverty and environmental risk factors, as well as discrimination that affects access to high-quality health care, may account for differences in VTE risk. Although it may be useful to study variation within a single under-represented population in order to ensure broader understanding of genetic contributors to a particular disease, researchers should take care not to assume that differences in disease risk among populations are due to a genetic cause.

In presenting results, researchers have an obligation to be attentive to the language they use when describing population categories. Associating genes or genetic variants with a particular population may lead to stigma. Further, some terms, such as “Caucasian,” are archaic categories associated with an era of scientific racism. For research to be valid, it must draw careful distinctions between geographic ancestry and social aspects of difference across human populations. Researchers can do this by paying careful attention to how they describe and report the populations they study. Scientists using this CAGI data set should consider these issues as they report their research results.


References
  1. Dowling NF, Austin H, Dilley A, Whitsett C, Evatt BL, Hooper WC. 2003. The epidemiology of venous thromboembolism in Caucasians and African-Americans: the GATE Study. J Thromb Haemost 1:80-87. doi:10.1046/j.1538-7836.2003.00031.x
  2. Daneshjou R, Cavallari LH, Weeke PE, Karczewski KJ, Drozda K, Perera MA, Johnson JA, Klein TE, Bustamante CD, Roden DM, Shaffer C, Denny JC, Zehnder JL, Altman RB. 2016. Population-specific single-nucleotide polymorphism confers increased risk of venous thromboembolism in African Americans. Mol Genet Genomic Med 4:513-520. PMCID:PMC5023936. doi:10.1002/mgg3.226
  3. Feero WG. 2004. Genetic thrombophilia. Prim Care 31:685-709, xi. doi:10.1016/j.pop.2004.04.014
  4. Middeldorp S, van Hylckama Vlieg A. 2008. Does thrombophilia testing help in the clinical management of patients? Br J Haematol 143:321-335. doi:10.1111/j.1365-2141.2008.07339.x
  5. Roberts LN, Patel RK, Arya R. 2009. Venous thromboembolism and ethnicity. Br J Haematol 146:369-383. doi:10.1111/j.1365-2141.2009.07786.x
  6. Rosendaal FR, Reitsma PH. 2009. Genetics of venous thrombosis. J Thromb Haemost 7 Suppl 1:301-304. doi:10.1111/j.1538-7836.2009.03394.x
  7. Zakai NA, McClure LA. 2011. Racial differences in venous thromboembolism. J Thromb Haemost 9:1877-1882. doi:10.1111/j.1538-7836.2011.04443.x

Dataset provided by


Roxana Daneshjou and Russ Altman, Stanford University School of Medicine

Revision history
23 Nov 2017 (v01): initial release

30 Nov 2017 (v02): closing date