
Clotting disease (DVT or PE) exomes

Challenge: Clotting disease (DVT or PE) exomes

Dataset description: public

Dataset availability: encrypted

Exome sequence data: registered users only, limited by CAGI Data Use Agreement

Last updated: 14 April 2018

This challenge will tentatively close at 9:00 PM PST (Pacific Standard Time) on 23 Apr 2018.

The challenge is closed - encrypted data is not available.

Download answer key and predictions: The answer key and predictions are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the file.

Presentations from the CAGI 5 conference: Presentations are accessible to registered users only, and their use is limited by the CAGI Data Use Agreement. Please log in to access the file.


African Americans have a higher incidence of venous thromboembolism (VTE), which includes deep vein thrombosis (DVT) and pulmonary embolism (PE), than people of European ancestry. Participants are provided with exome data and clinical covariates for a cohort of African Americans who have been prescribed warfarin either because they had experienced a VTE event or because they had been diagnosed with atrial fibrillation (which predisposes to clotting). The challenge is to distinguish between these two conditions. At present, in contrast to the situation for people of European ancestry, there are no genetic methods for anticipating which African Americans are most at risk of a venous thromboembolism, and the results of this challenge may contribute to the development of such tools.


Each year in the US, 300,000 to 900,000 people suffer a venous thromboembolism (VTE) (Beckman et al., 2010). There are two types of VTE: deep vein thrombosis (DVT) and pulmonary embolism (PE). African Americans have a 30–60% higher incidence of either or both than people of European ancestry (Roberts et al., 2009; Zakai & McClure, 2011). The risk factors for VTE are complex and include environmental risk factors (e.g., vessel injury and blood stasis) and genetic risk factors, including common and/or rare variants that predispose to hypercoagulation (Feero, 2004). Clinically tested genetic variants in individuals of European descent include F5 p.R506Q (5% prevalence) (MIM#: 612309), which confers a three- to fivefold increased risk of VTE in carriers, and F2 G20210A (0.7 to 4% prevalence) (MIM#: 176930), which confers a two- to threefold increased risk of VTE in carriers (Middeldorp & van Hylckama Vlieg, 2008; Rosendaal & Reitsma, 2009). However, the typical genetic risk factors in populations of European descent are nearly absent in African Americans, and population-specific genetic factors influencing the higher VTE rate are not well characterized (Dowling et al., 2003). A study published in Molecular Genetics and Genomic Medicine identified a population-specific genetic risk factor in African Americans, but much of the genetic risk is still undiscovered (Daneshjou et al., 2016).

Indication for warfarin prescription. Warfarin is an anticoagulant that works by inhibiting vitamin K epoxide reductase, which is essential for the production in the liver of clotting factors II, VII, IX and X and (to a lesser extent) protein C and protein S. Its primary indications are nonvalvular atrial fibrillation, inherited clotting disorders that predispose to VTE, and the treatment of DVT and PE.

Prediction challenge

Exome data and clinical covariates are provided for a cohort of African American individuals who have been prescribed long-term warfarin. Some individuals were prescribed warfarin because they had experienced a venous thromboembolism (VTE), that is, either a deep vein thrombosis (DVT) or a pulmonary embolism (PE). The others had not experienced a VTE and were prescribed warfarin because of an atrial fibrillation diagnosis.

The prediction challenge is to identify, based on the provided exome data, which individuals had a VTE due to a clotting disorder and which had atrial fibrillation. VTE (DVT or PE) may result from an environmental trigger, such as prolonged inactivity or immobility, recent surgery, or malignancy. However, most of the VTE individuals in this study were not subject to these factors and have "unprovoked" VTE.

Dataset description

The data set contains the following components:

A description of how the genomic data were collected is available in the Methods section of Daneshjou et al. (2014). That paper also contains a large amount of other relevant information, including an analysis and a prediction model developed by the dataset providers.

Download dataset: This dataset file is available only to registered users. Please log in to access the file.

Download clinical covariates file: This clinical covariates file is available only to registered users. Please log in to access the file.

Download submission template: This submission template file is available only to registered users. Please log in to access the file.

Download validation script: This submission validation script is available only to registered users. Please log in to access the file.

Prediction submission format

The prediction submission is a tab-delimited text file. Organizers provide a file template, which must be used for submission. In addition, a validation script will be provided, and predictors must check the correctness of the format before submitting their predictions.

In the submitted file, each row must include the following tab-separated fields:

In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. If you choose not to enter a prediction, leave the "*" in those cells. If you are not confident in a prediction for an individual, enter a large standard deviation for that prediction. Optionally, enter brief comments indicating the basis of the predictions; otherwise, leave the "*" in these cells. Please make sure you follow the submission guidelines strictly.
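The format rules above can be checked locally before running the official validation script. The following is a minimal sketch, not the organizers' script. It assumes four tab-separated columns per row (an individual identifier followed by columns 2-4), that columns 2 and 3 hold a numeric prediction and its standard deviation, that column 4 holds a free-text comment, and that header rows have already been removed. The function name `check_submission` and the sample identifiers are hypothetical.

```python
PLACEHOLDER = "*"
EXPECTED_COLUMNS = 4  # assumption: identifier plus columns 2-4 of the template


def check_submission(lines):
    """Return a list of format problems found in the given data rows.

    `lines` is an iterable of raw text lines without a header row
    (assumption; skip any header before calling).
    """
    problems = []
    for row_number, line in enumerate(lines, start=1):
        cells = line.rstrip("\r\n").split("\t")
        if len(cells) != EXPECTED_COLUMNS:
            problems.append(
                f"row {row_number}: expected {EXPECTED_COLUMNS} cells, "
                f"found {len(cells)}"
            )
            continue
        # No empty cells are allowed; an unfilled cell must keep the "*".
        for column_number, cell in enumerate(cells, start=1):
            if cell.strip() == "":
                problems.append(
                    f"row {row_number}, column {column_number}: empty cell"
                )
        # Assumption: columns 2 and 3 are numeric unless left as "*".
        for column_number in (2, 3):
            cell = cells[column_number - 1].strip()
            if cell in ("", PLACEHOLDER):
                continue
            try:
                float(cell)
            except ValueError:
                problems.append(
                    f"row {row_number}, column {column_number}: "
                    f"'{cell}' is not a number"
                )
    return problems


if __name__ == "__main__":
    sample = [
        "P001\t0.8\t0.1\tcarrier of a known risk variant\n",
        "P002\t*\t*\t*\n",
        "P003\t0.4\t\t*\n",
    ]
    for problem in check_submission(sample):
        print(problem)
```

A check like this only catches structural mistakes such as missing cells or non-numeric values. The validation script supplied by the organizers remains the authority on what constitutes a valid submission.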

In addition, your submission must include a detailed description of the method used to make the predictions, similar in style to the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you must create or be part of a CAGI User group. Submit your predictions by accessing the link: "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Ethical consideration

The data in this challenge are derived from a cohort of individuals who are described as being of African ancestry, and more specifically, they are presented as being “African American,” a social category, in the sense that being a member of this group is determined by social relationships rather than by biological characteristics. Yet the prevalence of many gene variants differs in populations of different geographic ancestry and these differences may sometimes result in differences in disease risk. As a result, the social category of “African-American” cannot be fully or easily separated from the genetic implications of having recent ancestors from Africa. Population genetic differences have implications for genomic research; inclusion of all human populations is needed to ensure a complete understanding of the genetic contributors to health. But the goal of inclusion has not yet been achieved: under-representation of individuals of non-European ancestry persists. The lack of diversity in population genomic datasets presents a challenge to the meaningful interpretation of genomic findings in diverse populations, as well as a challenge in the just translation of genomic results into clinical care.

Attending to this challenge necessitates building and maintaining trusting relationships between researchers and participants, as well as addressing systemic racial inequities and historical power imbalances. This CAGI challenge addresses the apparent increased incidence of VTE in African-Americans compared to individuals of European ancestry, and seeks to inform a better understanding of genome variants that may contribute to risk of VTE through study of an African-American population.

However, it is critical for researchers to attend to the potential negative consequences of associating a particular disease variant with a specific population. Such associations must be drawn with care. Although prevalence of genomic variants may differ, few genomic variants associated with disease are found exclusively in one ancestral population. Also, it is critical to elucidate both the genetic and the environmental correlates of a common disease. Because the category of “African American” is primarily social, factors such as differential exposure to poverty and environmental risk factors, as well as discrimination that affects access to high-quality health care, may account for differences in VTE risk. Although it may be useful to study variation within a single under-represented population in order to ensure a broader understanding of the genetic contributors to a particular disease, researchers should take care not to assume that differences in disease risk among populations are due to a genetic cause.

In presenting results, researchers have an obligation to be attentive to the language they use when describing population categories. Associating genes or genetic variants with a particular population may lead to stigma. Further, some terms, such as “Caucasian,” are archaic categories associated with an era of scientific racism. For research to be valid, it must draw careful distinctions between geographic ancestry and social aspects of difference across human populations. Researchers can do this by paying careful attention to how they describe and report the populations they study. Scientists using this CAGI data set should consider these issues as they report their research results.

Dataset provided by

Roxana Daneshjou and Russ Altman, Stanford University School of Medicine


Dowling NF, Austin H, Dilley A, Whitsett C, Evatt BL, Hooper WC. 2003. The epidemiology of venous thromboembolism in Caucasians and African-Americans: the GATE Study. J Thromb Haemost 1:80-87. doi:10.1046/j.1538-7836.2003.00031.x

Daneshjou R, Cavallari LH, Weeke PE, Karczewski KJ, Drozda K, Perera MA, Johnson JA, Klein TE, Bustamante CD, Roden DM, Shaffer C, Denny JC, Zehnder JL, Altman RB. 2016. Population-specific single-nucleotide polymorphism confers increased risk of venous thromboembolism in African Americans. Mol Genet Genomic Med 4:513-520. PMCID:PMC5023936. doi:10.1002/mgg3.226

Feero WG. 2004. Genetic thrombophilia. Prim Care 31:685-709, xi. doi:10.1016/j.pop.2004.04.014

Middeldorp S, van Hylckama Vlieg A. 2008. Does thrombophilia testing help in the clinical management of patients? Br J Haematol 143:321-335. doi:10.1111/j.1365-2141.2008.07339.x

Roberts LN, Patel RK, Arya R. 2009. Venous thromboembolism and ethnicity. Br J Haematol 146:369-383. doi:10.1111/j.1365-2141.2009.07786.x

Rosendaal FR, Reitsma PH. 2009. Genetics of venous thrombosis. J Thromb Haemost 7 Suppl 1:301-304. doi:10.1111/j.1538-7836.2009.03394.x

Zakai NA, McClure LA. 2011. Racial differences in venous thromboembolism. J Thromb Haemost 9:1877-1882. doi:10.1111/j.1538-7836.2011.04443.x

Revision history

23 Nov 2017 (v01): initial release

30 Nov 2017 (v02): closing date

15 Apr 2017 (v03): typos fixed

24 Sep 2018 (v04): Dataset availability added