
Clotting disease (DVT or PE) exomes

Dataset availability: encrypted, registered users only

Last updated: 14 April 2018

This challenge is closed.

Make sure you understand our Data Use Agreement and Anonymity Policy.


African Americans have a higher incidence of venous thromboembolism (VTE), which includes deep vein thrombosis (DVT) and pulmonary embolism (PE), than people of European ancestry. Participants are provided with exome data and clinical covariates for a cohort of African Americans who have been prescribed warfarin either because they had experienced a VTE event or because they had been diagnosed with atrial fibrillation (which predisposes to clotting). The challenge is to distinguish between these two groups of individuals. At present, in contrast to the situation for people of European ancestry, there are no genetic methods for anticipating which African Americans are most at risk of a venous thromboembolism, and the results of this challenge may contribute to the development of such tools.


Each year in the US, 300,000 to 900,000 people suffer a venous thromboembolism (VTE) (Beckman et al., 2010). There are two types of VTE: deep vein thrombosis (DVT) and pulmonary embolism (PE). African Americans have a 30–60% higher incidence of either or both than people of European ancestry (Roberts et al., 2009; Zakai & McClure, 2011). The risk factors for VTE are complex and include environmental risk factors (e.g., vessel injury and blood stasis) and genetic risk factors, including common and/or rare variants that predispose to hypercoagulation (Feero, 2004). Clinically tested genetic variants in individuals of European descent include F5 p.R506Q (5% prevalence) (MIM#: 612309), which confers a three- to fivefold increased risk of VTE in carriers, and F2 G20210A (0.7 to 4% prevalence) (MIM#: 176930), which confers a two- to threefold increased risk of VTE in carriers (Middeldorp & van Hylckama Vlieg, 2008; Rosendaal & Reitsma, 2009). However, the genetic risk factors typical of populations of European descent are nearly absent in African Americans, and the population-specific genetic factors influencing the higher VTE rate are not well characterized (Dowling et al., 2003). A study published in Molecular Genetics and Genomic Medicine identified a population-specific genetic risk factor in African Americans, but much of the genetic risk is still undiscovered (Daneshjou et al., 2016).

Indication for warfarin prescription. Warfarin is an anticoagulant that works by inhibiting vitamin K epoxide reductase, which is essential for the production in the liver of clotting factors II, VII, IX and X and (to a lesser extent) protein C and protein S. Its primary indications are nonvalvular atrial fibrillation, inherited clotting disorders that predispose to VTE, and the treatment of DVT and PE.

Prediction challenge

Exome data and clinical covariates are provided for a cohort of African American individuals who have been prescribed long-term warfarin. Some individuals were prescribed warfarin because they had experienced a venous thromboembolism (VTE), either a deep vein thrombosis (DVT) or a pulmonary embolism (PE). Others had not experienced a VTE and were prescribed warfarin because of an atrial fibrillation diagnosis.

The prediction challenge is to identify, based on the provided exome data, which individuals had a VTE due to a clotting disorder and which had atrial fibrillation. VTE (DVT or PE) may result from an environmental trigger, such as prolonged inactivity or immobility, recent surgery, or malignancy. However, most of the individuals with VTE in this study were not subject to these factors and have ‘unprovoked’ VTE.

Dataset description

The data set contains the following components:

A description of how the genomic data were collected is available in the methods section of Daneshjou et al. (2014). That reference also contains a large amount of other relevant information, including an analysis and a prediction model developed by the dataset providers.

Prediction submission format 

The prediction submission is a tab-delimited text file. Organizers provide a file template, which must be used for submission. In addition, a validation script will be provided, and predictors must check the correctness of the format before submitting their predictions.

In the submitted file, each row must include the following tab-separated fields:

In the template file, cells in columns 2-4 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission: if you choose not to enter a prediction, leave the "*" in those cells. If you are not confident in a prediction for an individual, enter a large standard deviation for that prediction. Optionally, enter brief comments indicating the basis of the predictions; otherwise, leave the "*" in the comment cells. Please follow the submission guidelines strictly.
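The rules above can be pre-checked with a short script before running the organizers' validation script. The following Python sketch is illustrative only and is not the official validator. It assumes four columns per row (an individual ID, a prediction, a standard deviation, and a comment), which is inferred from the description of columns 2-4 above rather than taken from the template, and the function name check_submission and the has_header option are hypothetical.

```python
import math

N_COLUMNS = 4      # assumption: ID, prediction, standard deviation, comment
PLACEHOLDER = "*"  # from the challenge text: unfilled cells keep the "*"


def _is_number(cell):
    """Return True if the cell parses as a finite float."""
    try:
        return math.isfinite(float(cell))
    except ValueError:
        return False


def check_submission(lines, has_header=False):
    """Return a list of problems found in an iterable of submission lines.

    `has_header` is a hypothetical option: set it if the first line of the
    file is a header row rather than a prediction.
    """
    problems = []
    for row_number, line in enumerate(lines, start=1):
        if has_header and row_number == 1:
            continue
        cells = line.rstrip("\r\n").split("\t")
        if len(cells) != N_COLUMNS:
            problems.append(
                f"row {row_number}: expected {N_COLUMNS} tab-separated "
                f"cells, found {len(cells)}"
            )
            continue
        if any(cell.strip() == "" for cell in cells):
            problems.append(f"row {row_number}: empty cell (leave '*' instead)")
            continue
        _sample_id, prediction, std_dev, _comment = cells
        # Numeric columns must hold either a finite number or the placeholder.
        for name, cell in (("prediction", prediction),
                           ("standard deviation", std_dev)):
            if cell != PLACEHOLDER and not _is_number(cell):
                problems.append(
                    f"row {row_number}: {name} {cell!r} is neither a "
                    f"number nor '*'"
                )
        # A standard deviation cannot be negative.
        if _is_number(std_dev) and float(std_dev) < 0:
            problems.append(f"row {row_number}: negative standard deviation")
    return problems
```

A typical use would be to open the filled-in template and print whatever check_submission returns, for example `print(check_submission(open("predictions.tsv")))`, where the file name is a placeholder. An empty list means only that these basic checks passed; the validation script supplied by the organizers remains the authoritative test of the format.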

In addition, your submission must include a detailed description of the method used to make the predictions, similar in style to the Methods section in a scientific article. This information will be submitted as a separate file.

To submit predictions, you must create or be part of a CAGI User group. Submit your predictions by following the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Ethical consideration

The data in this challenge are derived from a cohort of individuals who are described as being of African ancestry, and more specifically, they are presented as being “African American,” a social category, in the sense that being a member of this group is determined by social relationships rather than by biological characteristics. Yet the prevalence of many gene variants differs in populations of different geographic ancestry and these differences may sometimes result in differences in disease risk. As a result, the social category of “African American” cannot be fully or easily separated from the genetic implications of having recent ancestors from Africa. Population genetic differences have implications for genomic research; inclusion of all human populations is needed to ensure a complete understanding of the genetic contributors to health. But the goal of inclusion has not yet been achieved: under-representation of individuals of non-European ancestry persists. The lack of diversity in population genomic datasets presents a challenge to the meaningful interpretation of genomic findings in diverse populations, as well as a challenge in the just translation of genomic results into clinical care.

Attending to this challenge necessitates building and maintaining trusting relationships between researchers and participants, as well as addressing systemic racial inequities and historical power imbalances. This CAGI challenge addresses the apparent increased incidence of VTE in African Americans compared to individuals of European ancestry, and seeks to inform a better understanding of genome variants that may contribute to risk of VTE through study of an African American population.

However, it is critical for researchers to attend to the potential negative consequences of associating a particular disease variant with a specific population. Such associations must be drawn with care. Although prevalence of genomic variants may differ, few genomic variants associated with disease are found exclusively in one ancestral population. Also, it is critical to elucidate genetic and environmental correlates of a common disease. Because the category of “African American” is primarily social, factors such as differential exposure to poverty and environmental risk factors, as well as discrimination that affects access to high quality health care, may account for differences in VTE risk. Although it may be useful to study variation within a single under-represented population, in order to ensure broader understanding of genetic contributors to a particular disease, researchers should use care not to assume that differences in disease risk among populations are due to a genetic cause.

In presenting results, researchers have an obligation to be attentive to the language they use when describing population categories. Associating genes or genetic variants with a particular population may lead to stigma. Further, some terms, such as “Caucasian,” are archaic categories associated with an era of scientific racism. For research to be valid, it must draw careful distinctions between geographic ancestry and social aspects of difference across human populations. Researchers can do this by paying careful attention to how they describe and report the populations they study. Scientists using this CAGI data set should consider these issues as they report their research results.

Dataset provided by

Roxana Daneshjou and Russ Altman, Stanford University School of Medicine


Dowling NF, et al. The epidemiology of venous thromboembolism in Caucasians and African-Americans: the GATE Study. J Thromb Haemost (2003) 1(1):80-87.

Daneshjou R, et al. Population-specific single-nucleotide polymorphism confers increased risk of venous thromboembolism in African Americans. Mol Genet Genomic Med (2016) 4(5):513-520.

Feero WG. Genetic thrombophilia. Prim Care (2004) 31(3):685-709.

Middeldorp S, van Hylckama Vlieg A. Does thrombophilia testing help in the clinical management of patients? Br J Haematol (2008) 143(3):321-335.

Roberts LN, et al. Venous thromboembolism and ethnicity. Br J Haematol (2009) 146(4):369-383.

Rosendaal FR, Reitsma PH. Genetics of venous thrombosis. J Thromb Haemost (2009) Suppl 1:301-304.

Zakai NA, McClure LA. Racial differences in venous thromboembolism. J Thromb Haemost (2011) 9(10):1877-1882.

Revision history 

23 November 2017: Initial release 

30 November 2017: Closing date 

15 April 2018: Typos fixed

23 April 2018: Challenge closed

24 September 2018: Dataset availability added