
Using exome sequencing data from a family, identify which individual has hypoalphalipoproteinemia (HA)

Dataset description: public

Exome sequence data: registered users only, limited by CAGI Data Use Agreement

This challenge closed on 25 April 2013. 

HA Challenge answer key (23 KB, docx): registered users only, limited by CAGI Data Use Agreement

Assessor summary (98 KB, doc): registered users only, limited by CAGI Data Use Agreement

Slides from the CAGI conference: registered users only, limited by CAGI Data Use Agreement

      Angel Mak: Data Provider (2.5 MB, remixable ppt)

      Shamil Sunyaev: Assessment (5 MB, remixable ppt)

      Lipika Ray: Predictor Talk (1.8 MB, remixable ppt)

      Emanuela Leonardi: Predictor Talk (4.3 MB, remixable ppt)

      Yanay Ofran: Predictor Talk (1.8 MB, remixable ppt)

      Nathaniel Pearson: Predictor Talk (5 MB, remixable ppt)

Predictions (594.5 KB, zip): registered users only, limited by CAGI Data Use Agreement


Hypoalphalipoproteinemia (HA) is characterized by severely decreased serum high-density lipoprotein cholesterol (HDL-C) levels and low apolipoprotein A-1 (APOA1). Low HDL-C is a risk factor for coronary artery disease.

Prediction challenge

One member of this family has HA, characterized by extremely low serum HDL-C. Predict which individual is affected. It may be helpful to know that the affected individual also has hepatosplenomegaly, lymphadenopathy, and short stature.

Dataset Information

The dataset contains variant information for the four subjects in the family, extracted from exome sequencing data. The exome sequencing data were generated on the Illumina HiSeq2000 platform with the TruSeq exome enrichment protocol and processed using an automated pipeline for next-generation sequencing data (Chapman). Reads were aligned to hg19 with BWA, and SNPs and indels were called with GATK and Dindel.

Variant information for the four subjects is given in separate files: HA_13.vcf (daughter), HA_14.vcf (father), HA_15.vcf (son), HA_16.vcf (mother)
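As a starting point for working with the four files, the following is a minimal sketch of reading genotype calls from a single-sample VCF. It assumes standard tab-delimited VCF 4.x lines with a FORMAT column containing a GT key; the actual challenge files may carry additional fields or conventions. The names `Call`, `parse_single_sample_vcf`, and `is_homozygous_alt` are hypothetical helpers, and the example lines are synthetic, not taken from the challenge data.

```python
from typing import Iterable, Iterator, NamedTuple


class Call(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    genotype: str  # e.g. "0/1" or "1/1"


def parse_single_sample_vcf(lines: Iterable[str]) -> Iterator[Call]:
    """Yield one Call per data line of a single-sample VCF (assumed layout)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        if "GT" not in format_keys:
            continue  # no genotype recorded for this site
        genotype = sample_values[format_keys.index("GT")]
        yield Call(fields[0], int(fields[1]), fields[3], fields[4], genotype)


def is_homozygous_alt(genotype: str) -> bool:
    """True for diploid calls with two identical non-reference alleles."""
    alleles = genotype.replace("|", "/").split("/")
    return (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in ("0", ".")
    )


# Synthetic example lines for illustration only.
EXAMPLE = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:20",
    "chr1\t2000\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1/1:35",
]

calls = list(parse_single_sample_vcf(EXAMPLE))
hom_alt = [c for c in calls if is_homozygous_alt(c.genotype)]
```

To apply this to a real file, one could pass an open file handle, for example `parse_single_sample_vcf(open("HA_13.vcf"))`, and compare the resulting calls across the four family members.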

The datasets are available only to registered users; please log in to access the data.

Prediction submission format 

The template submission file starts with a header line, followed by a separate line for each individual. The first column identifies the individual. In column 2, please provide the probability P(abnormal) that the individual has the phenotype: an extreme HDL-C level together with hepatosplenomegaly, lymphadenopathy, and short stature. In column 3, please provide the standard deviation of that probability, which expresses your confidence in the prediction of P. Each probability P should be in the range 0 to 1.

Format of Prediction Submission Template

Individual    P (abnormal)    Standard deviation

Please use the submission file template provided for your submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

Download submission template

Download validation script (not available)

The submission template file is a tab-delimited, plain text file in which every blank cell is marked with an "*". Enter your predictions by replacing each "*" with your value. No empty cells are allowed in the submission; if you cannot provide a prediction for an individual, leave the "*" in those cells. Please follow these submission guidelines strictly.
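The format rules above can be checked locally before submitting. The following is a minimal sketch, not the official validation script: it assumes three tab-separated columns (individual, P, standard deviation), treats the range for P as inclusive of 0 and 1, and additionally assumes that a standard deviation must be non-negative. The function name `validate_submission` and the row labels in the example are hypothetical.

```python
from typing import List


def validate_submission(text: str, expected_columns: int = 3) -> List[str]:
    """Return a list of format problems; an empty list means none were found."""
    errors: List[str] = []
    lines = text.splitlines()
    if len(lines) < 2:
        return ["file needs a header line and at least one individual"]
    # Line 1 is the header; data lines start at line 2.
    for number, line in enumerate(lines[1:], start=2):
        cells = line.split("\t")
        if len(cells) != expected_columns:
            errors.append(
                f"line {number}: expected {expected_columns} tab-separated "
                f"cells, found {len(cells)}"
            )
            continue
        if any(cell.strip() == "" for cell in cells):
            errors.append(f"line {number}: empty cell; use '*' for a missing value")
            continue
        for name, cell in (("P", cells[1]), ("standard deviation", cells[2])):
            if cell == "*":
                continue  # placeholder for "no prediction" is allowed
            try:
                value = float(cell)
            except ValueError:
                errors.append(f"line {number}: {name} '{cell}' is not a number")
                continue
            if name == "P" and not 0.0 <= value <= 1.0:
                errors.append(f"line {number}: P {value} is outside 0 to 1")
            if name != "P" and not value >= 0.0:
                errors.append(f"line {number}: standard deviation {value} is negative")
    return errors


# Hypothetical submission text; the row labels are illustrative only.
example = (
    "Individual\tP\tSD\n"
    "HA_13\t0.7\t0.1\n"
    "HA_14\t*\t*\n"
    "HA_15\t1.2\t0.1\n"
    "HA_16\t0.1\t\n"
)
problems = validate_submission(example)
```

In this example, `problems` flags the out-of-range probability on line 4 and the empty cell on line 5, while the row that keeps the "*" placeholders passes.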

Methods description 

In addition, your submission should include a detailed description of the method used to make the predictions (similar to the style of the Methods section in a scientific article). This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions through the "All submission forms" link on the front page of your group. For more details, please read the FAQ page.

Data Providers

Cardiovascular Research Institute, University of California San Francisco

Nina Gonzaludo, Clive R. Pullinger, Paul L.F. Tang, Mary J. Malloy, John P. Kane and Pui-Yan Kwok


This challenge is being assessed by Shamil Sunyaev, Harvard University.