Critical Assessment of Genome Interpretation

Familial combined hyperlipidemia: Identify mutation(s) conferring LDL-C disease phenotype and identify individuals with abnormal TG and HDL-C levels

Dataset description: public

Exome sequence data: registered users only, limited by CAGI Data Use Agreement

This challenge closed on 25 April 2013.

FCH Challenge answer key (23 KB, docx): registered users only, limited by CAGI Data Use Agreement

Assessor summary (98 KB, doc): registered users only, limited by CAGI Data Use Agreement

Slides from the CAGI conference: registered users only, limited by CAGI Data Use Agreement

Angel Mak: Data Provider (2.5 MB, remixable ppt)

Shamil Sunyaev: Assessment (5 MB, remixable ppt)

Lipika Ray: Predictor Talk (1.8 MB, remixable ppt)

Emanuela Leonardi: Predictor Talk (4.3 MB, remixable ppt)

Yanay Ofran: Predictor Talk (1.8 MB, remixable ppt)

Nathaniel Pearson: Predictor Talk (5 MB, remixable ppt)

Predictions (1.2 MB, zip): registered users only, limited by CAGI Data Use Agreement

Background

Familial combined hyperlipidemia (FCH; http://www.omim.org/entry/143890), the most prevalent hyperlipidemia, is a complex metabolic disorder characterized by the variable occurrence of elevated low-density lipoprotein cholesterol (LDL-C) and high triglycerides (TG). The condition is commonly associated with coronary artery disease (CAD).

Prediction challenge

This is a two-part challenge involving exome sequencing data provided for five individuals in an FCH family.

In the first challenge, you are told which family members have elevated LDL-C and asked to predict which mutation(s) confer the elevated LDL-C phenotype.
In the second challenge, you are asked to predict which individuals have abnormal TG levels and which individuals have abnormal high-density lipoprotein cholesterol (HDL-C) levels, giving a probability P and its standard deviation (SD) for each lipid and each individual.
In addition, predict the TG and HDL-C levels of all family members (value and SD). (This part will not be considered when ranking predictors in the assessment.)

Dataset Information

The dataset contains variant information for the five subjects in the FCH family, extracted from exome sequencing data. The exome sequencing data were generated on the Illumina HiSeq2000 platform with the TruSeq exome enrichment protocol and processed using an automated pipeline for next-generation sequencing data (Chapman, https://github.com/chapmanb/bcbb/tree/master/nextgen). Reads were aligned to hg19 with BWA, and SNPs and indels were called with GATK and Dindel.

Variant information for all five subjects is given in a single VCF file: Patient-21 (mother with elevated LDL-C phenotype), Patient-20 (father, normal), Patient-18 (daughter with elevated LDL-C phenotype), Patient-17 (daughter, normal), and Patient-19 (daughter with elevated LDL-C phenotype).

Note added about variant calls: In this dataset the variants were called separately for each patient, and a variant was called only if it differed from the reference. Any variant called this way therefore has only a 0/1 (heterozygous) or 1/1 (homozygous) genotype. (A per-patient VCF file does not list positions that match the reference, so there is no 0/0 call.) The per-patient files were combined into the merged VCF file (with variants from all five individuals) using vcf-merge from vcf-tools. As a result, if a variant is found in one patient but not in another, vcf-merge cannot tell whether the second patient is 0/0 or has no call, so it reports ".".
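The note above has a practical consequence for anyone filtering the merged file, which the following minimal Python sketch tries to illustrate. It treats "." as "unknown" rather than "reference" and keeps variants carried by all three subjects with elevated LDL-C and not called in either normal subject. The segregation filter, the function names, and the assumption that the sample columns are named exactly as the patients are labeled above are all illustrative; this is not the method used by the data providers or assessors.

```python
# Sample labels follow the dataset description; that the VCF columns use
# exactly these names is an assumption.
AFFECTED = ["Patient-21", "Patient-18", "Patient-19"]
UNAFFECTED = ["Patient-20", "Patient-17"]


def carries_variant(sample_field):
    """Return True for a 0/1 or 1/1 style call, None when the merged call is '.'."""
    genotype = sample_field.split(":")[0].replace("|", "/")
    if genotype in (".", "./."):
        return None  # reference or no call: the merged file cannot distinguish
    return any(allele not in ("0", ".") for allele in genotype.split("/"))


def segregating_variants(vcf_lines):
    """Yield (chrom, pos, ref, alt) for variants called in every affected
    subject and called in no unaffected subject (hypothetical filter)."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at the tenth column
            continue
        calls = {name: carries_variant(value)
                 for name, value in zip(samples, fields[9:])}
        in_all_affected = all(calls.get(s) is True for s in AFFECTED)
        in_any_unaffected = any(calls.get(s) is True for s in UNAFFECTED)
        if in_all_affected and not in_any_unaffected:
            yield fields[0], int(fields[1]), fields[3], fields[4]
```

Because "." may hide a missing call, a variant that passes this filter could still be present in a normal family member; the sketch only excludes variants that were positively called there.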

The datasets are available only to registered users; please log in to access the data.

Prediction submission format

Three submission templates are provided: the first for the causative variant(s), the second for the lipid disease predictions, and the third for the predicted TG and HDL-C levels. Please use these templates for your submission. Validation scripts are also provided, and predictors should check that the format is correct before submitting their predictions.

In the first template submission file, the first line (row) of the file contains header information indicating the columns. The first column "variant" contains a list of variants that you believe to be causative for the elevated LDL-C phenotype, and the second column contains a numerical "prediction" factor for that variant. Thus each line should contain the following two columns (tab-delimited):

Variant - In the left-most column, please give the variant in HGVS format. When identifying a specific gene, please include the gene name in front of the variant, separated by a colon (e.g., NM_004006.1(DMD):c.3G>T).
Prediction - In the second column, give a value in the range 0 to 1. The values in this column should sum to 1. For example, if you submit 15 variants for the elevated LDL-C phenotype, you could have 5 variants with p=0.1 and 10 variants with p=0.05 (5 × 0.1 + 10 × 0.05 = 1).

You may list up to 100 variants, i.e., the submission file may contain up to 100 rows.
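The rules for the first template (a header line, two tab-delimited columns, predictions between 0 and 1 that sum to 1, and at most 100 variants) can be checked mechanically. The sketch below is one possible check in Python; it is not the official validation script, and the function name and the rounding tolerance are assumptions made for illustration.

```python
import csv
import io
import math

MAX_VARIANTS = 100  # stated limit on rows, not counting the header


def check_template1(text, tolerance=1e-6):
    """Return a list of problems in a template-1 submission (empty if none).

    Hypothetical helper: the tolerance on the sum is an assumption.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]
    problems = []
    body = rows[1:]  # the first row is the header
    if not body:
        problems.append("no variants listed")
    if len(body) > MAX_VARIANTS:
        problems.append(f"{len(body)} variants listed; at most {MAX_VARIANTS} allowed")
    total = 0.0
    for number, row in enumerate(body, start=2):
        if len(row) != 2:
            problems.append(f"line {number}: expected 2 tab-delimited columns, found {len(row)}")
            continue
        try:
            value = float(row[1])
        except ValueError:
            problems.append(f"line {number}: prediction {row[1]!r} is not a number")
            continue
        if not 0.0 <= value <= 1.0:
            problems.append(f"line {number}: prediction {value} is outside the range 0 to 1")
        total += value
    if body and not math.isclose(total, 1.0, abs_tol=tolerance):
        problems.append(f"predictions sum to {total:.6f}, expected 1")
    return problems
```

An empty list means the file passed these checks; the HGVS syntax of the variant column is not inspected here.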

Download submission template1

Download validation script (not available)

In the second template submission file, the first line (row) of the file contains header information, followed by a separate line for each individual. The first column indicates the individual. In Columns 2 and 3, please provide the probability P(abnormal TG) of an abnormal TG level and the standard deviation thereof (confidence in the prediction of P). In Columns 4 and 5, please provide the probability P(abnormal HDL-C) of an abnormal HDL-C level and the standard deviation thereof (confidence in the prediction of P). The probabilities P should be in the range (0 – 1).

Format of Prediction Submission Template 2

              P (abnormal TG)   Standard deviation   P (abnormal HDL-C)   Standard deviation
Patient-21    *                 *                    *                    *
Patient-20    *                 *                    *                    *
Patient-18    *                 *                    *                    *
Patient-19    *                 *                    *                    *
Patient-17    *                 *                    *                    *

Download submission template2

Download validation script (not available)

In the third template submission file, the first line (row) of the file contains header information, followed by a separate line for each individual. The first column indicates the individual. In Columns 2 and 3, please provide the predicted TG level (mg/dL) and the standard deviation thereof. In Columns 4 and 5, please provide the predicted HDL-C level (mg/dL) and the standard deviation thereof.

Format of Prediction Submission Template 3

              TG (mg/dL)   Standard deviation   HDL-C (mg/dL)   Standard deviation
Patient-21    *            *                    *               *
Patient-20    *            *                    *               *
Patient-18    *            *                    *               *
Patient-19    *            *                    *               *
Patient-17    *            *                    *               *

Download submission template3

Download validation script

All submission template files are tab-delimited, plain text files. The first template file will contain as many lines as variants predicted (plus the header line). In the second and third template submission files, all blank cells are marked with an "*". Submit your predictions by replacing each "*" with your value. No empty cells are allowed in the submission; if you cannot submit predictions for a patient, leave the "*" in those cells. Please follow the submission guidelines strictly.
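The cell rules for the second and third templates (five tab-delimited columns per individual, every cell either a number or "*", probabilities in the range 0 to 1) can be sketched as a small Python check. This is not the official validation script. The function name, the `probabilities` flag, the requirement that every patient appears exactly as labeled above, and the rejection of negative values are assumptions made for illustration.

```python
import math

PATIENTS = {"Patient-21", "Patient-20", "Patient-18", "Patient-19", "Patient-17"}


def check_lipid_template(text, probabilities=True):
    """Return a list of problems in a template-2 or template-3 file.

    Set probabilities=False for template 3, where columns 2 and 4 hold
    lipid levels in mg/dL rather than probabilities.
    """
    problems = []
    seen = set()
    lines = text.splitlines()
    for number, line in enumerate(lines[1:], start=2):  # line 1 is the header
        cells = line.split("\t")
        if len(cells) != 5:
            problems.append(f"line {number}: expected 5 tab-delimited columns, found {len(cells)}")
            continue
        patient = cells[0]
        seen.add(patient)
        if patient not in PATIENTS:
            problems.append(f"line {number}: unknown individual {patient!r}")
        for column, cell in enumerate(cells[1:], start=2):
            if cell == "*":
                continue  # placeholder kept: no prediction for this cell
            if cell.strip() == "":
                problems.append(f"line {number}, column {column}: empty cell")
                continue
            try:
                value = float(cell)
            except ValueError:
                problems.append(f"line {number}, column {column}: {cell!r} is not a number")
                continue
            if not math.isfinite(value) or value < 0:
                problems.append(f"line {number}, column {column}: {cell!r} must be a non-negative number")
            elif probabilities and column in (2, 4) and value > 1:
                problems.append(f"line {number}, column {column}: probability {value} exceeds 1")
    for patient in sorted(PATIENTS - seen):
        problems.append(f"no line for {patient}")
    return problems
```

Calling `check_lipid_template(text)` checks a template-2 file, and `check_lipid_template(text, probabilities=False)` checks a template-3 file; an empty list means no problems were found.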

Methods description

In addition, your submission should include a detailed description of the method used to make the predictions (similar to the style of the Methods section in a scientific article). This information will be submitted as a separate file.

To submit predictions, you need to create or be part of a CAGI User group. Submit your predictions by accessing the link: "All submission forms" from the front page of your group. For more details, please read the FAQ page.

Data Providers