Immunogenomics Data-Analysis Working Group: HFSP Letter of Intent

HFSP LETTER OF INTENT

a. Title
Solutions for Immunogenomics Data Management and Analysis

b. Keywords
Immunogenomics, evolutionary genetic analysis, clinical outcome analysis, biostatistics, data analysis

c. Which disciplines are represented among the members of your team
Bioinformatics, computational biology, genetics, statistics, mathematics, immunology, computer science, evolutionary genetics

d. Summary of overall project
The goal of the proposed work is to develop and validate novel analytical tools, methods, and standards for sharing highly polymorphic immunogenomic data (e.g., HLA, KIR, LIR) and the consistent application and interpretation of analyses by the immunogenomics and genomics communities.

Varying levels of resolution between data sets, a lack of clear standards for data capture and for the documentation of data processing, and changes in nomenclature over time result in the underutilization of publicly available immunogenetic data (e.g., for subsequent re-analysis or meta-analysis) and in inconsistent analysis of these data. The high level of polymorphism associated with these genetic systems presents particular biological, statistical, and computational challenges that have yet to be addressed in a standardized manner.

We propose to develop data standards that will ensure that data sets from various typing platforms are handled consistently by extant and future analytical methods, promoting widespread accessibility and application. We will determine the impact of these novel standards and methods on downstream data analyses, compare the results to those based on current practice, and produce recommendations intended to foster consistency in the analysis of highly polymorphic data sets.

Principal Applicant - STEVEN MACK
Dr. Mack and his colleague Dr. Jill Hollenbach will coordinate the development of the tools and methods proposed in this study. Together with their colleagues Dr. Derek Middleton and Dr. Diogo Meyer, Drs. Mack and Hollenbach will implement a data-analysis "pipeline" to validate the efficacy of these project-generated resources on a large, heterogeneous collection of population, disease-association, and family-study data sets that have been generated using a variety of methods and under a range of nomenclatures. A critical aspect of this pipeline will be the incorporation of data generated via high-throughput sequencing technologies. An infrastructure will also be developed to allow public access to existing data, expanding upon the applicants’ current efforts in this regard (e.g., http://pypop.org/popdata, http://www.allelefrequencies.net and http://www.ncbi.nlm.nih.gov/gv/mhc/). Finally, web-based tools will be developed and integrated into a project website so that the methods developed under the proposed study can be applied directly, permitting a consistent and accessible approach to future meta-analysis of these data.

Co-Applicant 2 - RICHARD SINGLE
Dr. Single will create a general framework for assessing the impact of assumptions made in the data generation/curation stage. Specifically, Dr. Single will carry out a resampling study to quantify the impact on downstream analyses of different ways of resolving ambiguities (i.e., the inability to distinguish between specific variants due either to unassessed polymorphisms or to an inability to establish phase between polymorphisms). This study will use the likelihood of each possible genotype for each ambiguously typed individual to generate replicate data sets in which ambiguities are resolved probabilistically based on these likelihoods. Each of these replicate data sets will be used in downstream analyses to generate a distribution of results for any given test statistic. Additionally, Dr. Single will develop methods to address the assumptions of equilibrium vs. non-equilibrium conditions that underlie tests of neutrality and haplotype estimation. The results generated in this portion of the proposed work will inform the recommendations for the application of core analytical methods as well as de novo analytical methods, and the implementation of the web-based tools developed as part of this proposal.
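
The resampling scheme described above can be illustrated with a minimal Python sketch. It is not the applicants' implementation: the data layout (a list of candidate genotypes with likelihoods per individual), the function names, the example allele names, and the choice of an allele frequency as the downstream statistic are all assumptions made for illustration.

```python
import random
from collections import Counter


def resolve_ambiguities(individuals, rng):
    """Draw one genotype per individual in proportion to its likelihood.

    `individuals` is a hypothetical layout: one list per individual, holding
    (genotype, likelihood) pairs for every genotype consistent with the typing.
    """
    resolved = []
    for candidates in individuals:
        genotypes = [genotype for genotype, _ in candidates]
        weights = [likelihood for _, likelihood in candidates]
        resolved.append(rng.choices(genotypes, weights=weights, k=1)[0])
    return resolved


def allele_frequency(genotypes, allele):
    """Example downstream statistic: frequency of one allele in a data set."""
    counts = Counter(a for genotype in genotypes for a in genotype)
    return counts[allele] / sum(counts.values())


def replicate_statistics(individuals, statistic, n_replicates=1000, seed=0):
    """Resolve ambiguities repeatedly and collect the statistic per replicate."""
    rng = random.Random(seed)
    return [
        statistic(resolve_ambiguities(individuals, rng))
        for _ in range(n_replicates)
    ]


# Hypothetical input: the first individual is unambiguous, the others are not.
individuals = [
    [(("A*01:01", "A*02:01"), 1.0)],
    [(("A*01:01", "A*01:01"), 0.7), (("A*01:01", "A*01:02"), 0.3)],
    [(("A*02:01", "A*03:01"), 0.5), (("A*02:05", "A*03:01"), 0.5)],
]
stats = replicate_statistics(
    individuals,
    lambda genotypes: allele_frequency(genotypes, "A*01:01"),
    n_replicates=200,
)
```

The spread of `stats` across replicates is the quantity of interest: it shows how much of the variation in a result is attributable to ambiguity resolution rather than to the underlying data.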

Co-Applicant 3 - WOLFGANG HELMBERG
DNA typing in highly polymorphic systems is prone to cross-reaction with highly similar paralogous loci. A system that stores the actual typed sequences is essential if earlier observations are to be re-interpreted in the light of newly discovered alleles or loci, or if the data are to remain available for interrogation for specific sequence motifs. As an example, the main challenge of KIR is its genomic content variation combined with allelic variation and highly paralogous loci. Dr. Helmberg’s initial task in this group will be to develop and support a system with the versatility to deliver the correct interpretation of whatever kind of genotyping data have been accumulated and to perform the necessary ambiguity reduction, given the allelic differentiation known at any point in time. This system will expand on the system he has designed for typing of HLA loci and will preserve vital sequence-level information. Additionally, Dr. Helmberg will develop methods to analyze these data at the DNA and amino acid sequence levels, in order to identify particular motifs that are important in disease and human adaptation. These tools and methods will be applicable to any highly polymorphic gene system.

Co-Applicant 4 - STEVEN MARSH
Dr. Marsh and his colleague Hazael Maldonado-Torres propose to develop a dynamic and secure distributed system that will operate as a shared communication channel to coordinate the parallel and distributed computations arising from the biocomputing algorithms used by the immunogenomics and genomics communities.

This system will have a service-oriented architecture consisting of a tuple space and groups of well-behaved network services, clients, and servers representing implemented analyses, task collectors (desktop or web applications), and computing units, respectively. It will operate on commodity technology and will offer a scalable, evolvable, and flexible network system whose instances (services, clients, and servers) can freely and transparently join or leave the system.
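
The tuple-space coordination pattern referred to above can be sketched in a few lines of Python. This is a single-process illustration only, using threads in place of networked computing units; the class `TupleSpace`, its methods `out` and `take`, and the task/result tuple layout are hypothetical names chosen for this sketch and say nothing about the design of the proposed system.

```python
import threading


class TupleSpace:
    """Minimal in-process tuple space: a shared bag of tuples matched by pattern."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Write a tuple into the space and wake any waiting takers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def take(self, pattern, timeout=None):
        """Remove and return the first matching tuple; None fields are wildcards.

        Returns None if no matching tuple appears before the timeout expires.
        """
        with self._cond:
            found = self._cond.wait_for(
                lambda: self._find(pattern) is not None, timeout=timeout
            )
            if not found:
                return None
            tup = self._find(pattern)
            self._tuples.remove(tup)
            return tup

    def _find(self, pattern):
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                p is None or p == t for p, t in zip(pattern, tup)
            ):
                return tup
        return None


def worker(space):
    """A computing unit: take tasks until none arrive, then leave the system."""
    while True:
        task = space.take(("task", None, None), timeout=0.5)
        if task is None:
            return
        _, task_id, payload = task
        space.out(("result", task_id, sum(payload)))


space = TupleSpace()
workers = [threading.Thread(target=worker, args=(space,)) for _ in range(3)]
for w in workers:
    w.start()

# A client posts tasks without knowing which worker will pick them up.
for i in range(5):
    space.out(("task", i, (i, i + 1)))

results = {}
for i in range(5):
    _, task_id, value = space.take(("result", i, None), timeout=5)
    results[task_id] = value

for w in workers:
    w.join()
```

Because clients and workers interact only through the shared space, either side can be added or removed without the other being reconfigured, which is the property the paragraph above describes as instances joining or leaving freely.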

Dr. Marsh and Maldonado-Torres also aim to adapt current implementations of analytical tools and methods by converting them from sequential to parallel/distributed algorithms to be included as services, e.g., exact tests, Monte Carlo simulation tests, and Expectation-Maximisation tests. Additionally, they plan to develop trimming strategies for the Hardy-Weinberg enumeration test and its derivatives, as well as divide-and-conquer strategies to reduce the number of combinations that must be generated.
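
As an illustration of the kind of Monte Carlo test mentioned above, the following Python sketch tests Hardy-Weinberg proportions by shuffling alleles among individuals. It is a sequential sketch under stated assumptions, not one of the implementations to be adapted: the function names are hypothetical, and the statistic used is the part of the conditional probability of a genotype table (given the allele counts) that varies between tables, a common choice for exact-style Hardy-Weinberg tests. Each permutation is independent of the others, which is what makes this type of test a natural candidate for distribution across computing units.

```python
import math
import random
from collections import Counter


def log_table_weight(genotypes):
    """Log of the table-dependent part of the conditional genotype probability.

    For fixed allele counts, the probability of a genotype table under
    Hardy-Weinberg proportions is proportional to 2**H / prod(n_ij!), where H
    is the number of heterozygotes and n_ij are the genotype counts.
    """
    counts = Counter(tuple(sorted(genotype)) for genotype in genotypes)
    heterozygotes = sum(n for (a, b), n in counts.items() if a != b)
    return heterozygotes * math.log(2) - sum(
        math.lgamma(n + 1) for n in counts.values()
    )


def monte_carlo_hwe(genotypes, n_permutations=10000, seed=0):
    """Estimate a p-value by randomly re-pairing the observed alleles."""
    rng = random.Random(seed)
    alleles = [allele for genotype in genotypes for allele in genotype]
    observed = log_table_weight(genotypes)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(alleles)
        permuted = list(zip(alleles[0::2], alleles[1::2]))
        # Count tables that are no more probable than the observed one;
        # the small tolerance guards against floating-point summation order.
        if log_table_weight(permuted) <= observed + 1e-9:
            as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (as_extreme + 1) / (n_permutations + 1)


# Hypothetical samples: one with no heterozygotes, one close to expectation.
deficit = [("A", "A")] * 10 + [("B", "B")] * 10
balanced = [("A", "A")] * 5 + [("A", "B")] * 10 + [("B", "B")] * 5

p_deficit = monte_carlo_hwe(deficit, n_permutations=2000)
p_balanced = monte_carlo_hwe(balanced, n_permutations=2000)
```

In this sketch `p_deficit` should be small, reflecting the complete absence of heterozygotes, while `p_balanced` should be large because that sample matches Hardy-Weinberg expectations.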

f. Key references related to the project
Helmberg W, et al., Virtual DNA analysis--a new tool for combination and standardised evaluation of SSO, SSP and sequencing-based typing results. Tissue Antigens. 1998 Jun;51(6):587-92.

Helmberg W, et al., Virtual DNA analysis as a platform for interlaboratory data exchange of HLA DNA typing results. Tissue Antigens. 1999 Oct;54(4):379-85.

Chen JJ, et al., Hardy-Weinberg testing for HLA class II (DRB1, DQA1, DQB1, and DPB1) loci in 26 human ethnic groups. Tissue Antigens. 1999 Dec;54(6):533-42.

Mack SJ, et al., Methods used in the generation and preparation of data for analysis in the 13th International Histocompatibility Workshop. 13th International Histocompatibility Workshop Anthropology/Human Genetic Diversity Joint Report. In: J.A. Hansen, ed. Immunobiology of the Human MHC: Proceedings of the 13th International Histocompatibility Workshop and Conference, Volume I. pp. 564-579. Seattle, WA: IHWG Press, 2007.

g. Innovative aspects
The proposed work will produce a coherent, consistent analytical strategy for immunogenomic data that is absent from current immunogenetic research. This strategy will reduce extant barriers to complex analyses of highly polymorphic data sets through the application of data recording and documentation standards and the availability of web-based analytical tools. Further, the impact of discrete data management and data processing methods on downstream analyses will be quantified, providing guidance for ongoing investigations throughout the immunogenomics field.

h. Collaborative elements
Team members have attempted to resolve the issues addressed in this project independently, but the challenges associated with immunogenomic data require an interdisciplinary approach: methodologies developed by Dr. Helmberg will be employed by Drs. Mack, Hollenbach, Meyer, and Middleton. Results will be compared using Dr. Single’s methods to produce recommendations for the implementation of the analytical methods developed by Dr. Marsh and Maldonado-Torres. The results of each of these steps will "feed back" to inform the development of new steps in an integrated manner.

i. Interdisciplinarity
Drs. Mack and Hollenbach are immunogeneticists/genetic epidemiologists, while Dr. Single is a statistician. Whereas the former are adept in the use of statistical methods for specific purposes, Dr. Single is skilled in the development of new statistics and novel statistical applications.