PROJECT RATIONALE
Immunogenetics is an international field with a history that spans more than fifty years. Immunogenetic genotyping, data-generation and data-analysis methods have proliferated in this time, in many cases in inconsistent ways. We have identified five areas where (often unavoidable) inconsistency can be introduced into genotype data; these inconsistencies in turn contribute to variation in the analysis of immunogenomic data, and the interpretation of such analyses.
Challenges to Consistency in Immunogenomic Data Management and Analysis
Variation in Typing Methodology
The large variety of typing methods in use results in datasets being generated with varying levels of resolution, so that a particular allele (or set of equivalent alleles) may be identified in a sample using one method, while a slightly different allele (or set of equivalent alleles) may be identified in the same sample, using a different method.
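One way to picture the resolution problem is to compare two allele calls only after reducing both to the coarser resolution they share. The sketch below assumes colon-delimited allele names (for example "A*02:01:01:01"). The helper names truncate_allele and common_resolution are hypothetical, introduced only for illustration. It ignores complications such as expression suffixes and is not a substitute for a curated equivalency table.

```python
from typing import Tuple


def truncate_allele(name: str, fields: int) -> str:
    """Keep only the first `fields` colon-delimited fields of an allele name.

    Assumes names of the form LOCUS*field1:field2:... (an assumption of
    this sketch; trailing fields, and any suffix on them, are dropped).
    """
    locus, _, digits = name.partition("*")
    return locus + "*" + ":".join(digits.split(":")[:fields])


def common_resolution(a: str, b: str) -> Tuple[str, str]:
    """Reduce two allele names to the lower of their two resolutions."""
    n = min(a.count(":"), b.count(":")) + 1
    return truncate_allele(a, n), truncate_allele(b, n)


# Two methods report the same sample at different resolutions.
high_res_call = "A*02:01:01:01"
low_res_call = "A*02:01"
left, right = common_resolution(high_res_call, low_res_call)
calls_agree = left == right
```

Truncation of this kind loses information by design, so agreement at the shared resolution does not establish that the two methods identified the same allele.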
Changes in Nomenclature
The continual evolution of the nomenclature conventions over time has resulted in a progression of allele and locus identifiers that are not always easily inter-related. It is usually the case that the nomenclature used to identify the alleles in a dataset is "frozen" at the time when that dataset is generated and published, which can make comparisons with datasets generated under successive nomenclature conventions difficult.
Variation in Data Management Standards
The lack of clear standards for recording and storing immunogenetic data often makes it difficult to integrate datasets for meta-analyses, or even to share data between research groups. This becomes especially problematic with very large datasets, because data are often reformatted by hand.
Variation in Ambiguity Reduction Methods
When polymorphisms that distinguish alleles are not assessed (e.g., because they are in an exon that is not interrogated by the typing method employed), the result is allelic ambiguity, where the exact identity of one or both of the alleles present in a given sample at a given locus cannot be known. When it is not possible to establish phase between key polymorphisms common to many alleles, the result is genotypic ambiguity, where multiple genotypes are possible for a given sample. Ambiguity reduction is the process of choosing one allele over another, or one genotype over another. Currently, there is no standard method for reducing these ambiguities, and different research groups may apply different methods to the same set of ambiguities, resulting in different alleles and genotypes being chosen for the same typing result.
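The following sketch illustrates how two reduction rules can resolve the same allelic ambiguity differently. Everything in it is assumed for illustration: the locus "X", its allele names, the reference frequencies, and the two rules (reduce_by_frequency and reduce_by_lowest_name) are hypothetical and are not taken from any published method or dataset.

```python
from typing import Dict, List

# Hypothetical ambiguous typing result: the typing method cannot
# distinguish among these three alleles at locus "X".
ambiguous_call: List[str] = ["X*01:03", "X*01:01", "X*01:02"]

# Hypothetical reference frequencies (invented numbers, not real data).
reference_freqs: Dict[str, float] = {
    "X*01:01": 0.05,
    "X*01:02": 0.01,
    "X*01:03": 0.20,
}


def reduce_by_frequency(candidates: List[str], freqs: Dict[str, float]) -> str:
    """Rule 1 (assumed): choose the candidate most frequent in a reference."""
    return max(candidates, key=lambda allele: freqs.get(allele, 0.0))


def reduce_by_lowest_name(candidates: List[str]) -> str:
    """Rule 2 (assumed): choose the candidate with the lowest-numbered name."""
    def numeric_fields(allele: str) -> List[int]:
        return [int(field) for field in allele.split("*")[1].split(":")]

    return min(candidates, key=numeric_fields)


choice_a = reduce_by_frequency(ambiguous_call, reference_freqs)
choice_b = reduce_by_lowest_name(ambiguous_call)
rules_disagree = choice_a != choice_b
```

With these invented inputs the two rules select different alleles from the same typing result, which is the inconsistency described above. The frequency-based rule would also give different answers depending on which reference population is used.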
High Polymorphism
The high level of polymorphism associated with these genetic systems presents particular bioinformatic, statistical and computational challenges that have yet to be addressed in a standardized manner. For example, haplotype estimation, case-control association studies and tests of fit to Hardy-Weinberg equilibrium (HWE) are all subject to biases introduced by large numbers of low-frequency alleles (also known as sparse cells). Guidelines based on theoretical and empirical considerations are required for each method in order to ensure consistency in the application and interpretation of these analyses.
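To make the sparse-cell problem concrete, the sketch below computes the genotype counts expected under HWE from observed allele frequencies and reports the fraction of genotype cells whose expected count falls below a threshold. The function names, the toy genotypes, and the threshold of 5 (a common rule of thumb for chi-square tests, not a value given in this document) are assumptions of the example.

```python
from collections import Counter
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]


def hwe_expected_counts(genotypes: List[Genotype]) -> Dict[Genotype, float]:
    """Expected genotype counts under HWE, from observed allele frequencies."""
    n = len(genotypes)
    allele_counts = Counter(allele for g in genotypes for allele in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    expected = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        p = freqs[a] * freqs[b]
        # Homozygotes: n * p_a^2; heterozygotes: n * 2 * p_a * p_b.
        expected[(a, b)] = n * (p if a == b else 2 * p)
    return expected


def sparse_fraction(expected: Dict[Genotype, float], threshold: float = 5.0) -> float:
    """Fraction of genotype cells with an expected count below `threshold`."""
    sparse = sum(1 for count in expected.values() if count < threshold)
    return sparse / len(expected)


# Toy sample (hypothetical): one common allele and several rare ones.
sample: List[Genotype] = (
    [("a1", "a1")] * 30
    + [("a1", "a2")] * 6
    + [("a1", "a3")] * 2
    + [("a2", "a4")] * 1
    + [("a3", "a5")] * 1
)
expected_counts = hwe_expected_counts(sample)
fraction_sparse = sparse_fraction(expected_counts)
```

The number of genotype cells grows roughly with the square of the number of alleles, so at a highly polymorphic locus most cells have very small expected counts even in a large sample. That is why approximations that assume well-populated cells can be unreliable here.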
Solutions for Immunogenomic Data Management and Analysis
Given these challenges to consistent data analysis, the Immunogenomics Data-Analysis Working Group proposes to develop data equivalency standards intended to foster consistency in the use of extant and future analytical methods, and to develop novel statistical and computational methodologies for the analysis of highly polymorphic loci.
In addition, we will determine the impact of various standards and methods for data management on downstream data analyses, comparing them to extant immunogenetic data analysis systems, and producing recommendations for consistency in the analysis of highly polymorphic datasets.
Finally, we will promote widespread accessibility and application of these novel data equivalency and analytical tools by making them available to the community using web-based and multi-platform approaches.