The Immunogenomics Data-Analysis Working Group: 2010 EFI Working Group Meeting Summary

2010 EFI IDAWG MEETING SUMMARY

The working group met during the EFI meeting in Florence, Italy, on 15 May 2010.

In Attendance:
Pierre-Antoine Gourraud, Wolfgang Helmberg, Jill Hollenbach, Steven Mack, Martin Maiers, Derek Middleton, Carlheinz Müller

We were unable to Skype with other members of the working group due to a lack of internet access in the meeting room.

Agenda items:
1 - Introduction
We noted that the IDAWG first met as a group at the previous year's EFI meeting in Ulm, Germany.

2 - Tissue Antigens Commentary
Working group members have agreed to write particular sections, and we discussed who has agreed to author which sections. We will distribute the working draft of the commentary to the working group for editing as a set of Google docs. While there is no set due date, we are hoping to have the commentary completed within 6 months.

Given the scope of the topics in the commentary outline, we discussed limiting the commentary to issues of data-management (along the lines of the Silver Standard), and distributing a broader standards 'white paper' online.

We agreed that it would be very useful to have a standard or set of standards published that can be cited in papers.

3 - 16th Workshop Projects
Ambiguity reduction standards
We agreed that it would be premature for the goal of this workshop project to be to objectively identify "good" or "bad" ambiguity reduction approaches, but debated the possibility of identifying 'best practices' for specific applications (e.g., for anthropology studies), or taking a neutral position and simply advocating for the expanded documentation of ambiguity reduction methods in publications. There was consensus that we should focus on an assessment of where the community currently stands with respect to ambiguity reduction approaches.

We also noted that different labs may perform ambiguity reduction differently depending upon the application (e.g., HSCT versus anthropology studies). We discussed a future need for BMDRs to move away from 2-digit resolution HLA typing, and with that, a need for ambiguity reduction standards, reiterating the need for different ambiguity reduction standards for different applications.

We discussed the difference between ambiguity reduction and typing resolution reduction (i.e., from 4 to 2 digits). Resolution reduction arises mainly when separate studies have been typed with different technologies and need to be interrelated, and it is more of a data-management issue.
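To make the distinction concrete, the following is a minimal Python sketch of typing resolution reduction only; it does not attempt ambiguity reduction. It assumes colon-delimited allele names of the form used elsewhere in this summary (e.g., A*01:01:01:01), in which "4 digits" corresponds to two fields and "2 digits" to one field. The function name reduce_resolution is hypothetical and was not discussed at the meeting.

```python
def reduce_resolution(allele: str, fields: int) -> str:
    """Truncate a colon-delimited allele name to its first `fields` fields.

    Illustrative only: assumes names look like LOCUS*F1:F2:..., and ignores
    any expression suffix attached to the final field.
    """
    if fields < 1:
        raise ValueError("fields must be at least 1")
    locus, sep, name = allele.partition("*")
    if not sep:
        raise ValueError(f"expected a name of the form LOCUS*FIELDS, got {allele!r}")
    return locus + "*" + ":".join(name.split(":")[:fields])


# Two studies typed at different resolutions can be compared at the lower one.
study_high = ["A*01:01:01:01", "A*02:01"]
print([reduce_resolution(a, 1) for a in study_high])  # ['A*01', 'A*02']
```

Because the truncation is purely mechanical, it can be applied uniformly to every data set, which is why it reads as a data-management step rather than an analytical choice.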

We discussed the potential for this project to determine the extent to which differences in ambiguity reduction methods affect downstream analyses. Toward this end, we proposed to survey the community by distributing model ambiguous data sets, with the goal of having participating groups reduce the ambiguities and document how that was done, and then look at differences in common analytical outcomes. It seems likely that this survey can be fairly simple and that participation can be relatively painless for interested groups.

Simple data management tools development
The central idea for this project is to develop tools that use a common data-sharing format for input and output, so that the output of one tool could be used as input for others. The IDAWG would come up with a useful data-sharing format, so that anyone in the working group or the larger community who wanted to develop tools could use that format, which would also foster community data-sharing standards.

This would produce a tangible asset after the workshop, and we discussed the need for resources and personnel to keep, for example, a web-based venue for these tools up and running.

There was consensus that the IDAWG’s role in the project would be to stipulate a data format, platform, etc., and it would be up to individual project participants to develop tools that fit into that framework.

The point was made that if we have a standard for how to share data, tools will be developed according to that format. With that in mind, the most important tool for the IDAWG to develop would be a data-format validation tool, so that there is a means to ensure that data properly adhere to the format.

Concerns were raised that this may be too much to accomplish in time for the 16th Workshop.

Novel data-analysis methods development
We discussed the idea of producing a methods manual, and bringing in people in the community who are actively engaged in these efforts in the context of the workshop. As the IDAWG was envisioned as a way to bring the interested parties together, the hope is to use the workshop as a forum for discussing new approaches to data-analysis, even if there is not a final ‘product’.

Reporting guidelines for journals
The goal of this project would be to develop a STREGA-like statement for immunogenetic studies; the name STREISS (STrengthening the Reporting of Immunogenetic StudieS) was proposed. There was consensus that the first step toward this is to develop the Commentary for Tissue Antigens.

Other project ideas for the Workshop
We discussed options for cooperating with the HLA-NET project, and reiterated our commitment to the IDAWG workshop projects being open to anyone who wishes to participate.

We concluded these 16th Workshop project discussions with the acknowledgement that we need to communicate our goals and opportunities for collaboration to the community, and hope that the Tissue Antigens commentary will go a long way towards generating community involvement.

4 - Data sharing standards
We discussed some specific ideas for data-sharing standards, with an example based on the GLstring data reporting format used by the KIR community. The GLstring format uses a hierarchy of operators to define KIR genotypes.

We discussed the use of accession numbers as a nomenclature-independent means of identifying alleles in ambiguous HLA or KIR genotypes, and debated the merits of the complete enumeration of ambiguities versus the use of collapsed ambiguity strings. The NMDP already uses accession numbers in this fashion.
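The trade-off between complete enumeration and collapsed ambiguity strings can be illustrated with a short Python sketch. The delimiters are assumptions made for this example, not something recorded in this summary: "/" separates alternative alleles for one gene copy and "+" joins the two gene copies. Apart from HLA00001, the accession-style identifiers are placeholders, and the function name expand is hypothetical.

```python
from itertools import product


def expand(collapsed: str) -> set[str]:
    """Enumerate every genotype implied by a collapsed ambiguity string.

    Assumed syntax (illustrative): '+' joins gene copies, '/' separates
    alternative alleles within one copy.
    """
    copies = [part.split("/") for part in collapsed.split("+")]
    return {"+".join(combo) for combo in product(*copies)}


# Complete enumeration: the two genotypes a typing result is consistent with.
observed = {"HLA00001+HLA00005", "HLA00002+HLA00006"}

# Collapsed form: the alternatives for each gene copy are pooled.
collapsed = "HLA00001/HLA00002+HLA00005/HLA00006"

implied = expand(collapsed)
# The collapsed string also implies pairings that were never observed.
print(sorted(implied - observed))  # ['HLA00001+HLA00006', 'HLA00002+HLA00005']
```

Under these assumptions the collapsed string is shorter but implies four genotypes where only two were observed, whereas the enumerated form is longer but preserves exactly which pairings are possible.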

We discussed expanding the GLstring format so that it could be applied to any gene-system with a centralized (accession number based) allele nomenclature. A lively discussion ensued regarding possible ways to encode genotypes, and the application of various operators to explicitly represent ambiguity and minimize the loss of information. This was followed by more lively discussion about the use of XML.

Tools could be developed to specifically handle this type of data, which would not necessarily be human readable. However, at any point in an analysis the tool could be queried and traditional human-readable allele-names would be provided.

We discussed at length various examples of how this encoding would work and what an HLA genotype would look like, and the manner in which all of the locus and allele information would be embedded within these codes.

There was considerable discussion of accession numbers that pertained to the downside of using leading zeros, and the difference between coding alleles as numbers/integers versus as alphanumeric strings. It was also noted that the accession numbers for HLA or KIR alleles in Genbank are inconsistent with the accession numbers in the IMGT/HLA database (e.g., the Genbank accession number for the A*01:01:01:01 allele is GU812295, while the IMGT/HLA accession number is HLA00001).
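The leading-zero concern can be shown with a small Python sketch. It assumes, based only on the HLA00001 example above, an accession number made of a fixed "HLA" prefix followed by a five-digit, zero-padded number; the helper names to_integer and to_string are hypothetical.

```python
def to_integer(accession: str, prefix: str = "HLA") -> int:
    """Code an accession number as an integer, discarding prefix and padding."""
    if not accession.startswith(prefix):
        raise ValueError(f"{accession!r} does not start with {prefix!r}")
    return int(accession[len(prefix):])


def to_string(number: int, prefix: str = "HLA", width: int = 5) -> str:
    """Rebuild the string form; the padding width must be known in advance."""
    return f"{prefix}{number:0{width}d}"


number = to_integer("HLA00001")
print(number)                 # 1
print("HLA" + str(number))    # HLA1, the leading zeros are lost
print(to_string(number))      # HLA00001, recovered only because width=5 was assumed
```

Storing the identifier as an alphanumeric string avoids the round-trip problem entirely, while integer coding is more compact but depends on every tool agreeing on the prefix and padding width.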

This discussion made clear that the coding of ambiguities is not a simple prospect; overall, there were many good suggestions made, along with agreement that there is a lot of promise to this approach if the details can be thoroughly worked out.