Proteomic Technologies Informatics Workshop - February 8-9, 2005 - The Fairmont Olympic Hotel,  Seattle, WA

Discussion Sessions

Each of the following sessions featured brief presentations, with the remainder of time allocated to group discussion and input.

Session 1: Use-Cases for a Proteomics Data Repository
Discussion Leaders:
John J.M. Bergeron, D.Phil., McGill University
Raju Kucherlapati, Ph.D., Harvard Medical School-Partners HealthCare System, Inc.
Philip Jones, M.Sc., European Bioinformatics Institute

Leaders discussed the capabilities that the mouse proteomic technology repositories need in order to be useful to the proteomics community. Speakers were asked to consider two sources of users: consortia and members of the public. Workshop members discussed several anticipated uses of the consortia data, informatics tools development, and the data demands (i.e., raw versus processed) anticipated from the user community.

Dr. Bergeron:

The challenge is to make locally developed approaches useful to the larger community. At McGill University, knockout models for proteins involved in the damage/regeneration of liver disease are studied. Using enrichment by clathrin-coated vesicles, tandem MS of subunits has been shown to be consistent with stoichiometric abundance. While all of the spectra can be assigned to peptides in various databases, the rat genome is constantly shifting. The vast majority of peptide clusters are assigned to five organelles. Examination of peptide clusters offers a visual tool to sift through large volumes of data.

The CellMapBase application, which is based on primary sequence rather than on protein name, is the backbone of the bioinformatics pipeline. CellMapBase consists of a protocol library plus repositories for files, images, and archive/backup/export. The annotation pipeline, moving from the CellMap database to the annotation database so that proteins are identified correctly, has proven challenging.

Discussion:

One participant inquired about use cases. Dr. Bergeron noted that users may submit tandem mass spectra, and the McGill group can determine with confidence whether a spectrum can be assigned to a peptide or protein. To support the consortia, there are in-house methodologies to gather mass spectrometric data. McGill can work with the consortia to obtain raw tandem MS data or to evaluate whether a particular method is best for the McGill database. The data are reprocessed on a regular basis to accommodate changes in reference databases. Consortium members may contact the McGill facility online to determine whether specific methods are applicable.

One attendee suggested creating semantic meta-data registration, e.g., registering the meaning of all data fields, so that users know immediately whether their fields map to those specified at McGill.
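The registration idea can be illustrated with a minimal Python sketch. It assumes a hypothetical registry that maps each field name to its registered meaning; the field names and definitions below are invented for illustration and do not come from the McGill system.

```python
# Hypothetical registry: field name -> registered meaning.
# All names and definitions here are illustrative assumptions.
REGISTRY = {
    "precursor_mz": "Mass-to-charge ratio of the precursor ion",
    "charge_state": "Charge of the precursor ion",
    "peptide_sequence": "Amino acid sequence assigned to the spectrum",
}


def check_fields(submitted_fields):
    """Split submitted field names into registered and unregistered ones."""
    mapped = {f: REGISTRY[f] for f in submitted_fields if f in REGISTRY}
    unmapped = [f for f in submitted_fields if f not in REGISTRY]
    return mapped, unmapped


# A submitter checks their own field names against the registry.
mapped, unmapped = check_fields(["precursor_mz", "charge_state", "retention_time"])
```

With such a registry, a user could see at once which fields already have an agreed meaning and which would need to be registered or renamed before submission.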

Dr. Kucherlapati:

Dr. Kucherlapati discussed the "information lifecycle" that spans the analytical chemistry lab, collaborative efforts, and repositories of publicly available data. The analytical chemistry lab creates a protein identification algorithm and laboratory information management system (LIMS) that will enable multitasking, collect required annotations, store instrument files, and facilitate proteomics processes and communication efforts. A Collaboration Data Management System is then needed to integrate data produced at different sites into a unified scheme that potentially enforces minimum annotation sets for collaborative analysis and to provide an environment for analysis across all collaboration data sets. Publicly available data are then stored in experiment repositories (e.g., PRIDE; see Jones presentation for details) or reference data repositories (e.g., BIND, Swiss-Prot, or Protein Data Bank). The Harvard Partners Center for Genetics & Genomics (HPCGG) leverages its custom-built Gateway for Integrated Genomics-Proteomics Applications and Data System to provide a LIMS environment. Sequest is used for protein identification. The HPCGG is currently planning to leverage a customized version of the NCI's cancer LIMS (caLIMS) for the collaboration data management system.

However, there are several "chokepoints" in the information flow under the present design. First, high-throughput versions of protein identification algorithms rely on incomplete sequence databases. Moreover, proteins that are not adequately represented in the sequence databases may never flow across the link from the LIMS to the collaboration data management system.

Dr. Kucherlapati also noted that, given deficiencies in current sequence databases, polymorphic changes within proteins and post-translational modifications may increase the rate of false-positive or incorrect assignments. While it is possible to add specific instances of these items to the database, it is essential to know upfront what one is looking for. More robust sequence databases will become available that will be dynamic and consistently improving, and protein identification algorithms are continuing to evolve in sophistication and utility. Nonetheless, information loss within the current information flow and problems caused by the transport and storage of large instrument files remain challenging.

Dr. Kucherlapati offered two general directions for potential solutions: facilitating movement of instrument files and facilitating movement of algorithms to data. The former strategy requires a means of transporting instrument data files to researchers who wish to analyze them algorithmically. For the latter, remote reanalysis must be enabled for raw instrument files that are physically dispersed among their sites of creation. Data grid technologies may be useful for such a strategy.

Discussion:

Attendees discussed the key properties and questions that users would require of proteomics data and informatics systems. It was noted that intellectual property management will be critical; once results are published, supporting data must be made available. Pre-release of data depends on the nature of the data, although data and annotation must be comparable to those used in academic publication. Also, it will be essential for investigators to provide users with the information necessary to reproduce a given experiment. Recently, HUPO and the Plasma Proteome Consortium sent identical samples to 36 participating labs for analysis, yielding a slate of approaches and techniques for processing, analysis, database searching, and reporting. Thus, there is a great need for standardized, certified processes, which could in turn be referenced when an article is published. One participant noted that standardization may stifle innovation, but it was agreed that reporting to the community must be carried out through standardized processes.

Mr. Jones:

Mr. Jones discussed experiences with the PRoteomics IDEntifications Database (PRIDE; http://www.ebi.ac.uk/pride), a data repository and data transfer format for protein and peptide identifications and supporting evidence. He observed that many requirements must be considered, including the nature of likely queries and of user response, the types of proteomic data to include, ways to promote and encourage data submission, common standards for data exchange, and the level of detail included. A wide range of queries will likely be posed, including literature reference, protein identification, protein family, peptide, sequence, sample processing methods, environmental conditions, and parameters of search engines and instruments used. Addressing such needs requires common controlled vocabularies and ontologies (e.g., species, tissue, disease, genotype, instrument), clear definitions of the products that will be returned to the user, and the formats of such returns. Controlling the volume of data is also essential; the sheer volume of raw data will swell the database to terabytes in magnitude, and peak lists will initially involve gigabytes and will swell to terabytes at a later stage.
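The role of controlled vocabularies can be illustrated with a short Python sketch. It assumes a hypothetical set of allowed terms per annotation field; the terms are placeholders rather than entries from an actual ontology or from PRIDE.

```python
# Hypothetical controlled vocabularies; the terms are placeholders,
# not entries from an actual ontology.
CONTROLLED_VOCAB = {
    "species": {"Mus musculus", "Homo sapiens", "Rattus norvegicus"},
    "tissue": {"liver", "plasma", "brain"},
}


def validate_annotation(annotation):
    """Return (field, value) pairs whose value is not an allowed term."""
    problems = []
    for field, value in annotation.items():
        allowed = CONTROLLED_VOCAB.get(field)
        # Fields without a registered vocabulary are not checked.
        if allowed is not None and value not in allowed:
            problems.append((field, value))
    return problems


problems = validate_annotation({"species": "Mus musculus", "tissue": "hepatic tissue"})
```

Checking submissions this way keeps free-text variants of the same concept from fragmenting later queries by species, tissue, or instrument.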

In addition to allowing data submission, the flexibility to exchange data is crucial. A successful model of a collaborative effort to achieve this goal is the Proteomics Standards Initiative (PSI) effort for the exchange of protein interaction data using the PSI Molecular Interaction XML format. The PSI General Proteomics Standards (GPS) Workgroup is developing data formats for submission and inter-repository exchange that include the Minimum Information about a Proteomics Experiment (MIAPE), the PSI object model, the PSI/GPS ontology, and data exchange formats such as mzData (for instrument output and peak lists) and mzIdent (for peptide and protein identifications).

PRIDE has addressed these problems by offering:

  • An XML schema for transfer of proteomics protein identification data
  • A relational database implementation for the data repository and a central data repository, with the intention of implementing a network of federated databases
  • Secure upload of proteomic data in the PRIDE XML schema
  • The ability to search the repository and download results in PRIDE XML or HTML formats
  • This set of tools, made available and open-source upon release
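The idea of transferring identification data as XML can be sketched in Python with the standard library. The element names and the accession value below are hypothetical and chosen only for illustration; they are not the PRIDE XML schema.

```python
import xml.etree.ElementTree as ET


def build_record(accession, peptides):
    """Serialize one protein identification to XML (hypothetical element names)."""
    root = ET.Element("Identification")
    ET.SubElement(root, "ProteinAccession").text = accession
    peptide_list = ET.SubElement(root, "Peptides")
    for sequence in peptides:
        ET.SubElement(peptide_list, "Peptide").text = sequence
    return ET.tostring(root, encoding="unicode")


def parse_record(xml_text):
    """Read the accession and peptide sequences back from the XML text."""
    root = ET.fromstring(xml_text)
    accession = root.findtext("ProteinAccession")
    peptides = [p.text for p in root.iter("Peptide")]
    return accession, peptides


# "P00000" is a placeholder accession, not a real database entry.
xml_text = build_record("P00000", ["PEPTIDEK", "SAMPLER"])
```

A shared schema of this kind lets a submitting lab and a repository exchange records without agreeing on each other's internal database layout.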

Discussion:

Participants made several comments and suggestions regarding the efforts at the European Bioinformatics Institute (EBI), including working to devise ways to link general repositories to specific repositories based on a set of common standards for data transfer. It was also noted that the exact nature of some post-translational modifications, such as glycosylation, cannot be mapped to the EBI vocabularies. Thus, EBI should consider biological questions as well when designing annotations used in its systems.

General Discussion:

Dr. Lance Liotta commented on the public response to the raw proteomics datasets that his group provided online as part of projects with the NCI. He noted that the NCI felt that the field would benefit from raw data sets generated as platforms were modified and developed. These data were partially analyzed. In response, hundreds of people analyzed the data using their own methodologies, and feedback included both suggestions for improved analytic methods and reports of an inability to reproduce the data. However, in some instances, the data were analyzed and papers were published without discussions with Dr. Liotta's research group. He therefore stressed the need for communication between those who post data and those who analyze it and publish their results. He noted that the concept of smaller groups that share data initially before posting (e.g., PRIDE) is a good idea. Also, he urged data posters to consider protections of confidentiality for human sample databases, as it cannot be assumed that users will communicate how they plan to analyze the data or communicate the results. One attendee commented that this example illustrates the importance of meta-data standards.

Participants then discussed the needs of biologists as users of proteomic data. It was observed that most biologists will not read tables of proteomic data; identified proteins must be shown to correlate with phenotypic relevance. Because conditions such as the CO2 level and tissue-culture techniques affect the proteome, this presents a major problem for biological use. Another participant noted that transcriptomic and genomic correlation is a key to making effective use of proteomic data.

One attendee commented on the difficulty of enforcing community standards for analysis and suggested that the field consider the example set by the Human Genome Project (HGP). When the data are made available, the community will develop the tools necessary to make these data biologically relevant. One attendee asked whether the proteomics informatics community is positioned to influence editors' policies for accepting manuscripts in tandem with the ability to release data. In response, it was noted that the modified data-release policy for most of the proteome data must recognize efforts of the sequencing and bioinformatics groups, with timely release being key.

It was also observed that standards for bioinformatics cannot be divorced from those for methodologies; both must be developed in parallel. Also, the analog data from MS differs from the digital data from the human genome, and the digital format embeds a certain level of objectivity.


National Cancer Institute
National Institutes of Health
U.S. Department of Health and Human Services