Proteomic Technologies Informatics Workshop - February 8-9, 2005 - The Fairmont Olympic Hotel, Seattle, WA

Session 2: Specimens, Experimental Annotations, and Data Quality
Discussion Leaders:
Steve Carr, Ph.D., The Broad Institute
Eric Deutsch, Ph.D., Institute for Systems Biology
David J. States, M.D., Ph.D., School of Medicine, University of Michigan

Leaders discussed the specimen, experimental, and data analysis annotations that data repositories need in order to be useful to the scientific community. Attendees discussed practical recommendations for generating informatics systems in the face of rapidly developing standards and responded to the use-case scenarios presented in the previous session.

Dr. Carr: Guidelines for Publication of Peptide and Protein-Identification Data

Speaking on behalf of the Molecular and Cellular Proteomics Working Group on Publication Guidelines, Dr. Carr noted that the dramatic increase in the number of large dataset papers being published has led to an inability to determine if results of peptide and protein identification are valid. Published studies often contain insufficient information for the reader to assess methods for data processing or protein identification criteria. Thus, it is likely that many incorrect interpretations are being published. The goals for these publication guidelines include:

  • Try to ensure that high-quality, significant data are entering the proteomics literature
  • Develop minimal guidelines for publication of peptide and protein identification data in molecular and cellular proteomics
  • Focus initially on how identifications are made and validated
  • Create guidelines that are neither burdensome nor dictatorial
  • Initiate the process for requiring submission of data as a condition of acceptance for manuscripts and the logistics involved in such a process

He noted that finding a peptide match in a database is relatively easy, but knowing whether it is correct is not. It is always possible to match a tandem mass spectrum to a peptide in the database, yet incorrect matches often result from the use of low-quality peptide tandem mass spectrometric data to search the database. Most algorithms use a model based upon an empirical threshold that serves as a "cutoff" value. As such, each algorithm is associated with an unknown and variable false-positive error rate. While statistical methods to validate peptide assignments to tandem mass spectra have shown promising results, none is widely available or accepted at present.
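
To make the point about cutoffs and error rates concrete, here is a minimal Python sketch of one way to put a number on the error rate of a score threshold. It assumes a decoy-database search, a technique the summary does not name, and every class, field, and score value below is invented for illustration. The underlying assumption is that decoy hits above the cutoff occur about as often as incorrect hits against the real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One peptide-spectrum match; all field names are illustrative."""
    spectrum_id: str
    score: float      # search-engine score, higher is better
    is_decoy: bool    # True if the best hit came from the decoy database


def estimated_false_positive_rate(matches, cutoff):
    """Estimate the fraction of accepted target matches that are wrong.

    Assumption: decoy hits at or above the cutoff are about as frequent
    as incorrect target hits, so decoys / targets approximates the rate.
    """
    targets = sum(1 for m in matches if m.score >= cutoff and not m.is_decoy)
    decoys = sum(1 for m in matches if m.score >= cutoff and m.is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


# Toy data, invented for illustration only.
matches = [
    Match("s1", 4.1, False),
    Match("s2", 3.7, False),
    Match("s3", 3.2, True),
    Match("s4", 2.9, False),
    Match("s5", 2.5, True),
    Match("s6", 2.4, False),
]

for cutoff in (2.0, 3.0, 4.0):
    rate = estimated_false_positive_rate(matches, cutoff)
    print(f"cutoff {cutoff}: estimated false-positive rate {rate:.2f}")
```

Reporting the cutoff together with an estimate of this kind would let a reader judge a threshold by its consequences rather than by its raw value.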

The guidelines proposed by the working group (Mol Cell Proteomics 2004;3:531) include:

  • Describe the search engine used and how peptide and protein assignments were made using that software, including thresholds and values specific to judging the certainty of identification and description of how applied
  • Provide sequence coverage observed for each protein identified
  • Increase the stringency of information required to use single peptide identifications for protein assignment
  • Describe how the number of unique proteins identified was counted based on the peptides found
  • Report the methods used to derive quantitative results from proteomic datasets (under development)
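
Several of the items above can be reported mechanically once peptide assignments exist. The following Python sketch is an illustration only, not the working group's method: it computes sequence coverage per protein, counts unique peptides, and flags proteins that rest on a single peptide. The function names and toy sequences are assumptions, and peptides are located by exact substring match.

```python
def sequence_coverage(protein_sequence, peptides):
    """Return the fraction of residues covered by at least one peptide."""
    covered = [False] * len(protein_sequence)
    for peptide in peptides:
        start = protein_sequence.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # Look for further occurrences of the same peptide.
            start = protein_sequence.find(peptide, start + 1)
    if not covered:
        return 0.0
    return sum(covered) / len(covered)


def summarize(proteins, identified):
    """Build one row per protein: unique peptides, coverage, single-hit flag."""
    rows = []
    for name, peptides in identified.items():
        unique = sorted(set(peptides))  # repeated observations count once
        rows.append({
            "protein": name,
            "unique_peptides": len(unique),
            "coverage": sequence_coverage(proteins[name], unique),
            "single_peptide": len(unique) == 1,
        })
    return rows


# Toy sequences and peptides, invented for illustration only.
proteins = {"P1": "MKTAYIAKQRQISFVKSHFS", "P2": "GSHMLEDPVK"}
identified = {"P1": ["MKTAYIAK", "QISFVK", "QISFVK"], "P2": ["LEDPVK"]}

for row in summarize(proteins, identified):
    print(row)
```

A table like this makes it explicit how proteins were counted and which assignments would fall under the stricter single-peptide requirement.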

Dr. Carr noted that the proteomics community is data-starved, which inhibits refinement and comparison of new algorithms. Recognizing that integration and collective analysis are likely to yield new knowledge, Molecular and Cellular Proteomics strongly encourages submission of all tandem mass spectra mentioned in a paper as supplemental material. The journal is moving toward accepting and serving raw or minimally-processed intact liquid chromatographic/tandem mass spectrometric datasets. However, storage on journal websites is not a viable long-term solution, underscoring the need for creating public repositories.

Recommendations to the mouse models consortia to handle data include:

  • Follow the Molecular and Cellular Proteomics guidelines
  • Use common search algorithms and a common database for searching
  • Employ statistical methods to evaluate the false-positive rate
  • Plan to integrate data for searching to identify weak associations not evident in single datasets
  • Employ common/consistent annotation of results
  • Store data in the original instrument vendor format in as minimally-processed a form as possible

Discussion:

One participant inquired if the journal has asked submitters to include a set of standards with their data; the answer was no. In parallel, however, people are providing sets of highly-curated tandem mass spectrometric data.

Another attendee inquired if this effort reflects wider, community-based efforts and whether the stringency of the guidelines eliminates some biologically-valid identifications. Dr. Carr noted that the guidelines reflect realities by asking submitters to justify their results more stringently. He commented that a consensus view is the ultimate goal; if the community indicates that these guidelines are too stringent, then they will be modified.

Another participant noted that attempts to decrease file space are limited and recommended that the journal request data in a certain format (e.g., mzXML).

Dr. Deutsch:

Dr. Deutsch commented on specimen annotation, noting that the more complex the mechanism for annotating specimens, the richer the query selection can be, the less likely that the annotations will be completed, and the longer time required to develop a good interface. Microarray and mouse community databases are sources for specimen annotation guidelines.

Points to consider include:

  • Plan how annotations will map to developing standards (e.g., microarray gene expression object model (MAGE-OM), functional genomics experiment object model (FuGE-OM)). The Minimum Information About a Proteomics Experiment (MIAPE) and the Minimum Information About a Microarray Experiment (MIAME) provide good roadmaps, and integration with microarray data will be needed.
  • Plan how annotations being captured will integrate with existing repositories from the microarray (e.g., ArrayExpress, the Gene Expression Omnibus) and mouse communities (e.g., eMAGE).
  • Require standard characteristics for common queries (e.g., organism, strain, disease state, cell type).
  • Use existing ontologies and predefined lists where possible (e.g., Mouse Anatomical Dictionary, Microarray Gene Expression Data Ontology, the Digital Anatomist Foundational Model of Anatomy (FMA), eVOC, Open Biological Ontologies).
  • Allow "anything else you've got" annotations (e.g., free text, protocols, arbitrary attached documents). While these may not be searchable, valuable information is retained.
  • Consider organismal independence (see the ISB's PeptideAtlas; http://www.peptideatlas.org). Even though the current goal is a repository for mouse model proteomic data, the next requirement will be data from another organism, such as human or rat.
  • Hire curators for whom a tidy, complete repository is a passion, to serve as a bridge between programmers and researchers.
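
The points above suggest a record with a few required, queryable fields plus an open-ended area for everything else. The Python sketch below is one hypothetical shape for such a record, not a schema from MAGE-OM, FuGE-OM, or any repository named here. The class name, field names, and the short organism list are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical predefined list; a real system would load terms from an
# existing ontology rather than hard-coding them.
ALLOWED_ORGANISMS = {"Mus musculus", "Homo sapiens", "Rattus norvegicus"}

REQUIRED_FIELDS = ("organism", "strain", "disease_state", "cell_type")


@dataclass
class SpecimenAnnotation:
    # Standard characteristics required for common queries.
    organism: str
    strain: str
    disease_state: str
    cell_type: str
    # "Anything else you've got": kept, but not validated or searchable.
    free_text: str = ""
    attachments: list = field(default_factory=list)

    def problems(self):
        """Return a list of human-readable validation problems."""
        issues = []
        for name in REQUIRED_FIELDS:
            if not getattr(self, name).strip():
                issues.append(f"missing required field: {name}")
        if self.organism.strip() and self.organism not in ALLOWED_ORGANISMS:
            issues.append(f"organism not in predefined list: {self.organism}")
        return issues


# Illustrative records only.
good = SpecimenAnnotation("Mus musculus", "C57BL/6", "mammary tumor",
                          "epithelial", free_text="collected at week 12")
bad = SpecimenAnnotation("Danio rerio", "", "healthy", "fibroblast")

print(good.problems())
print(bad.problems())
```

Because the organism is an ordinary field checked against a list, supporting human or rat data later means extending the list rather than changing the record structure.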

Discussion:

One attendee commented that certain peptides in a protein are more likely to be identified, and confidence increases if the number of hits for an entry is high. Does such an observation impact the development of protein identification tools? Dr. Deutsch mentioned the Prototypic Peptide Predictor, a tool currently under development that will show the peptide within the protein and process all possible permutations to predict its likelihood of being identified.

Another participant inquired about transforming datasets to mzXML and mzData. The need for a generic schema was noted, and one participant commented that MAGE version 2 will have the capacity to describe specimens in any format.

An attendee observed that writing a standard and convincing a community to use it are distinct challenges. Because most labs do not have the informatics resources to adopt state-of-the-art identification tools, data submission tools are critical.

Dr. States:

Dr. States began by noting that genomics had advantages over proteomics in terms of less tissue variation, one copy of each gene per genome, few sample handling issues, and simpler considerations regarding modification. Large-scale genomics efforts offer many lessons, including developing a framework for error identification, setting standards, and validating lab performance. Observing that applications drive accuracy requirements, he pointed out that the error rate falls as sequencing costs increase. He highlighted several quality assurance exercises from the Human Genome Project (HGP), including using a Cooperative Research and Development Agreement (CRADA) funding mechanism, blind resequencing of test samples, estimation of error rates only after completing a megabase of sequencing, and telescoping the eight sequencing labs into three centers that locked in the major technology choices.

Regarding proteomic analyses, Dr. States commented that abundance is the single most likely predictor of whether a protein will be detected. Identifications may be highly significant even if they are not reproducible. Also, an observation must be reproducible within the original lab. Proteins can be identified at several levels, including member of a gene family, gene product, post-translational modification, transcriptional/splice variant, and complete covalent structure.

Issues in project coordination include multiple permitted formats for data submissions to databases, choice of LIMS, division of responsibilities, data storage, and project coordination. For the Eastern Consortium, the choice of whether to implement a local LIMS and whether to use the NCI's caLIMS (http://calims.nci.nih.gov/developers/) forms and interfaces within the lab is entirely up to the lab. caLIMS offers no explicit support for proteomics or genetics (it was designed for molecular biology), is generic, is integrated with caBIG, and can be adapted to the Mouse Models of Human Cancer Consortium or the Plasma Proteome Project. However, caLIMS data definitions provide a common vocabulary.

Dr. States also noted the danger in imposing too much rigidity in quality control during the early stage of proteomic technology development. Although error processes and accuracy requirements need to be more carefully defined, informatics support in the labs is currently limited. He also stressed the need for project coordination. A division of labor between individual labs and the consortium data center, archiving of data at multiple levels (e.g., raw, processed, analyzed), and the early and inclusive definition of variables will all enhance project progress.

Discussion:

Participants discussed issues related to the publication of proteomic data. It was noted that the literature is ambiguous because results being published are often derived from single experiments. Aggregate data sets across labs will help to make the associations derived from literature analysis much stronger. It was suggested that the number of fractions analyzed and the number of replicate runs per sample be included in publication submissions.

It was also noted that complex mixtures will likely yield divergent and unusual results. Tools such as ProteinProphet were recommended to reduce protein identifications based on single peptides. It was agreed that different labs will continue to display variations in their reporting styles, although this does not preclude concomitant use of a communal standard.

Attendees discussed whether consortia should post raw or processed data. The advantage of making data available online is that the community can view and comment upon the processes of data collection. It was suggested that the consortia make available both minimally-processed and analyzed data, although the mechanism by which the data are posted requires discussion. One participant suggested that data that are processed in multiple steps should be posted in select steps.

The error models associated with processing proteomic data must be understood for the data to be useful. An objective understanding of associated error will enable database users to understand the data without overinterpreting them. To this end, it was suggested to provide a minimal level of filtering to prevent overinterpretation.

One participant responded that a list of candidate peptides or proteins may become a list of biologically-relevant proteins upon validation. Due to fragmentation variances, sample heterogeneity, and the variety of biomarkers associated with one cancer type, panels of biomarkers may become the true indicators of cancer detection. In this case, it will be necessary to determine the number of serum samples from different mice necessary for a marker to be defined as meriting further investigation. Moreover, standard nomenclature should be developed to distinguish between candidate markers (those not yet validated) and "true" biomarkers and to enhance the public's understanding of this concept.

Another participant commented that GenBank entries were "owned" by their depositors, and comments added were attributed to the submitter. Thus, the consortia should allow users to add analyses to the consortia database, with conflicting results resolved and the resolution published. It was also suggested to have a separate data warehouse for processed data in addition to a repository for raw data. In summary, the consortia should provide both raw and processed data and an explanation of how conclusions were drawn from these data.


National Cancer Institute
National Institutes of Health
U.S. Department of Health and Human Services