Meeting Summary
Session 2: Specimens, Experimental Annotations, and Data Quality

Leaders discussed the specimen, experimental, and data analysis annotations required of the data repositories to be of use to the scientific community. Attendees discussed practical recommendations for generating informatics systems in the face of rapidly developing standards and responded to the use-case scenarios presented in the previous session.

Dr. Carr: Guidelines for Publication of Peptide and Protein-Identification Data

Speaking on behalf of the Molecular and Cellular Proteomics Working Group on Publication Guidelines, Dr. Carr noted that the dramatic increase in the number of large dataset papers being published has led to an inability to determine if results of peptide and protein identification are valid. Published studies often contain insufficient information for the reader to assess methods for data processing or protein identification criteria. Thus, it is likely that many incorrect interpretations are being published. The goals for these publication guidelines include:
He noted that finding a peptide match in a database is relatively easy, but knowing whether it is correct is not. It is always possible to match a tandem mass spectrum to a peptide in the database, yet incorrect matches often result from the use of low-quality peptide tandem mass spectrometric data to search the database. Most algorithms use a model based upon an empirical threshold that serves as a "cutoff" value. As such, each algorithm is associated with an unknown and variable false-positive error rate. While statistical methods to validate peptide assignments to tandem mass spectra have shown promising results, none is widely available or accepted at present. The guidelines proposed by the working group (Mol Cell Proteomics 2004;3:531) include:
Dr. Carr noted that the proteomics community is data-starved, which inhibits refinement and comparison of new algorithms. Recognizing that integration and collective analysis are likely to yield new knowledge, Molecular and Cellular Proteomics strongly encourages submission of all tandem mass spectra mentioned in a paper as supplemental material. The journal is moving toward accepting and serving raw or minimally-processed intact liquid chromatographic/tandem mass spectrometric datasets. However, storage on journal websites is not a viable long-term solution, underscoring the need for creating public repositories. Recommendations to the mouse models consortia to handle data include:
Discussion: One participant inquired if the journal has asked submitters to include a set of standards with their data; the answer was no. In parallel, however, people are providing sets of highly curated tandem mass spectrometric data. Another attendee inquired if this effort reflects wider, community-based efforts and whether the stringency of the guidelines eliminates some biologically valid identifications. Dr. Carr noted that the guidelines reflect realities by asking submitters to justify their results more stringently. He commented that a consensus view is the ultimate goal; if the community indicates that these guidelines are too stringent, then they will be modified. Another participant noted that attempts to decrease file space are limited and recommended that the journal request data in a certain format (e.g., mzXML).

Dr. Deutsch:

Dr. Deutsch commented on specimen annotation, noting that the more complex the mechanism for annotating specimens, the richer the query selection can be, the less likely that the annotations will be completed, and the longer the time required to develop a good interface. Microarray and mouse community databases are sources for specimen annotation guidelines. Points to consider include:
Discussion: One attendee commented that certain peptides in a protein are more likely to be identified, and that confidence increases if the number of hits for an entry is high, and asked whether such an observation affects the development of protein identification tools. Dr. Deutsch mentioned the Prototypic Peptide Predictor, a tool currently under development that will show the peptide within the protein and process all possible permutations to predict its likelihood of being identified. Another participant inquired about transforming datasets to mzXML and mzData. The need for a generic schema was noted, and one participant commented that MAGE version 2 will have the capacity to describe specimens in any format. An attendee observed that writing a standard and convincing a community to use it are distinct challenges. Because most labs do not have the informatics resources to adopt state-of-the-art identification tools, data submission tools are critical.

Dr. States:

Dr. States began by noting that genomics had advantages over proteomics in terms of less tissue variation, one copy of each gene per genome, few sample handling issues, and simpler considerations regarding modification. Large-scale genomics efforts offer many lessons, including developing a framework for error identification, setting standards, and validating lab performance. Observing that applications drive accuracy requirements, Dr. States noted that the error rate falls as sequencing costs increase. He highlighted several quality assurance exercises from the HGP, including using a Cooperative Research and Development Agreement (CRADA) funding mechanism, blind resequencing of test samples, estimation of error rates only after completing a megabase of sequencing, and telescoping the eight sequencing labs into three centers that locked in the major technology choices. Regarding proteomic analyses, Dr. States commented that abundance is the single best predictor of whether a protein will be detected.
Identifications may be highly significant even if they are not reproducible. Also, an observation must be reproducible within the original lab. Proteins can be identified at several levels, including member of a gene family, gene product, post-translational modification, transcriptional/splice variant, and complete covalent structure. Issues in project coordination include multiple permitted formats for data submissions to databases, choice of LIMS, division of responsibilities, and data storage. For the Eastern Consortium, the choice of whether to implement a local LIMS and whether to use the NCI's caLIMS (http://calims.nci.nih.gov/developers/) forms and interfaces within the lab is entirely up to the lab. caLIMS offers no explicit support for proteomics or genetics (it was designed for molecular biology), is generic, is integrated with caBIG, and can be adapted to the Mouse Models of Human Cancer Consortium or the Plasma Proteome Project. However, caLIMS data definitions provide a common vocabulary. Dr. States also noted the danger in imposing too much rigidity in quality control during the early stage of proteomic technology development. Although error processes and accuracy requirements need to be more carefully defined, informatics support in the labs is currently limited. He also stressed the need for project coordination. A division of labor between individual labs and the consortium data center, archiving of data at multiple levels (e.g., raw, processed, analyzed), and the early and inclusive definition of variables will all enhance project progress.

Discussion: Participants discussed issues related to the publication of proteomic data. It was noted that the literature is ambiguous because results being published are often derived from single experiments. Aggregate data sets across labs will help to make the associations derived from literature analysis much stronger.
It was suggested that the number of fractions analyzed and the number of replicate runs per sample be included in publication submissions. It was also noted that complex mixtures will likely yield divergent and unusual results. Tools such as ProteinProphet were recommended to reduce protein identifications based on single peptides. It was agreed that different labs will continue to display variations in their reporting styles, although this does not preclude concomitant use of a communal standard.

Attendees discussed whether consortia should post raw or processed data. The advantage of making data available online is that the community can view and comment upon the processes of data collection. It was suggested that the consortia make available both minimally processed and analyzed data, although the mechanism by which the data are posted requires discussion. One participant suggested that data that are processed in multiple steps should be posted at select steps. The error models associated with processing proteomic data must be understood for the data to be useful. An objective understanding of associated error will enable database users to understand the data without overinterpreting them. To this end, it was suggested that a minimal level of filtering be provided to prevent overinterpretation. One participant responded that a list of candidate peptides or proteins may become a list of biologically relevant proteins upon validation. Due to fragmentation variances, sample heterogeneity, and the variety of biomarkers associated with one cancer type, panels of biomarkers may become the true indicators of cancer detection. In this case, it will be necessary to determine the number of serum samples from different mice necessary for a marker to be defined as meriting further investigation.
Moreover, standard nomenclature should be developed to distinguish between candidate markers (those not yet validated) and "true" biomarkers and to enhance the public's understanding of this concept. Another participant commented that GenBank entries were "owned" by their depositors, and comments added were attributed to the submitter. Thus, the consortia should allow users to add analyses to the consortia database, with conflicting results resolved and the resolution published. It was also suggested that a separate data warehouse for processed data be maintained in addition to a repository for raw data.

In summary, the consortia should provide both raw and processed data and an explanation of how conclusions were drawn from these data.