Proteomic Technologies Informatics Workshop - February 8-9, 2005 - The Fairmont Olympic Hotel,  Seattle, WA

Session 3: Lessons and Challenges of Building Data Repositories
Discussion Leaders:
Kenneth H. Buetow, Ph.D., National Cancer Institute
Ronald Beavis, Ph.D., Beavis Informatics, Ltd.
Mark Igra, Fred Hutchinson Cancer Research Center

Leaders presented their experiences in developing other data repositories and discussed anticipated challenges for developing proteomics repositories in the current environment of rapidly changing technology and immature standards.

Dr. Buetow:

Dr. Buetow began by discussing how resource-development experiences with diverse communities (e.g., the human gene mapping community, caBIG, MMHCC) have contributed to lessons learned, the most basic of which is to understand the scope of the problem that the community is attempting to solve (e.g., goals, needs, users). NCI biomedical informatics initiatives have a goal of creating a virtual web of interconnected data, individuals, and organizations that redefines how research is conducted, care is provided, and patients/participants interact with the biomedical research enterprise.

caBIG (www.cabig.nci.nih.gov) is an initiative to create a useful tool based on this goal that attempts to cover the watershed of the cancer enterprise. It is being piloted through base agreements in 45 NCI Cancer Centers that have agreed to caBIG principles. caBIG is "open" in many ways, including open source code, open access, data sharing, and "do no harm" licenses. Because tomorrow's tools will likely differ from those used today, processes are dynamic and evolutionary, and the infrastructure must be designed to facilitate rapid exploration of new methods. caBIG is built around smaller, component-based software applications that can "plug-and-play" into new complex structures. Focus areas include boundaries, interfaces, and the metadata infrastructure that joins components, with the shape of boundaries defined by application programming interfaces (APIs).
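
The component-and-interface idea described above can be illustrated with a minimal sketch. This is not caBIG code and does not use any caBIG API; the interface names (DataSource, Analyzer), the example classes, and the run_pipeline helper are all hypothetical, chosen only to show how an API boundary lets components be swapped without changing the code that joins them.

```python
from typing import Iterable, Protocol


class DataSource(Protocol):
    """Hypothetical boundary: anything that can yield records as dicts."""

    def records(self) -> Iterable[dict]: ...


class Analyzer(Protocol):
    """Hypothetical boundary: anything that can summarize records."""

    def analyze(self, records: Iterable[dict]) -> dict: ...


class InMemorySource:
    """One interchangeable data component (illustrative only)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def records(self):
        return iter(self._rows)


class CountByField:
    """One interchangeable analysis component: counts values of a field."""

    def __init__(self, field):
        self._field = field

    def analyze(self, records):
        counts = {}
        for record in records:
            key = record.get(self._field)
            counts[key] = counts.get(key, 0) + 1
        return counts


def run_pipeline(source: DataSource, analyzer: Analyzer) -> dict:
    # The pipeline depends only on the two interfaces, not on concrete classes,
    # so either component can be replaced independently.
    return analyzer.analyze(source.records())
```

Any other class that provides a records() or analyze() method with the same shape could be passed to run_pipeline unchanged, which is the sense in which the boundary, rather than the component, defines compatibility.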

caBIG focuses on standards rather than standardization; data standards are developed to be used as exchange or submission formats. These standards cannot be proprietary and are developed "just in time" as solutions to real, practical problems. The caCORE (cancer Common Ontologic Representation Environment) comprises biomedical information objects (to allow extraction of data from their representations in databases and provide conceptual representations so that groups can agree to a common mapping), common data elements (CDEs; structured data reporting elements), and a controlled vocabulary (through the NCI Thesaurus and NCI Meta-Thesaurus).

Standards that support this infrastructure include Enterprise Vocabulary Services (EVS; a toolkit of browsers and APIs), cancer Bioinformatics Infrastructure Objects (caBIO; applications and APIs), the cancer Data Standards Repository (caDSR), and a caCORE software development toolkit. caBIG employs a compatibility matrix that indicates the varying levels of compatibility of a particular system with the grid.

Another lesson learned from previous communal efforts is that quality measures are transforming. Objective measures are critical and should track with both the qualitative and quantitative data. Experimental inputs can be as critical and important as outputs, even though the ultimate use cases may be unclear at present. caBIG has a series of resources and pilot projects that will be online in 2005, including the Tissue Banks and Pathology Tools Workspace (TBPTW) and the Integrated Cancer Research pilot. Community members are encouraged to participate in caBIG activities, submit tools and data infrastructures to caBIG repositories, and work toward making individual applications and solutions caBIG compatible.

Discussion:

One participant asked about proteomic applications for caBIG. Dr. Buetow noted that an interest group is currently working on a Proteomics LIMS (estimated deployment: the third quarter of 2005) and also a general-purpose XML system. He noted that caBIG is a federated infrastructure, and anyone may contribute. Another participant noted that community input will help to formulate the shape and capabilities of caBIG.

Another attendee inquired about plans to curate data that are inside repositories that will be integrated with the grid. Dr. Buetow responded that caBIG will attempt to integrate datasets that are identified by the caBIG community as important, as well as new databases when identified.

Dr. Beavis:

Dr. Beavis contextualized his presentation with a quote from Eric Steven Raymond (The Cathedral and the Bazaar): "Perfection (in design) is achieved not when there is nothing more to add, but rather when there is nothing more to take away." Based on the design of databases such as MIAPE and RADARS, peaks generated from mass spectra account for the vast majority of the difficulties in use. He thus suggested the following principles of database design:

  • Restrict the spectrometric data stored in repositories to only those data necessary to support conclusions
  • Accept metadata storage and use a structured data format (e.g., XML) to create a rational, simplified relational database design
  • Utilize XML structure to retain object relationships
  • Design a relational database for queries that can rapidly access the XML information
  • Utilize external resources and do not attempt to create a database that holds all knowledge
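
The principles above can be sketched in a few lines. The following is a minimal illustration only, not the GPMDB schema: the XML element and attribute names, the two table layouts, and the helper functions are all assumptions made for the example. The full result is kept as an XML document so that hierarchical relationships survive, while a small indexed table holds just the fields needed for fast queries.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical result document; element and attribute names are illustrative.
RESULT_XML = """
<result id="r1">
  <parameters enzyme="trypsin" fragment_tolerance="0.4"/>
  <protein accession="P12345" expect="-12.3">
    <peptide sequence="LVNELTEFAK" expect="-4.1"/>
    <peptide sequence="YLYEIAR" expect="-2.7"/>
  </protein>
</result>
"""


def build_store():
    """Create an in-memory database with two tables (an assumed layout)."""
    conn = sqlite3.connect(":memory:")
    # Table 1: the complete XML document, which retains object relationships.
    conn.execute(
        "CREATE TABLE result (result_id TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )
    # Table 2: a narrow lookup table holding only the queryable fields.
    conn.execute(
        "CREATE TABLE peptide ("
        " result_id TEXT REFERENCES result(result_id),"
        " accession TEXT, sequence TEXT, expect REAL)"
    )
    conn.execute("CREATE INDEX idx_peptide_accession ON peptide (accession)")
    return conn


def load_result(conn, xml_text):
    """Store the XML whole and extract only the fields used for queries."""
    root = ET.fromstring(xml_text)
    result_id = root.get("id")
    conn.execute("INSERT INTO result VALUES (?, ?)", (result_id, xml_text))
    for protein in root.iter("protein"):
        for peptide in protein.iter("peptide"):
            conn.execute(
                "INSERT INTO peptide VALUES (?, ?, ?, ?)",
                (
                    result_id,
                    protein.get("accession"),
                    peptide.get("sequence"),
                    float(peptide.get("expect")),
                ),
            )
    conn.commit()
    return result_id


def peptides_for(conn, accession):
    """Fast relational query: peptide sequences for one protein, best first."""
    rows = conn.execute(
        "SELECT sequence FROM peptide WHERE accession = ? ORDER BY expect",
        (accession,),
    )
    return [row[0] for row in rows]


def parameters_for(conn, result_id):
    """Detail lookup: go back to the stored XML for the search parameters."""
    (xml_text,) = conn.execute(
        "SELECT xml FROM result WHERE result_id = ?", (result_id,)
    ).fetchone()
    return dict(ET.fromstring(xml_text).find("parameters").attrib)
```

In this sketch the relational side answers "which peptides were observed for this protein?" through an index, and anything more detailed is read from the stored XML, so new kinds of detail can be added to the document without adding tables.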

Dr. Beavis then discussed the Global Proteome Machine Database (GPMDB) design, which uses the minimum number of tables necessary. XML contains the search parameters, statistics, and other detailed hierarchical structures of ways to put amino acids into domains to identify a protein. Such a design is easier to build and query than are larger, more annotated databases. GPMDB has 5.1 million annotations, and robots troll through the data regularly and highlight outliers. The database includes publicly available data plus data contributed by the public. For a particular protein, a series of mass spectra can be evaluated and compared.

Discussion:

One participant inquired about the minimum data necessary to support conclusions, and Dr. Beavis noted that adding tables into a database is easier than removing them, so the user must decide upfront about desired conclusions. Another participant inquired about the capabilities to analyze differential display data quantitatively and semi-quantitatively. Dr. Beavis noted that, because of the variety in quantitation strategies, it will be best to decide on a method first.

Dr. Igra:

Dr. Igra discussed the repository development strategy at the FHCRC, noting that current capabilities include tracking mice and samples and storage and analysis of tandem mass spectra. Goals include usability, the ability to incorporate experiments and samples from many labs, and helping to establish a widely used standard. He then contextualized the issues in terms of the development of the World Wide Web and Linux, which were successful due to low barriers to entry, an "evolvable" structure, and widespread utility. The strategy used by the FHCRC was to start with an extensible annotations framework and web ontology language that assigns and reads a unique identifier for any particular item. Tools for annotation are being developed (e.g., sample annotators, experiment annotators, and systems customized for each lab) using an evolutionary model that is both open-source and open-process. The system is being designed for facile community participation and practical use. The object model and other components will be provided to the community.
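
The idea of an extensible annotation framework keyed by unique identifiers can be sketched as follows. This is not the FHCRC system or its object model; the AnnotationStore class, the "urn:example:" identifier scheme, and the property names are hypothetical and serve only to show how new kinds of annotation can be attached to any item without changing a fixed schema.

```python
import uuid


class AnnotationStore:
    """Hypothetical store: every item gets a unique ID plus free-form properties."""

    def __init__(self):
        self._items = {}

    def register(self, kind):
        """Mint a unique identifier for a new item (assumed URN-style scheme)."""
        item_id = f"urn:example:{kind}:{uuid.uuid4()}"
        self._items[item_id] = {"kind": kind}
        return item_id

    def annotate(self, item_id, prop, value):
        """Attach or overwrite a property; no schema change is needed."""
        if item_id not in self._items:
            raise KeyError(f"unknown item: {item_id}")
        self._items[item_id][prop] = value

    def describe(self, item_id):
        """Read back all annotations for one item as a copy."""
        return dict(self._items[item_id])

    def find(self, prop, value):
        """Return identifiers of all items whose property equals the value."""
        return [
            item_id
            for item_id, props in self._items.items()
            if props.get(prop) == value
        ]
```

A lab could, for example, register a mouse and a sample, then record store.annotate(sample, "derived_from", mouse); because the value is itself an identifier, items from different labs can reference one another without sharing a table layout.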

Discussion:

One participant commented that users may be interested in an attainable "choice" protein as a test case, rather than a common "ocean" protein, as displayed in this presentation. Dr. Igra suggested querying those low-abundance proteins that are of most interest to workshop participants. Bioinformaticians could create a suggested list of proteins to analyze, and the biologists can add value to the quantitative information by contextualizing the relevance.

Another participant noted that establishing close ties between software writers and system users will facilitate the development process and create useful products. Another attendee noted that proteomic technologies develop faster than LIMS systems, making open-source development critical. Also, it is important to put constraints on the system. While input from biologists is critical for interfaces, mass spectrometrists, users, and biologists must communicate to make resources work effectively. Participants agreed that a team approach is necessary; biologists and informatics personnel must collaborate to tie results to relevance. Another attendee highlighted the Molecular Alterations in Breast Cancer initiative, designed to capture all data relevant to the specific disease (e.g., heterogeneous data from studies and patients, polymorphisms, epigenetic alterations). While the database design for such an undertaking is relatively trivial, a shared, concrete vision is necessary up front to harness the data effectively.


National Cancer Institute
National Institutes of Health
U.S. Department of Health and Human Services