Meeting Summary
Keynote Address: Quality Control for Large, Distributed Data Collection Efforts: Lessons From the Human Genome Project

A report issued by the National Research Council in 1988, Mapping and Sequencing the Human Genome (National Academy Press, Washington, D.C., 1988), provided a coherent policy framework for the Human Genome Project (HGP). The report specifically noted that a special effort should be organized and funded to create the genomic sequence map and that a diversified, sustained effort would be necessary to address technical issues. When this report was published, GenBank held approximately 15.5 million base pairs (bp) of sequence data, about 0.5% of the size of the human genome, and the average entry was 1,064 bp long. The late 1990s saw exponential growth in the number of sequences and base pairs deposited in GenBank, which exceeded 28 billion bp by 2002.

In contrast to efforts to map the human proteome, the technology base for the HGP proved to be relatively straightforward and stable. By 1990, it was clear that an automated, four-color-fluorescence-based implementation of Sanger dideoxy sequencing would be used. However, many incremental improvements to the technology during the 1990s were essential to ultimate success. These advances included cycle sequencing (1989), linear polyacrylamide techniques (1994), energy-transfer dyes (1995), mutant DNA polymerases (1995), and, most importantly, quality statistics (e.g., phred, 1998).

Dr. Olson noted that by the late 1990s, scale-up of the HGP was imminent, bringing quality control issues to the fore. The phred/phrap system for evaluating the quality of the raw data in a sequencing trace (Ewing B and Green P. Genome Res 1998;8:186-194) provided the quality control tool needed to evaluate data submitted to GenBank.
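
As an illustration of the kind of quality statistic discussed in the keynote, phred quality values are conventionally defined as Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The sketch below only demonstrates that conversion; the function names are hypothetical and are not part of the phred or phrap software, and the summary itself does not describe how the centers computed or applied these values.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a phred quality value Q to an estimated error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability P (0 < P <= 1) to a phred value Q."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


def expected_errors(qualities) -> float:
    """Sum per-base error probabilities to estimate the number of wrong calls."""
    return sum(phred_to_error_prob(q) for q in qualities)


if __name__ == "__main__":
    # Illustrative values only, not data from the Human Genome Project.
    for q in (10, 20, 30, 40, 50):
        print(f"Q{q}: estimated error probability {phred_to_error_prob(q):.0e}")
```

Under this definition, Q20 corresponds to an estimated 1-in-100 chance that a base call is wrong, and Q50 to 1 in 100,000. Because the scale is logarithmic, every 10-point increase in Q means a tenfold drop in the estimated error probability.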
In a series of inter-center quality control exercises initiated by the National Human Genome Research Institute (NHGRI) in 1997, it quickly became apparent that the phred/phrap system provided an effective, easily adopted approach to quality control. By far the most important activity during these exercises was the exchange of raw data between centers, followed by reanalysis at a center other than the data producer. By the time data production scaled up steeply in 1999, there was broad consensus that the quality control problem had been solved. As a result, final data quality in the April 2003 release of the human genome was excellent, with an error rate of approximately 10⁻⁵ (about one error per 100,000 bases). Current quality-control concerns are second-order: misassemblies, gaps in difficult-to-sequence regions, and the tradeoff between quality and utility in sequences that are used primarily for comparison to a small number of gold-standard genomes.

Dr. Olson also noted that the public-private competition to complete the human genome sequence strained basic scientific values. Rapid scientific progress, combined with exuberant entrepreneurial capitalism, simultaneously created a temporary financial goldmine and misleading advertising about the benefits of solving the sequence as rapidly as possible. He noted that the intense social interest in the HGP helped encourage a set of dynamics that hover over all large-scale scientific endeavors carried out in the public eye. Dr. Olson commented that an irreducible amount of faith in one's colleagues is essential to maintain balance in such projects; the ultimate QC issue for the scientific community is how it maintains its basic values in the face of intense social forces.
Discussion: One audience member, noting that the characterization of the proteome parallels the HGP in terms of the speed at which proteins can be identified, asked about the trajectory of proteomic efforts relative to that observed with the genome. Dr. Olson noted that the key to rapid progress in proteomics will be the exchange of raw, unedited data between labs. Inter-lab cooperation will be the backbone of a successful proteomics initiative.

Another attendee asked for Dr. Olson's thoughts on the role of the private sector in such an endeavor. Dr. Olson replied that this role will vary from case to case. He noted that the breakdown in the relationship between the public and private sectors in the HGP occurred because parties whose goals did not overlap were encouraged to work together, and that candid discourse is essential at an early stage to identify areas of overlap and shared interest.

Another participant asked whether the genome results were over-hyped and how to balance the language used for such projects to engender public acceptance. Dr. Olson reiterated the central importance of candor, noting that there is a tremendous temptation to oversell the benefits of a public project. He concluded by noting that society does support such efforts, even when it fails to pay much attention to them.

Summary: Workshop Day 1

The consortia informatics groups summarized their development plans in light of the discussions on Day 1. Day 1 discussion leaders offered their reflections on the previous day's sessions, noting a positive trajectory toward identifying action items that will move this activity forward. Reiterating the need to provide users with the tools to extract meaningful information and results from the data, the leaders commented that specific action items can be implemented in the next few weeks and months to set guidelines that will extend beyond the scope of the mouse model consortia. Dr. Downing also noted that the NCI will launch a website on its clinical proteomics projects and the two consortia on March 7.