Each cell in an organism contains its complete genome, but depending on the cell type, only the genes necessary for conducting the work of that cell type are expressed. The human genome sequence consists of an ordered listing of the adenine, cytosine, guanine, and thymine bases found on the 46 human chromosomes. Only about 1% of the genome sequence codes for proteins necessary for human life. Most of the rest of the genome consists of large repetitive noncoding regions whose function is not well understood. It is known, however, that critical clues to diseases such as cancer, diabetes, and osteoporosis lie in areas of the genome that do not code for protein. About one-fourth of the genome contains long, gene-free segments, whereas other regions contain much higher gene concentrations. The number of protein-coding regions (genes) in the human genome, estimated at 30,000-40,000, is surprisingly small, given that the fruit fly has 13,000 genes and the thale cress plant has 26,000 [15].
The National Human Genome Research Institute (led by Francis Collins) and the Department of Energy's Human Genome Program (headed by Ari Patrinos) managed and coordinated the human genome sequencing program, initiated in 1990. A substantially complete version of the 2.9 billion base-pair human genome sequence was published in 2001 [16]. Reference 16 reports the results of an international collaboration to produce and make freely available a draft sequence of the human genome. The article presents an initial analysis of the data and describes some of the insights that can be gleaned from the sequence.
The Human Genome Project uses the shotgun sequencing method, in which enzymes cut DNA into hundreds or thousands of random fragments that are then sent to automated sequencing machines capable of handling DNA fragments up to 500 bases long. After sequencing, the fragments are pieced back together to become part of the sequenced genome. The shotgun approach is applied to cloned DNA fragments that have already been mapped; that is, each fragment's location on the genome is already known. By 2003, the Human Genome Project hopes to make a complete human genome sequence available to scientists in a freely accessible database. The National Institutes of Health website address for current human genome sequence data is http://www.ncbi.nlm.nih.gov/genome/guide/H_sapiens.html and the National Human Genome Research Institute's researcher resources website is at http://www.nhgri.nih.gov/Data/.
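The cut-then-reassemble idea behind shotgun sequencing can be illustrated with a toy example. The sketch below is not the assembly software used by the Human Genome Project; it is a minimal greedy overlap merger, and the function names, the 16-base "genome", the four reads, and the minimum overlap of 3 bases are all hypothetical choices made for illustration. It also assumes error-free reads from a sequence without repeats, which real genomes do not satisfy.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def drop_contained(frags):
    """Remove fragments that occur entirely inside another fragment."""
    return [f for f in frags if not any(f != g and f in g for g in frags)]


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Toy assumption: reads are error-free and the underlying sequence
    has no repeats at least `min_len` bases long.
    """
    # Remove exact duplicates (keeping order), then contained fragments.
    frags = drop_contained(list(dict.fromkeys(fragments)))
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = frags[best_i] + frags[best_j][best_n:]
        rest = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags = drop_contained(rest + [merged])
    return frags


# Hypothetical reads covering the made-up sequence ATGGCGTACCTTAGAC,
# listed out of order as a sequencer would deliver them.
reads = ["CGTACCTT", "ATGGCGTA", "CCTTAGAC", "GGCGTACC"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTTAGAC']
```

Greedy merging can misassemble sequences that contain repeats, which is one reason that knowing in advance where each cloned fragment maps on the genome is useful.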
The Human Genome Project's goal is to produce high-quality, accurate, finished DNA sequences according to the following standards: (1) The DNA sequence is 99.99% accurate. (2) The sequence must be assembled; that is, the smaller lengths of sequenced DNA have been incorporated into much longer regions reflecting the original piece of genomic DNA. (3) The task must be affordable (the project funds technology development to reduce costs as much as possible). (4) The data must be accessible. To this end, verified DNA sequencing data are deposited in public databases on a daily basis. In the previous section, the Sanger and Maxam-Gilbert methods for DNA cleavage followed by gel electrophoresis for DNA sequencing were described. These so-called first-generation gel-based sequencing technologies can be used to sequence small regions of interest in the human genome, but they are too slow and too expensive for individual chromosomes, let alone a complete genome. The Human Genome Project, in pursuing its goal of affordability, has focused on the development of automated sequencing technology that can accurately sequence 100,000 or more bases per day at a cost of less than $0.50 per base. Second-generation (interim) sequencing technologies, focusing on important disease genes, for instance, use techniques such as (a) high-voltage capillary and ultrathin electrophoresis to increase fragment separation rate and (b) resonance ionization spectroscopy to detect stable isotope labels. Third-generation gel-less sequencing technologies aim to increase efficiency by several orders of magnitude. These developing technologies include (1) enhanced fluorescence detection of individual labeled bases in flow cytometry, (2) direct reading of the base sequence on a DNA strand using scanning tunneling or atomic force microscopies (described in Section 3.7.1), (3) enhanced mass spectrometric analysis of DNA sequences, and (4) sequencing by hybridization to short panels of nucleotides of known sequence.
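To give a sense of scale for the throughput and cost targets quoted above (100,000 bases per day at $0.50 per base), the following back-of-the-envelope sketch applies them to the 2.9 billion base-pair genome. The function name is hypothetical, and the calculation assumes a single instrument reading every base exactly once; real projects read each base several times and run many instruments in parallel, so the numbers are illustrative only.

```python
GENOME_BP = 2_900_000_000   # genome size quoted in the text
BASES_PER_DAY = 100_000     # target throughput quoted in the text
CENTS_PER_BASE = 50         # target cost ceiling, $0.50 per base


def single_pass_estimate(genome_bp, bases_per_day, cents_per_base):
    """Days and dollars to read every base once on one instrument.

    Assumption (not from the text): one instrument, one pass, no overhead.
    """
    days = genome_bp / bases_per_day
    cost_usd = genome_bp * cents_per_base / 100
    return days, cost_usd


days, cost_usd = single_pass_estimate(GENOME_BP, BASES_PER_DAY, CENTS_PER_BASE)
print(f"{days:,.0f} instrument-days (about {days / 365:.0f} years)")
print(f"${cost_usd:,.0f} at the cost ceiling")
```

Under these assumptions a single pass takes 29,000 instrument-days and costs $1.45 billion at the ceiling, which shows why efficiency gains of several orders of magnitude are sought.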
Concurrently with work published by the Human Genome Project, a complete human genome sequence was reported by a consortium of 14 academic, nonprofit, and industrial research groups with the work coordinated by Celera Genomics [17]. The following text is excerpted from the abstract of reference 17. In this work a 2.91-billion base-pair (bp) consensus sequence of the euchromatic portion (the portion containing genes) of the human genome was generated by the whole-genome shotgun sequencing method. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G + C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
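The percentages and rates quoted from reference 17 can be converted into approximate base-pair counts. The short sketch below does this arithmetic with integer math; the variable and function names are hypothetical, and the results are rough figures implied by the quoted numbers rather than values reported in the reference. Note that the three quoted percentages add up to 100.1%, presumably because of rounding.

```python
GENOME_BP = 2_910_000_000  # euchromatic consensus length quoted above

# Quoted shares of the genome, in parts per thousand to keep integer math.
PER_MILLE = {"exons": 11, "introns": 240, "intergenic": 750}


def span_bp(per_mille, genome_bp=GENOME_BP):
    """Approximate number of base pairs covered by a given share."""
    return genome_bp * per_mille // 1000


def expected_pairwise_differences(genome_bp=GENOME_BP, bp_per_difference=1250):
    """Average differences between two haploid genomes at 1 bp per 1250."""
    return genome_bp // bp_per_difference


for region, share in PER_MILLE.items():
    print(f"{region:>10}: {span_bp(share):>13,} bp")
print(f"pairwise differences: {expected_pairwise_differences():,}")
```

At the quoted average rate, two haploid genomes would differ at roughly 2.3 million positions, which is the same order of magnitude as the 2.1 million SNP locations reported.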
Scientists will continue to use the information generated by the human genome sequencing publications to understand how genes function, how genetic variations predispose the organism to disease, and how gene function can be used in disease detection, prevention, and treatment regimens.