One of the major areas of progress during the past decade has been the generation of an enormous amount of DNA sequence. For an ever-increasing number of organisms a complete genome sequence is now known. The sequences of several bacteria, including Haemophilus influenzae, Mycoplasma genitalium, the archaebacterium Methanococcus jannaschii, and the cyanobacterium Synechocystis sp. PCC 6803, were completed in the mid-nineties. The Escherichia coli genome was completed in 1997. Since then the sequences of hundreds of other prokaryotes have been completed or are in progress. See http://img.jgi.doe.gov, http://genome.jgi-psf.org/mic_cur1.html and http://www.tigr.org/tdb/mdb/mdbcomplete.html for a summary of microbial genome projects. The Institute for Genomic Research (TIGR) in Maryland and the Kazusa DNA Research Institute in Japan played a major role in early sequencing efforts. Yeast has the distinction of being the first eukaryote for which the complete genome sequence was finished (in 1996). A large number of microbial and other genome sequences are now generated at the DOE-funded Joint Genome Institute (JGI; http://genome.jgi-psf.org/mic_home.html) in Walnut Creek, CA. With improved techniques and increasing capacity and accuracy of automated sequencers, bacterial genomes now can be sequenced within a few days if one has a sufficiently large number of automated sequencers lined up. However, each of these sequencing projects is about three orders of magnitude smaller than the Human Genome Project.
The first eukaryote with a larger genome that had its genome sequence completed was a nematode (worm) by the name of Caenorhabditis elegans (C. elegans for short). The nematode genome is about 97 million base pairs long and contains about 20,000 genes; sequencing was essentially completed in 1998. Groups involved in the sequencing effort were at the Sanger Center near Cambridge (UK) and at Washington University (St Louis, MO), and therefore this sequencing effort was done in an academic (rather than industrial) setting. As a demonstration project to illustrate the feasibility of their sequencing approach for complex genomes, the Perkin-Elmer/Venter initiative (in collaboration with scientists at UC-Berkeley) sequenced the genome of Drosophila, the fruitfly. The Drosophila genome is about 4% of the size of the human one. The project used 230 automated sequencing machines working in parallel to collectively churn out 100 million base pairs per day. The initiative started in April 1999, and by August 2000 most of the 120-million-nucleotide Drosophila sequence had been determined. However, the sequence was still being annotated and remaining gaps were still being closed several years later (see http://www.fruitfly.org). This illustrates the difficulty of generating a full genomic sequence for eukaryotes. In addition to determination of the human genome sequence, genomes of other mammals (mouse, rat) are also being sequenced. The first plant with a virtually complete genome sequence is Arabidopsis thaliana (about 125 million nucleotides) (http://www.arabidopsis.org). The first agronomically important plant with a sequenced genome is rice (about 400 million nucleotides); a draft sequence of the genome of two rice varieties was published in 2002, and a more complete sequence came out in 2005.
The first fish with a completed genome sequence was the puffer fish (2002) (http://genome.jgi-psf.org/fugu6/fugu6.home.html), with the zebra fish second (http://www.sanger.ac.uk/Projects/D_rerio). A web site with links to several genome projects is http://www.ncbi.nlm.nih.gov/Genomes/index.html. Recent news on genome projects and related issues can be found at http://fullcoverage.yahoo.com/fc/Science/Biotechnology_and_Genetics.
The results of the smaller sequencing efforts have made it clear that there is much to be learned from a chromosome or genome sequence. Very importantly and unexpectedly, a large number of long open reading frames identified in the genome sequences do not have known homologues with similar sequence in other organisms. This is new information, and the function of the products of these putative genes remains to be established. In the case of yeast and Synechocystis sp. PCC 6803, this can be done relatively easily by making targeted mutations in these putative genes, and analyzing the effect of these introduced mutations. For other organisms, less direct but nonetheless fairly effective methods (for example, RNAi) to "knock out" genes have been developed. This will provide information on what type of protein that particular gene is coding for. In the case of humans, however, the introduction of targeted mutations is not desirable.
The Human Genome Project
The entire human genome is about 3 billion nucleotides long. In the mid-nineties, sequencing cost less than $1 per nucleotide (it currently is much cheaper than that), and the idea came up to find funds to sequence the entire human genome. Certainly an ambitious project, but why not? The project, called the Human Genome Project (HGP), had an unusual origin. It was not initiated by a committee of molecular geneticists, or by the major biomedical funding agency, NIH. Instead, it was proposed by an administrator in the Department of Energy (DOE), convinced that the powerful tools of molecular biology made it appropriate to introduce centrally administered "big science" into biomedical research. Later on, the National Institutes of Health (NIH) jumped on the HGP bandwagon as well. Before long the Human Genome Project had become an international venture, with participation by Japan, England, Italy, and other countries in the Human Genome Organization (HUGO). HUGO serves as an international umbrella for information exchange and collaborations. An extensive web site on the Human Genome Project is found at http://www.ornl.gov/TechResources/Human_Genome/home.html, and HUGO's (comparatively much less informative) web site is at http://www.hugo-international.org/.
The Human Genome Project was started up with James Watson (from the Nobel-prize winning Watson & Crick double-helix) as director. Dr. Watson resigned in 1992, but his initial leadership provided the program with sufficient impetus that it had gone beyond the "point-of-no-return". The project was planned to take 15 years (completion in 2005), and initially progress was slow. The initial phase was to provide a physical and linkage map of all human chromosomes, providing the genetic location of diseases with a genetic basis on one of the chromosomes. For this purpose, thousands of "probes" (relatively small pieces of expressed DNA of 1,000-5,000 nucleotides in length, and with a unique sequence) that are spaced at regular intervals of about 100,000 nucleotides were identified. This information already had a medical payoff, in that genetic screening for a particular disease became much easier (see below). After making a genetic map of all chromosomes, brute-force sequencing was applied and all nucleotide sequences were put into huge databases in which sequences were put together and in which introns, open reading frames, sequence repeats, etc. were directly analyzed.
The publicly funded Human Genome Project, which aimed at a complete sequence by 2005 and which appeared behind schedule, received a significant boost (or kick in the behind?) from a private effort spearheaded by Perkin-Elmer Inc. (the main DNA sequencer manufacturer) and J. Craig Venter, the former president and director of The Institute for Genomic Research (TIGR) who subsequently headed the Perkin-Elmer sequencing subsidiary, Celera Genomics. This private effort aimed at sequencing the entire human genome by 2001 at a cost of $150-300 million using a shotgun cloning and sequencing approach they pioneered for bacterial genomes. The publicly funded Human Genome Project was supposed to cost 10 times more. The bottom line is that the Human Genome Project and Celera Genomics made a joint announcement in June 2000 that a working draft of the DNA sequence of euchromatin regions (containing most genes) of the human genome had been developed. At that time, about 65% of the genome had been sequenced, well ahead of the initial schedule. The remainder of the sequence has now been determined and assembled, except for sequence repeat regions and other regions that are hard to interpret and assemble. Attention has now turned to finding where genes lie in the DNA sequence and what functions those genes control. This information can be used to fashion better medical treatments and to shape the direction of development for new drugs.
A main difference between publicly and privately funded genome projects may be timely release of sequence data. Perkin-Elmer/Celera can sell access to annotated databases as well as patent rights of genes to pharmaceutical companies and biotech firms. However, raw sequence data are made available for free.
Initially, the human genome was thought to contain up to 100,000 genes, occupying about 3% of the total genome. The function of much of the remainder of the genetic material is as yet unclear. With the majority of the human genome sequence in hand, it is clear that the number of genes has been overestimated by about a factor of 3, and that humans (with their 30,000 genes or so) have just 2-fold more genes than a simple plant, worm, or fruitfly, and just 10-fold more genes than most bacteria. The human gene map and other information can be found at http://www.ncbi.nlm.nih.gov/genome/guide/.
The human genome sequence, written down as a "word" of over 3 billion letters, fills the equivalent of 134 sets of the complete volumes of the Encyclopaedia Britannica.
A company as well as a federal government agency (NIH) have attempted to patent gene sequences of unknown function; the patent offices did not accept these as patentable. However, even though random sequences are not patentable, it is possible that complete genes can be patented for direct use (for example, for a test kit). But even if patenting is not straightforward, companies where new sequence is generated (or that contract with sequencing companies) will be able to capitalize on this information before anyone else learns of its existence. Therefore, the most justifiable and ethically correct attitude in this respect may be to have all sequence information enter the public domain expediently; commercial enterprises that wish to capitalize on this information may then do so if desired. This will avoid secrecy in data gathering (which would lead to unnecessary duplication of efforts), and maximize openness.
When sequencing large stretches of DNA with hitherto unknown sequence, it is of utmost importance to be highly accurate in the sequence analysis. Omission of a single nucleotide in a sequenced gene will shift the reading frame, and will result in an entirely erroneous derived amino acid sequence behind the location of the sequence analysis error. Currently, a 99.97% accuracy is quoted in sequence determinations. Even though this suggests that DNA sequencing is extremely precise (which it is), it should be kept in mind that there may be 3 errors in every stretch of 10,000 nucleotides of determined sequence. The main source of errors is the omission or addition of a single nucleotide, and therefore there may be a reading-frame shift every 3,000 nucleotides or so (on average). Keeping in mind that genes generally are somewhere between 100 and 10,000 nucleotides in length, it is likely that a significant percentage of gene sequences that have been determined carry a frame shift. Therefore, effort is being made to further improve the DNA sequencing accuracy in genome projects. However, with the genome sequence being determined for a large number of different organisms, it becomes much easier to spot regions that are expected to contain frame shifts (how??).
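The arithmetic in this paragraph can be checked with a short Python sketch. It takes the 99.97% accuracy and the "one frame shift per 3,000 nucleotides or so" figure from the text; the assumption that errors occur independently and uniformly along the sequence is a simplification introduced here, and the function names are illustrative only.

```python
def expected_errors(length_nt, accuracy=0.9997):
    """Expected number of wrongly determined positions in length_nt nucleotides."""
    return length_nt * (1.0 - accuracy)


def prob_frameshift_in_gene(gene_length_nt, nt_per_frameshift=3000):
    """Probability that a gene of the given length carries at least one frame shift.

    Assumption (not from the text): insertion/deletion errors are independent
    and occur at a uniform rate of one per nt_per_frameshift nucleotides.
    """
    per_nt_rate = 1.0 / nt_per_frameshift
    return 1.0 - (1.0 - per_nt_rate) ** gene_length_nt


if __name__ == "__main__":
    print("errors per 10,000 nt:", round(expected_errors(10_000), 2))
    for gene_length in (100, 1_000, 10_000):
        p = prob_frameshift_in_gene(gene_length)
        print(f"gene of {gene_length} nt: P(frame shift) = {p:.2f}")
```

Under these assumptions a gene of about 1,000 nucleotides has a chance of somewhat over one in four of containing a frame shift, which is why the text calls the affected percentage of genes significant.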
Use of genomic data
The amount of information that results from genomic sequencing projects is stunning and overwhelming. In the case of eukaryotes containing introns in their genes, a first challenge is to accurately assign introns so that coding sequences can be correctly predicted. In most organisms this is not trivial. Part of the assignment depends on monitoring codon usage: most organisms have some codons that are not or rarely used, and for the third nucleotide of a codon (remember that there is sequence degeneracy at mainly the third position for many codons!) generally there is a clear preference depending on the organism.
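Codon usage monitoring, as described above, amounts to counting codons in sequences already known to be coding. The following Python sketch shows one simple way to do this; the function names are hypothetical, and the sequence is assumed to be in frame starting at its first nucleotide.

```python
from collections import Counter


def codon_usage(coding_seq):
    """Count each codon in a coding sequence read in frame from position 0."""
    seq = coding_seq.upper().replace("U", "T")  # accept RNA or DNA letters
    usable = len(seq) - len(seq) % 3            # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def third_position_preference(coding_seq):
    """Fraction of codons ending in each nucleotide (third-position bias)."""
    usage = codon_usage(coding_seq)
    total = sum(usage.values())
    third = Counter()
    for codon, count in usage.items():
        third[codon[2]] += count
    return {nt: count / total for nt, count in third.items()}
```

Comparing such counts for a candidate exon against the counts from known genes of the same organism gives one indication of whether the candidate region is likely to be coding.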
Prokaryotic assignments usually are simpler because of the lack of introns and because of the smaller genome. However, the start codon for the start of translation in prokaryotes is not always AUG (as it is in eukaryotes) but can also be GUG, CUG, or UUG, and, at lower frequency, selected other codons. Therefore, the search for start codons is a little more complex than in eukaryotes.
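A minimal Python sketch of an open reading frame search that allows the alternative prokaryotic start codons mentioned above. The codons are written in their DNA form (ATG, GTG, CTG, TTG). As a simplification introduced here, only the three forward-strand frames are scanned, and each reading frame is taken to begin at the first start codon encountered; a real gene finder would also examine the complementary strand and weigh other evidence.

```python
START_CODONS = {"ATG", "GTG", "CTG", "TTG"}   # DNA forms of AUG, GUG, CUG, UUG
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2, starts=START_CODONS):
    """Return (start, end, frame) for forward-strand open reading frames.

    start is the index of the first start codon, end is the index just past
    the stop codon, and min_codons is the minimum number of codons before
    the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if orf_start is None:
                if codon in starts:
                    orf_start = i              # open a new reading frame
            elif codon in STOP_CODONS:
                if (i - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, i + 3, frame))
                orf_start = None               # close it at the stop codon
    return orfs
```

Calling `find_orfs(seq, starts={"ATG"})` restricts the search to the standard start codon; comparing the two results shows how many candidate reading frames depend on an alternative start.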
Once a genomic sequence has been determined and open reading frames (possibly translated regions, from start codon to stop codon) have been defined, then one can try to assign a function to the proteins coded for by the various open reading frames. The assignment of a function often involves comparing the sequence with sequences of known proteins from other organisms. A useful software program in this respect is BLAST (http://www.ncbi.nlm.nih.gov/BLAST/; also see the laboratory section of this course). However, there are many open reading frames that do not have known homologues in other organisms, and in such cases one will need to resort to experiments such as deletion mutagenesis of the open reading frame in an appropriate organism, etc.
Once a genome sequence is known, it becomes possible to do genomic expression analysis, and determine which genes are expressed under which conditions. This information may also provide clues regarding the function of open reading frames coding for unknown proteins. Methods have been developed to obtain comprehensive insight into the expression of most genes in an organism. These methods generally utilize a DNA chip (also known as gene chip or microarray), in which a collection of hundreds or thousands of genes (cDNAs) or gene fragments (oligonucleotides 25-70 bases long) from one organism has been spotted on a microscope slide. After isolating mRNA from this same organism (perhaps from a particular tissue or when grown under specific conditions), one can do a reverse transcription of this mRNA to complementary DNA (cDNA), incorporating a fluorescent label in the cDNA. The cDNA is then hybridized to the blot or microarray, and the relative level of expression for each gene can be determined in a single experiment. The drawback of this genomic analysis using microarrays is that it requires expensive chemicals and equipment; Affymetrix (http://www.affymetrix.com) and Agilent (http://www.chem.agilent.com/Scripts/PCol.asp?lPage=494) are leaders in this field.
Gene chips provide information about transcript levels. While this information is important and informative, transcript levels do not necessarily correspond to the amount of protein coded for by these transcripts. As protein ultimately is the material that is functionally relevant (enzyme activity, structural function, etc.), there is much interest in developing ways to determine the level of many different proteins in a tissue or cell. Antibody arrays have been developed, in which a large number of different antibodies have been attached to different locations in an array. After extraction of proteins from a tissue and labeling them with a fluorescent marker, the mixture can be incubated with an antibody array, and the relative intensity of the spots on the array (each representing the relative amount of cross-reaction with a specific antibody) can be determined. An example of such an antibody array is found at http://www.clontech.com/clontech/products/families/abarray/index.shtml.
Genetic disorders
A number of diseases are linked to a defect in a specific gene. In such cases, it is important to find the faulty gene: then treatment may become easier and more direct (for example, by regular injections with the gene product in the case of insufficient gene expression or of a mutation within the coding region, as now is done to treat diabetes). Mapping of a certain disease to a defect in a specific region of the genome used to be a huge job. For example, to map and sequence the gene for cystic fibrosis took six years of research by a large number of groups, at an expense of approximately $150 million (5% of the cost of the entire human genome project). With a genomic sequence in hand, a genetic disease is mapped, narrowing down the disease locus to 10-100 genes. Based on predicted gene function, the locus may be narrowed down further, and candidate genes may be sequenced from patients and healthy individuals. Gene localization may now be accomplished in a matter of months, at a fraction of the cost of gene localization in the pre-genomic era.
In many cases, genetic disorders are due to single nucleotide mutations. However, in some cases small genome rearrangements may have taken place, affecting the expression of specific genes. These rearrangements may not have led to changes in gene sequence. However, they can be readily determined by Southern blotting. Total genomic DNA (obtained from readily available cells, such as white blood cells) is chopped up using restriction endonucleases (recognizing specific 4-8 bp sequences in the DNA) into fragments of variable length that are separated by electrophoresis. The DNA fragments are transferred to nylon membrane, and probed with the gene or DNA region of interest. If there were major rearrangements or sequence alterations in the region of the gene in the patient, then the length of the fragment(s) obtained from patients with the disease may differ from the length obtained from unaffected individuals.
To test for single-base mutations, the simplest method is SNP (single nucleotide polymorphism) analysis using a microarray carrying oligonucleotides with similar but non-identical sequences. Comparing relative hybridization intensities, the sequence at specific loci can be predicted reliably without the need for large-scale sequencing. Identified regions may be sequenced to confirm the conclusion from SNP array analysis.
The reason for the large interest of companies in SNP analysis is obvious: A collection of thousands of relatively frequently occurring SNPs linked to genetic disorders may be put on a single DNA chip, and people may be screened for any mutations that may potentially impact their health. For a healthy individual, the presence of mutations that are linked to an increased occurrence of a particular disease may lead to increased vigilance in this respect, whereas for a patient the reason for particular symptoms may be diagnosed expediently.
Restriction fragment length polymorphism (RFLP)
The genome sequence of each individual is different. Also, within one individual the two homologous chromosomes (one originating from each parent) have differences. On average, 1 out of 100-500 nucleotides is different between two individuals. Of course, the number of differences is smaller when individuals are closely related. In many cases, these genetic variations between individuals are neutral: one variant is neither more helpful nor more harmful for survival and reproduction than the other. These variations usually occur in regions of the genome that do not contain structural genes. They are more likely to be found in introns, pseudogenes, and sequences between genes of no known function than in sequences that are known to encode proteins. The genotypic variations that do not have phenotypic effects are called silent mutations.
The inter- and intraindividual sequence variations often are concentrated in some regions of the genome, and are less frequent in others. Thus, in certain genome regions sequences are variable, and virtually any individual may have their own, unique pair of DNA sequences for this region. This property is taken advantage of by the technique of restriction fragment length polymorphism (RFLP) mapping. A specific sequence may contain a restriction site for a certain enzyme in one of the chromosomes of one individual, but the site may be absent in another person. Thus, when using a probe of cloned DNA around this sequence, the restriction fragment lengths from genomic DNA (on a Southern blot) hybridizing to the probe may be different in different persons. Using different restriction enzymes, individuals that have a similar pattern with one restriction enzyme may have different patterns with another enzyme. Thus, it is possible to obtain a "DNA fingerprint" that is different for each individual, provided that sufficient probes for different polymorphic DNA regions are used, and that a number of different restriction enzymes have been applied.
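The principle behind an RFLP can be illustrated with a small Python sketch that cuts a linear DNA sequence at every occurrence of a restriction site and reports the fragment lengths. The default site is that of EcoRI (GAATTC, cut after the first G); the function name and the toy sequences in the usage example are made up for illustration. The sketch lists all fragments, whereas on a Southern blot only the fragments hybridizing to the probe would be seen.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting linear DNA at each site.

    cut_offset is the position within the recognition site where the
    strand is cut (1 for EcoRI, which cuts between G and AATTC).
    """
    dna = dna.upper()
    cuts = []
    pos = dna.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = dna.find(site, pos + 1)
    bounds = [0] + cuts + [len(dna)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Two hypothetical alleles differing by one nucleotide in the second site.
    allele_a = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 5
    allele_b = "A" * 10 + "GAATTC" + "C" * 20 + "GAGTTC" + "T" * 5
    print("allele A fragments:", digest(allele_a))
    print("allele B fragments:", digest(allele_b))
```

A single nucleotide change that destroys one restriction site merges two fragments into one longer fragment, which is the length difference detected in an RFLP analysis.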
In addition to obvious advantages of RFLP mapping for forensic applications (see a subsequent section), RFLPs are extremely useful as markers to localize disease-related genes on specific chromosomes. For example, if in one family a certain disease is found to be linked to the occurrence of a specific restriction fragment length in a variable region in different siblings, whereas other family members who do not have the disease do not show that RFLP pattern, it is likely that the RFLP locus and the locus where the disease-causing allele is located are linked. Thus, if it is known where that particular RFLP region is, then it is known where (approximately) to find the mutated gene. A RFLP probe repository has been established to facilitate the exchange of probes between investigators, so that it becomes possible to establish relationships between markers.
One of the first diseases localized by genetic linkage analysis using RFLPs was Huntington's disease. This disease is genetically dominant, shows up at 35-40 years of age, and causes a progressive degeneration of the central nervous system. Death follows within several years after the first symptoms occur. As this disease is dominant, it is simple to follow. For recessive disorders, the use of RFLP to locate genes associated with the genetic disorder is less trivial, since it may not be known until several generations later who is a carrier of the disease (heterozygote without disease symptoms) and who is not. To map a recessive disease to a specific chromosome and location, it often is necessary to use families in which at least two siblings with the disease and healthy parents are available for study. All affected siblings within one family will have the same pair of disease-causing alleles, and will share RFLP patterns for probes linked to the gene associated with the disease. Healthy siblings and parents will not have the same RFLP pattern as siblings afflicted with the disease. This approach has been successful in locating the cystic fibrosis locus to the long arm of chromosome 7, and in eventually cloning and sequencing the gene that is altered in cystic fibrosis patients. The genetic defects that are the causes for retinoblastoma and for sickle-cell anemia (the latter being a mutation in the globin gene) have been mapped by this methodology to chromosomes 13 and 11, respectively. Many other diseases have been mapped as well (see http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=gnd.preface.91).
A complication (but at the same time a helpful feature) of genome mapping is recombination between chromosomes during meiosis. This phenomenon is known as crossing over. If the RFLP region and the locus of the genetic defect are close together, the probability of crossing over in between the two regions is low. However, if the distance is larger, the probability of recombination in between the two regions becomes larger as well. A longer distance (and thus more frequent crossing over between the RFLP marker and the gene of interest) complicates the interpretation of linkage studies. However, on the positive side, one can also calculate the approximate distance between the marker and the gene from the frequency that a certain RFLP pattern does not co-inherit with a certain disease.
Identifying a gene once the locus has been "mapped" is not trivial, but the availability of the genome sequence is a great help. For example, a linkage between a certain RFLP pattern and a certain allele that has a chance of more than 95% to remain linked upon meiosis merely means that the allele and the RFLP region are not more than some 5 million nucleotides from each other. With a high map density of RFLPs a still closer marker may be found, but as the markers get closer to each other, recombination between them will be observed so rarely that it will be impossible to determine with a reasonable degree of confidence where the gene of interest is located with respect to the RFLP markers. However, on average there are not more than 10-100 genes in a 5 million nucleotide span on the human genome, and when a disease has been mapped to such a region, one can first study the predicted function of the genes in this region and see whether their impairment might lead to symptoms found in the disease. If so, such genes (and their surrounding regions) may be analyzed first. If no obvious candidate genes are identified, then expression of genes in the mapped region may be monitored, and the reason for possibly aberrant expression patterns for a specific gene may be identified.
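The distance estimate discussed above can be written as a short Python sketch. It uses the rule of thumb implied by the text (a 5% chance of recombination corresponds to roughly 5 million nucleotides, so about 1 million nucleotides per 1% recombination); this is a rough average for the human genome, it holds only for small recombination frequencies, and the function names are illustrative.

```python
# Rough human average implied by the text: about 1 million nucleotides per
# 1% recombination. Actual values vary considerably along the genome.
BP_PER_PERCENT_RECOMBINATION = 1_000_000


def recombination_fraction(recombinant_offspring, total_offspring):
    """Fraction of offspring in which marker and disease allele were separated."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return recombinant_offspring / total_offspring


def approx_distance_bp(fraction):
    """Approximate marker-to-gene distance in nucleotides.

    Only meaningful for small fractions; at larger distances multiple
    crossovers make the relationship non-linear.
    """
    return fraction * 100 * BP_PER_PERCENT_RECOMBINATION


if __name__ == "__main__":
    theta = recombination_fraction(3, 60)   # hypothetical family data
    print(f"recombination fraction: {theta:.2f}")
    print(f"approximate distance: {approx_distance_bp(theta):,.0f} nucleotides")
```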
In a number of cases a disease has been found to be linked to the sex of the individual. For example, hemophilia, red-green colorblindness, and Duchenne Muscular Dystrophy almost exclusively occur in males. If such a strong sex-linkage is found, it is evidence that the gene related to the disease is on the X chromosome: as males have only one X chromosome, X-linked traits that are recessive in females are essentially dominant in males. This sex linkage in some cases has been known for a long time. A good example in this respect is hemophilia, a deficiency in clotting factor VIII, which impairs healing of wounds: the Talmud, written about 1,500 years ago summarizing oral laws of Jewish religion, tells of a Rabbi instructing a woman not to have her son circumcised after three of her sister's sons had bled to death. But no exception would be granted if her brother's sons (rather than her sister's) had met a similar fate. Explain why.
It should be kept in mind that a number of diseases are multi-factorial: they are caused by a combination of factors, none of which by itself would cause the disease. This complicates matters greatly, in that it is very difficult to track such a disease genetically. On the other hand, diseases with very similar manifestations may be caused by mutations in different genes. This again complicates identification of genetic changes correlated with such a disease. Explain why.
A map has been compiled of the human chromosomes with about 11,000 approximate location(s) of mutations or loci linked to genetic diseases. This map (OMIM: Online Mendelian Inheritance in Man) can be found at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM.
Genetic screening for common diseases already is a common part of obstetrical care and a standard procedure for newborns. Also, in cases where there are major medical concerns regarding the health of an unborn baby on the basis of genetic diseases occurring in the family of either parent, specific tests can be carried out on the genetic information carried by the foetus. In many cases, cells from the amniotic fluid that bathes the foetus are used as the test material for either biochemical or genetic screens. Similar information also can be obtained by chorionic villus sampling. In this procedure, the doctor removes a piece of tissue from the developing placenta for study; this procedure can be carried out earlier in pregnancy. Pre-conception counseling and screening can also be carried out: for example, in some communities of Hasidic Jews, couples are screened before marriage to detect whether both carry one defective gene for hexosaminidase A. If so, there is a 25% chance that a child will not be able to produce this enzyme, and will suffer from Tay-Sachs disease. Children who inherit two impaired hexosaminidase genes suffer severe neurological effects and typically die before age four.
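The 25% figure for two carrier parents follows from enumerating the four equally likely allele combinations (a Punnett square). A minimal Python sketch, in which "A" stands for the functional allele and "a" for the defective one (the notation and function name are chosen here for illustration):

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotypes(parent1, parent2):
    """Genotype probabilities for a child, given two-letter parental genotypes.

    Each parent passes on one of its two alleles with equal probability.
    """
    combos = Counter(
        "".join(sorted(a + b)) for a, b in product(parent1, parent2)
    )
    total = sum(combos.values())
    return {genotype: Fraction(n, total) for genotype, n in combos.items()}


if __name__ == "__main__":
    for genotype, prob in offspring_genotypes("Aa", "Aa").items():
        print(genotype, prob)
```

For two carriers ("Aa" and "Aa") the child is affected ("aa") with probability 1/4, is a carrier with probability 1/2, and carries no defective allele with probability 1/4. If only one parent is a carrier, no child is affected.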
The occurrence of relatively rare recessive diseases may be much more frequent in communities that traditionally have been "self-supporting" in terms of finding a partner in life. Inbreeding significantly increases the probability with which recessive diseases surface. In this respect Tay-Sachs disease already has been mentioned, which is much more frequent in Hasidic Jews than in other populations. Another example is the Ellis-Van Creveld syndrome ("six-fingered dwarfism"), of which 50 cases are known among the Amish population in Lancaster County, PA. Worldwide, only another 50 cases of this disease have been recorded. A little closer to home is the relatively high occurrence (0.5%) of albinism among the Hopi Indians. This albinism, due to tyrosinase deficiency that causes a lack of melanin (pigment) synthesis, occurs at a 0.002% rate worldwide.
Routine genetic screening can be as simple as counting chromosomes (an extra chromosome 21 is indicative of Down's syndrome). If specific genetic diseases run in a family, then genetic screening can involve RFLP probing related to genes that are associated with the disease (thalassemia, sickle-cell anemia, hemophilia, or cystic fibrosis, to name a few). More than 4,000 inherited diseases are due to single-gene defects. The RFLP restriction pattern of the foetus (or young person) being screened is then compared to that of various family members, some with the disease and some without. From this comparison, it is usually possible to determine whether the foetus or young person will develop the disease or not. This information is advantageous for early diagnosis of a disease. In many cases an effective treatment of the disease is possible if diagnosed at an early stage, often even earlier than when the first symptoms develop. A good example for this is phenylketonuria (PKU), which is due to a lack of phenylalanine hydroxylase, which converts the amino acid phenylalanine to the amino acid tyrosine. Absence of a functional enzyme leads to accumulation of phenylalanine and toxic derivatives. The latter result in seizures, mental retardation, and a decreased life span. Blood of infants usually is tested for unusually high levels of phenylalanine. If a high phenylalanine level is found, this may indicate phenylketonuria. The PKU symptoms and damage can be avoided by having a phenylalanine-free diet, particularly during the first 15 years of life. PKU patients should also avoid aspartame (NutraSweet), as this is a phenylalanine derivative.
Results of simple genetic and/or biochemical screening also can provide information regarding someone's ancestry. Most examples to date relate to biochemical blood analysis. For example, essentially all American Indians carry the so-called Rhesus factor (which protein got its name by reacting with an antiserum from Rhesus monkeys). Virtually the entire indigenous population of East Asia does so too, thus confirming the hypothesis of common ancestry of the two groups. Absence of the Rhesus factor is quite common among blacks and whites. Tribes of Native Americans, such as the Sioux, that have mixed more with whites and blacks have a larger proportion of rhesus-negative individuals (not carrying the rhesus factor) than tribes that have remained relatively "pure" for a longer time (such as the Navajo). As another example, whites can gauge their heritage by their blood group: blood group B is thought to have entered Europe with the invasion of the Tatars from Asia. Thus, if someone has blood group B (or AB), this most likely means the individual has some Tatar ancestry. The reverse conclusion (blood group O or A, thus no Tatar ancestry) is not justified; explain why. Similar population-genetic tests can be done by RFLP mapping.
Most intriguing, and potentially most troubling, are genetic tests that peer well into the future. Huntington's disease, for example, typically appears after age 30; a person with an affected parent stands a 50% chance of developing the fatal disorder. Being tested for the single deadly gene can bring a powerful surge of relief, or the certainty of progressive mental and physical decline (no treatment is yet available). For some inherited adult-onset disorders, however, advance warning can serve to influence the course of the disease. Adult polycystic kidney disease invariably leads to kidney failure, but detecting concomitant high blood pressure early and effectively controlling it may delay the need for chronic kidney dialysis. Tests for susceptibility to certain malignancies, such as retinoblastoma (an eye cancer already occurring during infancy and childhood), or familial intestinal polyposis (which usually progresses to colon cancer) permit stepped-up vigilance and early treatment.
Also, in 1993 two genes (msh2 and mlh1) were identified that predispose people to non-polyposis colon cancer; this type of cancer strikes one in 20 people, and in about 20% of those cases the cancer is linked to specific mutations in the msh2 or mlh1 genes. About 10 companies already have purchased the rights to develop msh2 and mlh1 tests, and thus have staked their claim in this huge market. Another gene, brca1, has been linked to breast and ovarian cancer, and a presymptomatic test for this gene, again with a huge market, has been developed. However, a number of important questions should be asked regarding ethical implications of these gene tests: Is it ethical to test for diseases for which there are no known cures? How reliable are the available tests? What are the psychological consequences for healthy patients of learning their possible destiny? Is the regulation of laboratories that offer genetic testing stringent enough to ensure that life-shattering errors are not made? How can perfectly healthy people who may carry a defective gene be protected from discrimination by health and life insurance companies and potential employers?
One key issue in this respect is whether the knowledge provided by gene testing will actually save lives. For Huntington's disease, the answer is clearly no, as no cure is available. For cancers, the answer is not clear. In general, early detection of cancers is associated with improved survival. But the question is whether interventions that work for the general population are adequate for individuals with a strong genetic risk. For example, mammograms that are used to detect breast cancer early may not be advisable for people with brca1 mutations, as low doses of radiation might conceivably trigger further mutations that could lead to cancer.
The approval mechanism for new genetic testing materials involves the Food and Drug Administration (http://www.fda.gov). The FDA was specifically charged with this task through a recommendation (http://www.hhs.gov/news/press/2001pres/01fsgenetictests.html) by the Secretary's Advisory Committee on Genetic Testing (currently subsumed in the Secretary's Advisory Committee on Genetics, Health and Society) at NIH (http://www4.od.nih.gov/oba/sacghs.htm). This website has interesting documents on genetic discrimination. More genetic testing information is available at http://www.ornl.gov/sci/techresources/Human_Genome/medicine/genetest.shtml. The dilemma of whether or not to participate in genetic testing was the topic of discussion on National Public Radio in 2004 (http://www.npr.org/programs/watc/features/2004/may/genes/index.html). Three percent of the Human Genome Project budget is set aside to address ethical considerations. In particular, the use of genetic screening by insurance companies or prospective employers is an area of considerable concern. An overview of the different aspects of the work done on the Ethical, Legal and Social Issues (ELSI) of the Human Genome Project is provided at http://www.ornl.gov/hgmis/elsi/elsi.html. This page is appropriately headed with the title "Societal Concerns Arising from the New Genetics", and refers to very useful and informative pages on genetics privacy and legislation, gene testing, gene therapy, behavioral genetics, patenting, and forensics at the same site.
Each individual carries a unique set of genetic information. Also, the genome is different between parent and child, and between children of the same parents (with the exception of identical twins): the child receives half of the genetic information of each parent, and it is by chance which chromosome of each pair in each parent is transmitted to a certain child. In humans, with 23 chromosome pairs, each parent can therefore pass on any of more than 8 million different combinations of chromosomes. Crossover during meiosis further complicates the inheritance pattern, as it adds a myriad of possibilities to recombine between the two homologous chromosomes of each parent.
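The number of chromosome combinations can be worked out directly. The short Python sketch below counts only independent assortment of the 23 human chromosome pairs and ignores crossover, so the figures are lower bounds; the function names are illustrative.

```python
# Count chromosome combinations from independent assortment alone.
# Crossover is ignored, so these numbers are lower bounds.

CHROMOSOME_PAIRS = 23  # number of chromosome pairs in humans


def gamete_combinations(pairs: int = CHROMOSOME_PAIRS) -> int:
    """Distinct chromosome sets one parent can pass on (one chromosome per pair)."""
    return 2 ** pairs


def offspring_combinations(pairs: int = CHROMOSOME_PAIRS) -> int:
    """Distinct chromosome sets a child can receive from two parents."""
    return gamete_combinations(pairs) ** 2


print(gamete_combinations())     # 8388608
print(offspring_combinations())  # 70368744177664
```

Each parent can thus contribute any of about 8.4 million chromosome sets, and a couple can produce about 7 x 10^13 different combinations even before crossover is taken into account.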
Since all individuals have their own unique set of DNA sequences, it is possible to identify everyone by their DNA. In the 1990s, RFLP mapping was used to compare the DNA from an individual with the DNA from a sample (for example, hair left at the site of a crime), "proving" or disproving that the DNA sample came from the individual. Disproving is simple (any difference in the RFLP pattern is usually significant), but proving that a sample came from a certain individual is much more difficult: many different probes need to be used to be statistically certain that this individual (and perhaps an identical-twin sibling) is the only one in the entire population with all these RFLP patterns.
To simplify analyses in forensic laboratories, moderately repetitive DNA (some 10-50 copies, often scattered throughout the genome) was used as a probe. Because of the repetitive nature of this DNA, probing with it yields a number of restriction fragments of different lengths, so that a large number of data points are obtained with a single probe. However, it is possible that there are some sequence differences between the probe and some copies of the repetitive sequences to which it hybridizes. This generally results in some stronger and some weaker bands. Moreover, the region of homology between the probe and some DNA fragments in some cases is relatively small, causing the band on the Southern blot to be quite weak.
The different bands visualized on one autoradiogram offer an obvious advantage: using one probe and one blot, many different bands can be found, each of which can be viewed as an independent "witness for the prosecution" or "witness for the defense". If all of the bands are identical between the DNA from the suspect and the DNA from the site of the crime, the probability that this is a coincidence can be estimated: this probability most likely is vanishingly small. However, if one or more bands do not correspond between the two samples, it is quite certain that the two samples do not come from the same individual.
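The reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not a forensic protocol: it assumes that the bands are statistically independent (the "product rule"), the band frequencies and fragment sizes are made-up values, and the 2% sizing tolerance is an arbitrary assumption.

```python
from math import prod


def random_match_probability(band_frequencies):
    """Chance that an unrelated person shares every band, assuming independence."""
    for frequency in band_frequencies:
        if not 0.0 < frequency <= 1.0:
            raise ValueError("each band frequency must be in (0, 1]")
    return prod(band_frequencies)


def bands_consistent(sample_a, sample_b, tolerance=0.02):
    """True if both samples show the same bands, within a relative sizing tolerance."""
    if len(sample_a) != len(sample_b):
        return False  # an extra or missing band already excludes a match
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return all(abs(a - b) <= tolerance * max(a, b) for a, b in pairs)


# Hypothetical example: ten bands, each present in 25% of the population.
print(random_match_probability([0.25] * 10))  # 9.5367431640625e-07

# Hypothetical fragment sizes in base pairs for suspect and crime-scene DNA.
print(bands_consistent([1200, 3400, 5100], [5080, 1195, 3410]))  # True
print(bands_consistent([1200, 3400, 5100], [1200, 3400]))        # False
```

With only ten such bands the chance of a coincidental match is already below one in a million, whereas a single missing band is enough to reject the match.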
In many instances, a large problem regarding the use of gene technology for forensic purposes used to be the limited availability of the DNA that is to be compared to that of the suspect. For a good Southern blot, a few µg of DNA are required. By comparison, a single diploid human cell contains only about 6 to 7 pg of DNA. So, several hundred thousand to a million cells would be needed, provided that all DNA can be extracted efficiently.
Therefore, more modern forensic techniques make use of automated DNA amplification systems by means of the polymerase chain reaction (PCR), as was discussed in a previous section. In this way, it is possible to amplify small regions of the genome, starting from DNA extracted from one or several cells. If PCR primers are chosen to border regions with a variable number of tandem repeats (VNTR), the sizes of PCR products resulting from a particular set of primers will depend on the individual whose DNA is used for amplification. Therefore, this information can be used for DNA fingerprinting as well.
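The dependence of PCR product size on the number of tandem repeats can be illustrated with a small Python sketch. The locus described here is hypothetical: the 16-bp repeat unit, the 120 bp of flanking sequence between the primers and the repeat region (primers included), and the genotypes are assumed values chosen only for illustration.

```python
def pcr_product_size(repeat_count, repeat_unit_bp, flanking_bp):
    """Length of the PCR product: flanking sequence plus the repeat array."""
    return flanking_bp + repeat_count * repeat_unit_bp


def locus_band_sizes(genotype, repeat_unit_bp=16, flanking_bp=120):
    """Band sizes for a diploid genotype given as (repeats allele 1, repeats allele 2).

    A homozygous individual gives a single band, a heterozygous one gives two.
    """
    sizes = {pcr_product_size(n, repeat_unit_bp, flanking_bp) for n in genotype}
    return sorted(sizes)


# Hypothetical individuals typed at the same VNTR locus.
print(locus_band_sizes((12, 15)))  # [312, 360]
print(locus_band_sizes((14, 14)))  # [344]
```

Individuals who differ in repeat number at the locus give PCR products of different sizes, and typing several such loci yields a fingerprint comparable to the multi-band RFLP pattern.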
The theory of DNA fingerprinting is crisp and clear. However, there have been a few cases where the experiments and interpretations of DNA fingerprinting were rather equivocal, and to avoid this, high standards have been set. The central problem is that the "real world" of a blood-soaked crime scene is a far cry from the controlled and near-immaculate environment of the research lab in which the technology was developed. While DNA in the lab is pure, DNA from cells gathered at the scene of the crime may be degraded, or may be contaminated with DNA from cells from other individuals, including the victim. Often more bands show up on the autoradiogram of DNA from the crime scene than in control DNA from the suspect and the victim. Another difficulty is determining how likely two DNA fingerprints are to match by chance. This probability very often is linked directly to the quality of the data. Nonetheless, DNA fingerprinting constitutes a very valuable addition to the forensic tests currently available. One of the companies that was a leader in the field of forensic DNA testing is Lifecodes, which was first taken over by Orchid (http://www.orchid.com/), which also took over CellMark, another DNA testing company, and is now part of Tepnel (http://www.tepnel.com/life_codes/overview.asp). Such companies also offer paternity testing, based on the same DNA fingerprinting principles. Sometimes, such testing can lead to surprising results: in one case it was found that the child whose paternity was in dispute was not the biological son of his mother. It was suspected that an inadvertent swap of babies had occurred in the hospital. In another case, sperm vials appeared to have been swapped during in-vitro fertilization, and a baby could be shown to have a biological father different from the spouse of his mother.
The history of DNA use in forensic cases already spans almost two decades. The first cases in which DNA evidence was introduced were in England. The first case in which DNA-related evidence was used in Arizona courts was the 1988 murder of Jennifer Wilson by Richard Bible near Flagstaff. Blood found on the back of Bible's plaid shirt was identified through DNA testing as Jennifer's blood, with a probability of 14 billion-to-1. Bible was subsequently convicted. This conviction was upheld unanimously by the Arizona Supreme Court. This, together with other legal opinions elsewhere, has paved the way for further use of DNA-related evidence in trials. The method came under national and international scrutiny during the OJ Simpson trial in the mid-nineties. While many uncertainties may remain after the trial, the admissibility and validity of DNA evidence in courts had clearly been established. Currently, DNA evidence is frequently used by the courts, and challenges of DNA-based evidence have virtually disappeared. Materials from a 2000 PBS-Frontline presentation on this topic are at http://www.pbs.org/wgbh/pages/frontline/shows/case/revolution/.
A unique (and sad) case of the utility of biotechnology in solving family issues was necessitated by actions of the former military government in Argentina. About 12,000 people were taken as political prisoners and "disappeared" between 1976 and 1983. Children whose parents thus "disappeared" usually were adopted by military personnel, sold, or in some instances "disappeared" as well (a phenomenon still occurring on the streets of Rio de Janeiro). Grandmothers tried to keep track of the fate of their grandchildren. After a change to a slightly more democratic government in 1983, a number of grandmothers tried to reclaim their grandchildren. But how to prove that someone indeed is your grandchild? Testing for different enzyme types and blood groups provided a 99.8% certainty of a relationship between grandparents and grandchild, but that was insufficient proof for the courts. Then a more reliable method was applied, essentially identical to that used for identification of the czarina and three of her children (see above): hypervariable regions of the mitochondrial genome (that do not code for proteins) were sequenced. As mitochondria (the power plants of the cell) are inherited only via the mother, the sequence of the mitochondrial genome of a grandchild is identical to that of the maternal grandmother. By comparing a sufficient length of mitochondrial DNA sequence, the relationship between grandmother and grandchild could be unequivocally established.
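The comparison of mitochondrial sequences can be sketched as follows in Python. This is a simplified illustration that assumes the two sequences are already aligned and of equal length; the short sequences are invented, and real analyses use several hundred base pairs of the hypervariable regions and allow for rare new mutations.

```python
def count_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))


def consistent_with_maternal_line(child_seq, grandmother_seq, max_differences=0):
    """True if the sequences match closely enough to share a maternal lineage."""
    return count_differences(child_seq, grandmother_seq) <= max_differences


# Invented 20-base stretches standing in for a hypervariable region.
grandmother = "ACGTACGTACGTACGTACGT"
grandchild = "ACGTACGTACGTACGTACGT"
unrelated_child = "ACATACGTATGTACGCACGT"

print(count_differences(grandmother, unrelated_child))              # 3
print(consistent_with_maternal_line(grandchild, grandmother))       # True
print(consistent_with_maternal_line(unrelated_child, grandmother))  # False
```

Because the hypervariable regions differ at many positions between unrelated maternal lineages, a perfect match over a long stretch is strong evidence for a shared maternal line.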
Other applications of DNA fingerprinting
Currently the principal applications of DNA fingerprinting (or DNA typing) are in forensic analysis, paternity disputes, and immigration cases. However, DNA marker systems have found many other actual or potential uses. These include: (1) monitoring the success of bone marrow transplants, (2) animal breeding, and (3) conservation biology.
Bone marrow transplants are used most frequently in people being treated for leukemia. In this operation, the patient's own cancerous bone marrow is destroyed by a combination of radiation and chemotherapy, and replaced by normal bone marrow from a healthy tissue-matched donor. Since blood cells are made in the bone marrow, the success of the operation can easily be monitored by blood DNA typing to check that the DNA in the circulating blood is that of the donor, not the patient. Any reappearance of the patient's own DNA in the blood might signal the reappearance of cancerous cells, and appropriate therapy can then be instituted.
Remarkably, the multilocus probe systems developed for human DNA typing also produce highly variable and informative patterns from a wide range of mammals, birds, reptiles, amphibians and fish, and in some cases even from invertebrates. Single-locus minisatellite and microsatellite markers are also being developed from non-human species. DNA typing is already providing animal breeders with a powerful new tool. For example, stolen animals can be identified from their DNA fingerprints. Similarly, DNA typing can be used to verify the identity of semen used in artificial insemination programs, and also to establish the pedigree of the animal. Indeed, several cases involving disputes over whether a champion dog had really sired a given puppy have been satisfactorily resolved by DNA fingerprinting. In the long term, DNA markers will make it possible to construct genetic maps of domesticated animals and thereby enable the eventual localization of genes controlling economically important traits, such as disease resistance, milk yield, and body weight.
As a tool in conservation biology, DNA typing is already helping to protect endangered species in various ways. It provides for the first time a method of identifying animals stolen from the wild. This can be achieved either by setting up a database of DNA fingerprints of wild animals against which a captive individual can be compared, or by showing that young individuals held by a breeder could not be the offspring of any of the other individuals that the breeder has in stock. A second application is helping zoos in their breeding programs, in particular by identifying closely related individuals and thereby minimizing the risk of inbreeding. More generally, DNA typing is beginning to revolutionize our understanding of the genetic makeup and breeding systems of natural populations, knowledge of which is of fundamental importance in monitoring the genetic diversity and reproductive success of natural animal populations.
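The parentage check mentioned above (showing that a young animal cannot be the offspring of any adults held by a breeder) can be sketched in Python. This is a minimal illustration under simplifying assumptions: single-locus markers with two alleles per individual, no mutation or typing errors, and the sex of the candidate parents is ignored. All names and genotypes are hypothetical.

```python
from itertools import combinations


def compatible_pair(child, parent_a, parent_b):
    """True if, at every locus, the child has one allele from each candidate parent."""
    for locus, (c1, c2) in child.items():
        a, b = parent_a[locus], parent_b[locus]
        if not ((c1 in a and c2 in b) or (c2 in a and c1 in b)):
            return False
    return True


def possible_parent_pairs(child, stock):
    """All pairs of adults in the breeder's stock that could be the child's parents."""
    return [
        (name_1, name_2)
        for (name_1, geno_1), (name_2, geno_2) in combinations(stock.items(), 2)
        if compatible_pair(child, geno_1, geno_2)
    ]


# Hypothetical genotypes: locus name -> pair of allele sizes.
stock = {
    "adult_1": {"locus_A": (10, 12), "locus_B": (7, 7)},
    "adult_2": {"locus_A": (11, 14), "locus_B": (8, 9)},
}
captive_bred = {"locus_A": (12, 14), "locus_B": (7, 9)}
suspect_young = {"locus_A": (15, 16), "locus_B": (7, 8)}

print(possible_parent_pairs(captive_bred, stock))   # [('adult_1', 'adult_2')]
print(possible_parent_pairs(suspect_young, stock))  # []
```

An empty result means that no pair of adults in stock can account for the young animal's alleles, which supports the suspicion that it was taken from the wild.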
In addition, when DNA fingerprints of species in protected areas are available, poached animals can be easily identified. Several public (http://www.forensicdna.ca/labservices.html) and private institutions have specialized in molecular-genetic analysis of wildlife.
Center for Bioenergy & Photosynthesis
Arizona State University
Room PSD 209
Tempe, AZ 85287-1604
13 February 2006
phone: (480) 965-1963
fax: (480) 965-2747