Bioinformatics and Functional Genomics

ISBN-10: 0470085851

ISBN-13: 9780470085851



The bestselling introduction to bioinformatics and functional genomics, now in an updated edition

Widely received in its previous edition, Bioinformatics and Functional Genomics offers the most broad-based introduction to this explosive new discipline. Now in a thoroughly updated and expanded Second Edition, it continues to be the go-to source for students and professionals involved in biomedical research.

This edition provides up-to-the-minute coverage of the fields of bioinformatics and genomics. Features new to this edition include:

Several fundamentally important proteins, such as globins, histones, insulin, and albumins, are included to better show how to apply bioinformatics tools to basic biological questions
A completely updated companion Web site, which will be updated as new information becomes available
Descriptions of genome sequencing projects spanning the tree of life
A stronger focus on how bioinformatics tools are used to understand human disease

The book is complemented by lavish illustrations and more than 500 figures and tables, fifty of which are entirely new to this edition. Each chapter includes a Problem Set, Pitfalls, Boxes explaining key techniques and mathematics/statistics principles, Summary, Recommended Reading, and a list of freely available software. Readers may visit a related Web page for supplemental information at www.wiley.com/go/pevsnerbioinformatics.

Bioinformatics and Functional Genomics, Second Edition serves as an excellent single-source textbook for advanced undergraduate and beginning graduate-level courses in the biological sciences and computer sciences. It is also an indispensable resource for biologists in a broad variety of disciplines who use the tools of bioinformatics and genomics to study particular research problems; bioinformaticists and computer scientists who develop computer algorithms and databases; and medical researchers and clinicians who want to understand the genomic basis of viral, bacterial, parasitic, or other diseases.

Bioinformatics and Functional Genomics

By Jonathan Pevsner
John Wiley & Sons
Copyright © 2003 Wiley-Liss
All rights reserved.
ISBN: 0-471-21004-8

Chapter One
Introduction

Bioinformatics represents a new field at the interface of the twentieth-century revolutions in molecular biology and computers. A focus of this new discipline is the use of computer databases and computer algorithms to analyze proteins, genes, and the complete collection of deoxyribonucleic acid (DNA) that comprises an organism (the genome). A major challenge in biology is to make sense of the enormous quantities of sequence data and structural data that are generated by genome-sequencing projects, proteomics, and other large-scale molecular biology efforts. The tools of bioinformatics include computer programs that help to reveal fundamental mechanisms underlying biological problems related to the structure and function of macromolecules, biochemical pathways, disease processes, and evolution.

According to a National Institutes of Health (NIH) definition, bioinformatics is "research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, analyze, or visualize such data." The related discipline of computational biology is "the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems."

While the discipline of bioinformatics focuses on the analysis of molecular sequences, genomics and functional genomics are two closely related disciplines. The goal of genomics is to determine and analyze the complete DNA sequence of an organism, that is, its genome. The DNA encodes genes, which can be expressed as ribonucleic acid (RNA) transcripts and then translated into protein.
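The flow from DNA to RNA transcript to protein can be illustrated with a minimal Python sketch. It is not taken from the book: the function names `transcribe` and `translate` and the nine-base example sequence are hypothetical, the input is assumed to be a coding-strand sequence that begins at a start codon and contains only A, C, G, and T, and the table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order:
# the first base varies slowest and the third base fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA matching a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # The lookup table is keyed by DNA codons, so map U back to T.
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]  # raises KeyError on ambiguous bases
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    example_dna = "ATGGCCTAA"  # hypothetical example, not a real gene
    mrna = transcribe(example_dna)
    print(mrna, translate(mrna))
```

For the arbitrary example sequence ATGGCCTAA, `transcribe` gives AUGGCCUAA and `translate` gives MA (methionine, alanine, then a stop codon). Real analyses also have to handle reading frames, ambiguous bases, and alternative genetic codes, all of which this sketch ignores.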
Functional genomics describes the use of genomewide assays for the study of gene and protein function.

The aim of this book is to explain both the theory and practice of bioinformatics. The book is especially designed to help the biology student use computer programs and databases to solve biological problems related to proteins, genes, and genomes. Bioinformatics is an integrative discipline, and our focus on individual proteins and genes is part of a larger effort to understand broad issues in biology such as the relationship of structure to function, development, and disease.

Organization of the Book

There are three main sections of the book. The first part explains how to access biological sequence data, particularly DNA and protein sequences (Chapter 2). Once sequences are obtained, we show how to compare two sequences (pairwise alignment; Chapter 3) and how to compare multiple sequences [primarily by the Basic Local Alignment Search Tool (BLAST); Chapters 4 and 5].

The second part of the book describes functional genomics approaches to RNA and protein. The central dogma of biology states that DNA is transcribed into RNA, which is then translated into protein. We will examine gene expression, including a description of the emerging technology of DNA microarrays (Chapters 6 and 7). We then consider proteins from the perspective of protein families, the analysis of individual proteins, protein structure, and multiple sequence alignment (Chapters 8-10). The relationships of protein and DNA sequences that are multiply aligned can be visualized in phylogenetic trees (Chapter 11). Chapter 11 thus introduces the subject of molecular evolution.

Since 1995, the genomes of several hundred bacteria and archaea, as well as fungi, animals, and plants, have been sequenced. The third section of the book covers genome analysis. Chapter 12 provides an overview of the study of completed genomes and then describes how the tools of bioinformatics can elucidate the tree of life.
We describe bioinformatics resources for the study of viruses (Chapter 13) and bacteria and archaea (Chapter 14; these are two of the three main branches of life). Next we examine a variety of eukaryotes (from fungi to primates; Chapters 15 and 16) and then the human genome (Chapter 17). Finally, we explore bioinformatic approaches to human disease (Chapter 18).

Bioinformatics: The Big Picture

We can summarize the entire field of bioinformatics with three perspectives. The first perspective on bioinformatics is the cell (Fig. 1.1). The central dogma of molecular biology is that DNA is transcribed into RNA and translated into protein. The focus of molecular biology has been on individual genes, messenger RNA (mRNA) transcripts, and proteins. A focus of the field of bioinformatics is the complete collection of DNA (the genome), RNA (the transcriptome), and protein sequences (the proteome) that have been amassed (Henikoff, 2002). These millions of molecular sequences present both great opportunities and great challenges. A bioinformatics approach to molecular sequence data involves the application of computer algorithms and computer databases to molecular and cellular biology. Such an approach is sometimes referred to as functional genomics. This typifies the essential nature of bioinformatics: biological questions can be approached from levels ranging from single genes and proteins to cellular pathways and networks or even whole genomic responses (Ideker et al., 2001). Our goals are to understand how to study both individual genes and proteins and collections of thousands of genes/proteins.

From the cell we can turn to individual organisms, the second perspective of the field of bioinformatics (Fig. 1.2). Each organism changes across different stages of development and (for multicellular organisms) across different regions of the body.
For example, while we may sometimes think of genes as static entities that specify features such as eye color or height, they are in fact dynamically regulated across time and region and in response to physiological state. Gene expression varies in disease states or in response to a variety of signals, both intrinsic and environmental. Many bioinformatics tools are available to study the broad biological questions relevant to the individual: there are many databases of expressed genes and proteins derived from different tissues and conditions. One of the most powerful applications of functional genomics is the use of DNA microarrays to measure the expression of thousands of genes in biological samples.

At the largest scale is the tree of life (Fig. 1.3; Chapter 12). There are many millions of species alive today, and they can be grouped into the three major branches of bacteria, archaea (single-celled microbes that tend to live in extreme environments), and eukaryotes. Molecular sequence databases currently hold DNA sequence from over 100,000 different organisms. The complete genome sequences of several hundred organisms will soon become available. One of the main lessons we are learning is the fundamental unity of life at the molecular level. We are also coming to appreciate the power of comparative genomics, in which the genomes of different organisms are compared.

Figure 1.4 presents the contents of this book in the context of the three perspectives of bioinformatics.

A Consistent Example: Retinol-Binding Protein

Throughout this book we will focus on the example of a gene and its corresponding protein product: retinol-binding protein (RBP4), a small, abundant secreted protein that binds retinol (vitamin A) in blood (Newcomer and Ong, 2000). Retinol, obtained from carrots in the form of vitamin A, is very hydrophobic. RBP4 helps transport this ligand to the eye, where it is used for vision.
We will study RBP4 in detail because it has a number of interesting features:

There are many proteins that are homologous to RBP4 in a variety of species, including human, mouse, and fish ("orthologs"). We will use these as examples of how to align proteins, perform database searches, and study phylogeny (Chapters 2-11).

There are other human proteins that are closely related to RBP4 ("paralogs"). Altogether the family that includes RBP4 is called the lipocalins, a diverse group of small ligand-binding proteins that tend to be secreted into extracellular spaces (Akerstrom et al., 2000; Flower et al., 2000). Other lipocalins with fascinating functions include apolipoprotein D (which binds cholesterol), a pregnancy-associated lipocalin, aphrodisin (an "aphrodisiac" in hamsters), and an odorant-binding protein in mucus.

There are even bacterial lipocalins, which could have a role in antibiotic resistance (Bishop, 2000). We will explore how bacterial lipocalins could be ancient genes that entered eukaryotic genomes by a process called lateral gene transfer.

The gene expression levels of some lipocalins are dramatically regulated (Chapters 6 and 7).

Because the lipocalins are small, abundant, and soluble proteins, their biochemical properties have been characterized in detail. The three-dimensional protein structure has been solved for several of them by X-ray crystallography (Chapter 9).

Some lipocalins have been implicated in human disease (Chapter 18).

Another molecule we will introduce is the pol (polymerase) gene of human immunodeficiency virus 1 (HIV-1). HIV presents one of the greatest public health challenges in the world today. As of the end of 2002, over 42 million people were infected and over 16 million people had died. The HIV-1 genome encodes just nine proteins, including pol (Frankel and Young, 1998).
We will examine pol throughout the book because the properties of this gene, its protein products, and the HIV-1 genome are distinct from those of the lipocalins.

The pol gene encodes a multidomain protein: a single polypeptide with several structurally and functionally distinct domains. The pol gene encodes a protein of 1003 amino acids with reverse transcriptase activity (that is, an RNA-dependent DNA polymerase). It is also an aspartyl protease, and it has integrase activity. These multiple activities are typical of multidomain proteins. The modular nature of the pol protein affects our ability to perform database searches (Chapters 4 and 5) and multiple sequence alignments (Chapters 8 and 10).

The pol gene incorporates substitutions extremely rapidly. A typical individual infected by HIV may have over a million variants of pol. The study of the evolution of pol complements our study of the lipocalins (Chapter 11).

Because pol is a viral protein, studying it gives us the opportunity to learn how to access bioinformatics resources relevant to studying viruses (Chapter 13). Database searches with pol will help emphasize how to restrict searches to particular domains of the tree of life.

Organization of the Chapters

The chapters of this book are intended to provide both the theory of bioinformatics subjects and a practical guide to using computer databases and algorithms. Web resources are provided throughout each chapter. Chapters end with brief sections called Perspective and Pitfalls. The perspective feature describes the rate of growth of the subject matter in each chapter. For example, a perspective on Chapter 2 (access to sequence information) is that the amount of DNA sequence data deposited in GenBank is undergoing an explosive rate of growth.
In contrast, an area such as pairwise sequence alignment, which is fundamental to the entire field of bioinformatics (Chapter 3), was firmly established in the 1970s and 1980s.

The pitfalls section of each chapter describes some common difficulties encountered by biologists using bioinformatics tools. Some errors might seem trivial, such as searching a DNA database with a protein sequence. Other pitfalls are more subtle, such as artifacts caused by multiple sequence alignment programs depending upon the type of algorithm that is selected. Indeed, while the field of bioinformatics depends substantially on analyzing sequence data, it is important to recognize that there are many categories of errors associated with data generation, collection, storage, and analysis.

Each chapter offers multiple-choice quizzes, which test your understanding of the chapter materials. There are also problems that require you to apply the concepts presented in each chapter. These problems may form the basis of a computer laboratory for a bioinformatics course.

The references at the end of each chapter are accompanied by an annotated list of recommended articles. This suggested reading section includes classic papers that show how the principles described in each chapter were discovered. Particularly helpful review articles and research papers are highlighted.

Suggestions for Students and Teachers: Web Exercises and Find-a-Gene

Often, students of bioinformatics have a particular research area of interest such as a gene, a physiological process, a disease, or a genome. It is hoped that by studying RBP4 and other specific proteins and genes throughout this book, students can simultaneously apply the principles of bioinformatics to their own research questions.

In teaching a course on bioinformatics at Johns Hopkins, it has been helpful to complement lectures with computer labs.
All the websites described in this book are freely available on the World Wide Web, and many of the software packages are free for academic use.

Another feature of the Johns Hopkins course is that each student is required to discover a novel gene by the last day of the course. The student must begin with any protein sequence of interest and perform database searches to identify genomic DNA that encodes a protein no one has described before. This problem is described in Chapter 5 (see Fig. 5.17). The student thus chooses the name of the gene and its corresponding protein and describes information about the organism and evidence that the gene has not been described before. Then, the student creates a multiple sequence alignment of the new protein (or gene) and creates a phylogenetic tree showing its relation to other known sequences.

Each year, some beginning students are slightly apprehensive about accomplishing this exercise, but in the end all of them succeed. A benefit of this exercise is that it requires a student to actively use the principles of bioinformatics. Most students choose a gene (or protein) relevant to their own research area, while others find new lipocalins.

Teaching bioinformatics is notable for the diversity of students learning this new discipline. Each chapter provides background on the subject matter. For more advanced students, several key research papers are cited at the end of each chapter. These papers are technical, and reading them along with the chapters will provide a deeper understanding of the material. The suggested reading section also includes review articles.

Continues...

Excerpted from Bioinformatics and Functional Genomics by Jonathan Pevsner. Copyright © 2003 by Wiley-Liss. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc.
solely for the personal use of visitors to this web site.

Foreword
Preface

Part I: Analyzing DNA, RNA, and Protein Sequences in Databases
1. Introduction
2. Access to Sequence Data and Literature Information
3. Pairwise Sequence Alignment
4. Basic Local Alignment Search Tool (BLAST)
5. Advanced BLAST Searching

Part II: Genomewide Analysis of RNA and Protein
6. Bioinformatic Approaches to Gene Expression
7. Gene Expression: Microarray Data Analysis
8. Protein Analysis and Proteomics
9. Protein Structure
10. Multiple Sequence Alignment
11. Molecular Phylogeny and Evolution

Part III: Genome Analysis
12. Completed Genomes and the Tree of Life
13. Completed Genomes: Viruses
14. Completed Genomes: Bacteria and Archaea
15. Eukaryotic Genomes: Fungi
16. Eukaryotic Genomes: From Parasites to Primates
17. Human Genome
18. Human Disease

Epilogue
Appendix: GCG for Protein and DNA Analysis
Glossary
Solutions to Self-Test Quizzes
Subject Index
Author Index