Handbook of Statistical Analysis and Data Mining Applications

Hardcover

Author: Robert Nisbet

ISBN-10: 0123747651

ISBN-13: 9780123747655

Category: Data Warehousing & Mining

The Handbook of Statistical Analysis and Data Mining Applications is a comprehensive professional reference book that guides business analysts, scientists, engineers and researchers (both academic and industrial) through all stages of data analysis, model building and implementation. The Handbook helps one discern the technical and business problem, understand the strengths and weaknesses of modern data mining algorithms, and employ the right statistical methods for practical application. Use this book to address massive and complex datasets with novel statistical approaches and be able to objectively evaluate analyses and solutions. It has clear, intuitive explanations of the principles and tools for solving problems using modern analytic techniques, and discusses their application to real problems, in ways accessible and beneficial to practitioners across industries - from science and engineering, to medicine, academia and commerce. This handbook brings together, in a single resource, all the information a beginner will need to understand the tools and issues in data mining to build successful data mining solutions.

- Written "By Practitioners for Practitioners"
- Non-technical explanations build understanding without jargon and equations
- Tutorials in numerous fields of study provide step-by-step instruction on how to use supplied tools to build models using Statistica, SAS and SPSS software
- Practical advice from successful real-world implementations
- Includes extensive case studies, examples, MS PowerPoint slides and datasets
- CD-DVD with valuable fully-working 90-day software included: "Complete Data Miner - QC-Miner - Text Miner" bound with book

Data mining practitioners, here is your bible, the complete "driver's manual" for data mining. From starting the engine to handling the curves, this book covers the gamut of data mining techniques - including predictive analytics and text mining - illustrating how to achieve maximal value across business, scientific, engineering and medical applications. What are the best practices through each phase of a data mining project? How can you avoid the most treacherous pitfalls? The answers are in here. Going beyond its responsibility as a reference book, this resource also provides detailed tutorials with step-by-step instructions to drive established data mining software tools across real world applications. This way, newcomers start their engines immediately and experience hands-on success. If you want to roll-up your sleeves and execute on predictive analytics, this is your definite, go-to resource. To put it lightly, if this book isn't on your shelf, you're not a data miner.
- Eric Siegel, Ph.D., President, Prediction Impact, Inc. and Founding Chair, Predictive Analytics World

HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS
By Robert Nisbet, John Elder, and Gary Miner
Academic Press
Copyright © 2009 Elsevier Inc. All rights reserved.
ISBN: 978-0-08-091203-5

Chapter One
The Background for Data Mining Practice

OUTLINE
Preamble
A Short History of Statistics and Data Mining
Modern Statistics: A Duality?
Two Views of Reality
The Rise of Modern Statistical Analysis: The Second Generation
Machine Learning Methods: The Third Generation
Statistical Learning Theory: The Fourth Generation
Postscript

PREAMBLE

You must be interested in learning how to practice data mining; otherwise, you would not be reading this book. We know that there are many books available that will give a good introduction to the process of data mining. Most books on data mining focus on the features and functions of various data mining tools or algorithms. Some books do focus on the challenges of performing data mining tasks. This book is designed to give you an introduction to the practice of data mining in the real world of business.

One of the first things considered in building a business data mining capability in a company is the selection of the data mining tool. It is difficult to penetrate the hype erected around the description of these tools by the vendors. The fact is that even the most mediocre of data mining tools can create models that are at least 90% as good as the best tools. A 90% solution performed with a relatively cheap tool might be more cost-effective in your organization than a more expensive tool. How do you choose your data mining tool? Few reviews are available. The best listing of tools by popularity is maintained and updated yearly by KDNuggets.com. Some detailed reviews available in the literature go beyond just a discussion of the features and functions of the tools (see Nisbet, 2006, Parts 1–3). The interest in an unbiased and detailed comparison is great. We are told the "most downloaded document in data mining" is the comprehensive but decade-old tool review by Elder and Abbott (1998).

The other considerations in building a business's data mining capability are forming the data mining team, building the data mining platform, and forming a foundation of good data mining practice. This book will not discuss the building of the data mining platform. This subject is discussed in many other books, some in great detail. A good overview of how to build a data mining platform is presented in Data Mining: Concepts and Techniques (Han and Kamber, 2006). The primary focus of this book is to present a practical approach to building cost-effective data mining models aimed at increasing company profitability, using tutorials and demo versions of common data mining tools.

Just as important as these considerations in practice is the background against which they must be performed. We must not imagine that the background doesn't matter ... it does matter, whether or not we recognize it initially. The reason it matters is that the capabilities of statistical and data mining methodology were not developed in a vacuum. Analytical methodology was developed in the context of prevailing statistical and analytical theory. But the major driver in this development was a very pressing need to provide a simple and repeatable analysis methodology in medical science. From this beginning developed modern statistical analysis and data mining.
To understand the strengths and limitations of this body of methodology and use it effectively, we must understand the strengths and limitations of the statistical theory from which it developed. This theory was developed by scientists and mathematicians who "thought" it out. But this thinking was not one-sided or unidirectional; there arose several views on how to solve analytical problems. To understand how to approach the solving of an analytical problem, we must understand the different ways different people tend to think. The history of the statistical theory behind the development of various statistical techniques bears strongly on the ability of a technique to serve the tasks of a data mining project.

A SHORT HISTORY OF STATISTICS AND DATA MINING

Analysis of patterns in data is not new. The concepts of average and grouping can be dated back to the 6th century BC in Ancient China, following the invention of the bamboo rod abacus (Goodman, 1968). In Ancient China and Greece, statistics were gathered to help heads of state govern their countries in fiscal and military matters. (This makes you wonder if the words statistic and state might have sprung from the same root.) In the sixteenth and seventeenth centuries, games of chance were popular among the wealthy, prompting many questions about probability to be addressed to famous mathematicians (Fermat, Leibnitz, etc.). These questions led to much research in mathematics and statistics during the ensuing years.

MODERN STATISTICS: A DUALITY?

Two branches of statistical analysis developed in the eighteenth century: Bayesian and classical statistics. (See Figure 1.1.) To treat both fairly in the context of history, we will consider both in the First Generation of statistical analysis. For the Bayesians, the probability of an event's occurrence is equal to the probability of its past occurrence times the likelihood of its occurrence in the future. Analysis proceeds based on the concept of conditional probability: the probability of an event occurring given that another event has already occurred. Bayesian analysis begins with the quantification of the investigator's existing state of knowledge, beliefs, and assumptions. These subjective priors are combined with observed data quantified probabilistically through an objective function of some sort. The classical statistical approach (which flowed out of the mathematical works of Gauss and Laplace) considered that the joint probability, rather than the conditional probability, was the appropriate basis for analysis. The joint probability function expresses the probability that simultaneously X takes the specific value x and Y takes the value y, as a function of x and y.

Interest in probability picked up early among biologists following Mendel in the latter part of the nineteenth century. Sir Francis Galton, founder of the School of Eugenics in England, and his successor, Karl Pearson, developed the concepts of regression and correlation for analyzing genetic data. Later, Pearson and colleagues extended their work to the social sciences. Following Pearson, Sir R. A. Fisher in England developed his system for inference testing in medical studies based on his concept of standard deviation. While the development of probability theory flowed out of the work of Galton and Pearson, early predictive methods followed Bayes's approach. Bayesian approaches to inference testing could lead to widely different conclusions by different medical investigators because they used different sets of subjective priors.
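As a brief aside that is not part of the excerpt, the short Python sketch below illustrates the two notions contrasted in this section - joint versus conditional probability - and shows how two investigators analyzing the same evidence with different subjective priors reach different Bayesian conclusions. All numbers and names in it are invented purely for illustration.

```python
# Illustrative sketch (not from the book): joint vs. conditional probability,
# and the effect of different subjective priors on a Bayesian conclusion.

def bayes_posterior(prior, likelihood_if_true, likelihood_if_false):
    """Posterior P(hypothesis | evidence) from Bayes' rule."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# Joint vs. conditional probability: P(A and B) = P(A | B) * P(B)
p_b = 0.4          # P(B)
p_a_given_b = 0.5  # P(A | B), the conditional probability
print(f"joint P(A and B) = {p_a_given_b * p_b:.2f}")

# Two investigators, the same observed evidence, different subjective priors:
for prior in (0.1, 0.5):
    post = bayes_posterior(prior, likelihood_if_true=0.9, likelihood_if_false=0.2)
    print(f"prior {prior:.1f} -> posterior {post:.2f}")
```

With priors of 0.1 and 0.5, the same evidence yields posteriors of roughly 0.33 and 0.82, which is exactly the kind of divergence among investigators that Fisher set out to avoid.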
Fisher's goal in developing his system of statistical inference was to provide medical investigators with a common set of tools for use in comparison studies of effects of different treatments by different investigators. But to make his system work even with large samples, Fisher had to make a number of assumptions to define his "Parametric Model."

Assumptions of the Parametric Model

1. Data Fits a Known Distribution (e.g., Normal, Logistic, Poisson, etc.)

Fisher's early work was based on calculation of the parameter standard deviation, which assumes that data are distributed in a normal distribution. The normal distribution is bell-shaped, with the mean (average) at the top of the bell and "tails" falling off evenly at the sides. Standard deviation is simply the "average" of the absolute deviation of a value from the mean. In this calculation, however, averaging is accomplished by dividing the sum of the absolute deviations by the total number of values minus 1. This subtraction expresses (to some extent) the increased uncertainty of the result due to grouping (summing the absolute deviations). Subsequent developments used modified parameters based on the logistic or Poisson distributions. The assumption of a particular known distribution is necessary in order to draw upon the characteristics of the distribution function for making inferences. All of these parametric methods run the gauntlet of dangers related to force-fitting data from the real world into a mathematical construct that does not fit.

2. Factor Independency

In parametric predictive systems, the variable to be predicted (Y) is considered as a function of predictor variables (X's) that are assumed to have independent effects on Y. That is, the effect on Y of each X-variable is not dependent on the effects on Y of any other X-variable. This situation could be created in the laboratory by allowing only one factor (e.g., a treatment) to vary, while keeping all other factors constant (e.g., temperature, moisture, light, etc.). But in the real world, such laboratory control is absent. As a result, some factors that do affect other factors are permitted to have a joint effect on Y. This problem is called collinearity. When it occurs between more than two factors, it is termed multicollinearity. The multicollinearity problem led statisticians to use an interaction term in the relationship that supposedly represented the combined effects. Use of this interaction term functioned as a magnificent kluge, and the reality of its effects was seldom analyzed. Later developments included a number of interaction terms, one for each interaction the investigator might be presenting.

3. Linear Additivity

Not only must the X-variables be independent, their effects on Y must be cumulative and linear. That means the effect of each factor is added to or subtracted from the combined effects of all X-variables on Y. But what if the relationship between Y and the predictors (X-variables) is not additive, but multiplicative or divisive? Such functions can be expressed only by exponential equations that usually generate very nonlinear relationships. Assumption of linear additivity for these relationships may cause large errors in the predicted outputs. This is often the case with their use in business data systems.

4. Constant Variance (Homoscedasticity)

The variance throughout the range of each variable is assumed to be constant. This means that if you divided the range of a variable into bins, the variance across all records for bin #1 would be the same as the variance for all the other bins in the range of that variable. If the variance throughout the range of a variable differs significantly from constancy, it is said to be heteroscedastic. The error in the predicted value caused by the combined heteroscedasticity among all variables can be quite significant.

5. Variables Must Be Numerical and Continuous

The assumption that variables must be numerical and continuous means that data must be numeric (or transformable to a number before analysis) and the number must be part of a distribution that is inherently continuous. Integer values in a string are not continuous; they are discrete. Classical parametric statistical methods are not valid for use with discrete data, because the probability distributions for continuous and discrete data are different. But both scientists and business analysts have used them anyway.
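To make assumptions 1 and 4 concrete, here is a minimal Python sketch, not taken from the book, that computes a sample standard deviation with the (n - 1) denominator described above and runs a simple bin-wise variance check for homoscedasticity. The function names and data are illustrative only; note that the conventional sample standard deviation sums squared deviations rather than absolute deviations.

```python
# Illustrative sketch (not from the book) of two parametric-model checks,
# using only the Python standard library.

import statistics

def sample_std(values):
    """Sample standard deviation with the (n - 1) denominator noted in the text."""
    n = len(values)
    mean = sum(values) / n
    return (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5

def binwise_variances(xs, ys, n_bins=4):
    """Split the range of x into equal-width bins and report the variance of y
    within each bin; roughly equal variances suggest homoscedasticity."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0   # guard against a zero-width range
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), n_bins - 1)
        bins[i].append(y)
    return [statistics.variance(b) if len(b) > 1 else None for b in bins]

if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9, 9.4]
    print(sample_std(y))            # agrees with statistics.stdev(y)
    print(binwise_variances(x, y))  # look for roughly equal values per bin
```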
In his landmark paper, Fisher (1921; see Figure 1.2) began with the broad definition of probability as the intrinsic probability of an event's occurrence divided by the probability of occurrence of all competing events (very Bayesian). By the end of his paper, Fisher modified his definition of probability for use in medical analysis (the goal of his research) as the intrinsic probability of an event's occurrence, period. He named this quantity likelihood. From that foundation, he developed the concepts of standard deviation based on the normal distribution. Those who followed Fisher began to refer to likelihood as probability. The concept of likelihood approaches the classical concept of probability only as the sample size becomes very large and the effects of subjective priors approach zero (von Mises, 1957). In practice, these two conditions may be satisfied sufficiently if the initial distribution of the data is known and the sample size is relatively large (following the Law of Large Numbers).

Why did this duality of thought arise in the development of statistics? Perhaps it is because of the broader duality that pervades all of human thinking. This duality can be traced all the way back to the ancient debate between Plato and Aristotle.

TWO VIEWS OF REALITY

Whenever we consider solving a problem or answering a question, we start by conceptualizing it. That means we do one of two things: (1) try to reduce it to key elements or (2) try to conceive of it in general terms. We call people who take each of these approaches "detail people" and "big picture people," respectively. What we don't consider is that this distinction has its roots deep in Greek philosophy in the works of Aristotle and Plato.

Aristotle

Aristotle (Figure 1.3) believed that the true being of things (reality) could be discerned only by what the eye could see, the hand could touch, etc. He believed that the highest level of intellectual activity was the detailed study of the tangible world around us. Only in that way could we understand reality. Based on this approach to truth, Aristotle was led to believe that you could break down a complex system into pieces, describe the pieces in detail, put the pieces together, and understand the whole. For Aristotle, the "whole" was equal to the sum of its parts. This nature of the whole was viewed by Aristotle in a manner that was very machine-like. Science gravitated toward Aristotle very early.
The nature of the world around us was studied by looking very closely at the physical elements and biological units (species) that composed it. As our understanding of the natural world matured into the concept of the ecosystem, it was discovered that many characteristics of ecosystems could not be explained by traditional (Aristotelian) approaches. For example, in the science of forestry, we discovered that when a tropical rain forest is cut down on the periphery of its range, it may take a very long time to regenerate (if it does at all). We learned that the reason for this is that in areas of relative stress (e.g., peripheral areas), the primary characteristics necessary for the survival and growth of tropical trees are maintained by the forest itself! High rainfall leaches nutrients down beyond the reach of the tree roots, so almost all of the nutrients for tree growth must come from recently fallen leaves and branches. When you cut down the forest, you remove that source of nutrients. The forest canopy also maintains favorable conditions of light, moisture, and temperature required by the trees. Removing the forest removes the very factors necessary for it to exist at all in that location. These factors emerge only when the system is whole and functioning. Many complex systems are like that, even business systems. In fact, these emergent properties may be the major drivers of system stability and predictability.

To understand the failure of Aristotelian philosophy for completely defining the world, we must return to Ancient Greece and consider Aristotle's rival, Plato.

Plato

Plato (Figure 1.4) was Aristotle's teacher for 20 years, and they both agreed to disagree on the nature of being. While Aristotle focused on describing tangible things in the world by detailed studies, Plato focused on the world of ideas that lay behind these tangibles. For Plato, the only thing that had lasting being was an idea. He believed that the most important things in human existence were beyond what the eye could see and the hand could touch. Plato believed that the influence of ideas transcended the world of tangible things that commanded so much of Aristotle's interest. For Plato, the "whole" of reality was greater than the sum of its tangible parts.

(Continues...)

Excerpted from HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS by Robert Nisbet, John Elder, and Gary Miner. Copyright © 2009 by Elsevier Inc. Excerpted by permission of Academic Press. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher. Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Preface
Forewords (Dean Abbott and Tony Lachenbruch)
Introduction

PART I: History of Phases of Data Analysis, Basic Theory, and the Data Mining Process
Chapter 1. History - The Phases of Data Analysis throughout the Ages
Chapter 2. Theory
Chapter 3. The Data Mining Process
Chapter 4. Data Understanding and Preparation
Chapter 5. Feature Selection - Selecting the Best Variables
Chapter 6. Accessory Tools and Advanced Features in Data

PART II: The Algorithms in Data Mining and Text Mining, and the Organization of the Three Most Common Data Mining Tools
Chapter 7. Basic Algorithms
Chapter 8. Advanced Algorithms
Chapter 9. Text Mining
Chapter 10. Organization of 3 Leading Data Mining Tools
Chapter 11. Classification Trees = Decision Trees
Chapter 12. Numerical Prediction (Neural Nets and GLM)
Chapter 13. Model Evaluation and Enhancement
Chapter 14. Medical Informatics
Chapter 15. Bioinformatics
Chapter 16. Customer Response Models
Chapter 17. Fraud Detection

PART III: Tutorials - Step-by-Step Case Studies as a Starting Point to Learn How to Do Data Mining Analyses
Listing of Guest Authors of the Tutorials
Tutorials within the book pages:
- How to Use the DMRecipe
- Aviation Safety using DMRecipe
- Movie Box-Office Hit Prediction using SPSS CLEMENTINE
- Bank Financial Data using SAS-EM
- Credit Scoring
- CRM Retention using CLEMENTINE
- Automobile - Cars - Text Mining
- Quality Control using Data Mining
- Three integrated tutorials from different domains, but all using C&RT to predict and display possible structural relationships among data:
  - Business Administration in a Medical Industry
  - Clinical Psychology - Finding Predictors of Correct Diagnosis
  - Education - Leadership Training for Business and Education
Additional tutorials are available either on the accompanying CD-DVD or on the Elsevier Web site for this book
Listing of Tutorials on Accompanying CD

PART IV: Paradox of Complex Models; Using the "Right Model for the Right Use", On-going Development, and the Future
Chapter 18. Paradox of Ensembles and Complexity
Chapter 19. The Right Model for the Right Use
Chapter 20. The Top 10 Data Mining Mistakes
Chapter 21. Prospect for the Future - Developing Areas in Data Mining
Chapter 22. Summary

GLOSSARY of STATISTICAL and DATA MINING TERMS
INDEX
CD - With Additional Tutorials, data sets, PowerPoint slides, and Data Mining software (STATISTICA Data Miner & Text Miner & QC-Miner - 90-day free trial)

From the Publisher

"Great introduction to the real-world process of data mining. The overviews, practical advice, tutorials, and extra CD material make this book an invaluable resource for both new and experienced data miners."
- Karl Rexer, PhD (President & Founder of Rexer Analytics, Boston, Massachusetts)

"If you want to roll-up your sleeves and execute on predictive analytics, this is your definite, go-to resource. To put it lightly, if this book isn't on your shelf, you're not a data miner."
- Eric Siegel, Ph.D., President, Prediction Impact, Inc. and Founding Chair, Predictive Analytics World