












































































Students at autonomous colleges can use these notes as study material for Data Warehouse and Mining theory.
Typology: Lecture notes
JNTU World
Module – I
Data Mining overview, Data Warehouse and OLAP Technology, Data Warehouse Architecture, Steps for the Design and Construction of Data Warehouses, A Three-Tier Data Warehouse Architecture, OLAP, OLAP queries, metadata repository, Data Preprocessing – Data Integration and Transformation, Data Reduction, Data Mining Primitives: What Defines a Data Mining Task? Task-Relevant Data, The Kind of Knowledge to be Mined, KDD
Module – II
Mining Association Rules in Large Databases, Association Rule Mining, Market Basket Analysis: Mining A Road Map, The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation, Generating Association Rules from Frequent Itemsets, Improving the Efficiency of Apriori, Mining Frequent Itemsets without Candidate Generation, Multilevel Association Rules, Approaches to Mining Multilevel Association Rules, Mining Multidimensional Association Rules for Relational Database and Data Warehouses, Multidimensional Association Rules, Mining Quantitative Association Rules, Mining Distance-Based Association Rules, From Association Mining to Correlation Analysis
Module – III
What is Classification? What Is Prediction? Issues Regarding Classification and Prediction, Classification by Decision Tree Induction, Bayesian Classification, Bayes Theorem, Naïve Bayesian Classification, Classification by Backpropagation, A Multilayer Feed-Forward Neural Network, Defining a Network Topology, Classification Based on Concepts from Association Rule Mining, Other Classification Methods, k-Nearest Neighbor Classifiers, Genetic Algorithms, Rough Set Approach, Fuzzy Set Approaches, Prediction, Linear and Multiple Regression, Nonlinear Regression, Other Regression Models, Classifier Accuracy
Module – IV
What Is Cluster Analysis, Types of Data in Cluster Analysis, A Categorization of Major Clustering Methods, Classical Partitioning Methods: k-Means and k-Medoids, Partitioning Methods in Large Databases: From k-Medoids to CLARANS, Hierarchical Methods, Agglomerative and Divisive Hierarchical Clustering, Density-Based Methods, WaveCluster: Clustering Using Wavelet Transformation, CLIQUE: Clustering High-Dimensional Space, Model-Based Clustering Methods, Statistical Approach, Neural Network Approach.
DEPT OF CSE & IT
Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly and quickly from the data. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
Data mining involves six common classes of tasks:
Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation.
Association rule learning (Dependency modelling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering – The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression – Attempts to find a function which models the data with the least error.
Summarization – Providing a more compact representation of the data set, including visualization and report generation.
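As an illustration of the market basket idea mentioned above, the following is a minimal sketch that counts how often pairs of items appear in the same transaction. The transactions and the function name `pair_support` are hypothetical, chosen only for illustration; real association rule miners (such as Apriori, covered in Module II) prune the search rather than enumerating every pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (not taken from the notes).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def pair_support(baskets):
    """Return, for each item pair, the fraction of baskets containing both items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


support = pair_support(transactions)
```

Here `support[("bread", "milk")]` is the share of baskets that contain both bread and milk; pairs with high support are candidates for "frequently bought together".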
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies,
Data Mining is a process of discovering various models, summaries, and derived values from a given collection of data. The general experimental procedure adapted to data-mining problems involves the following steps:
Most data-based modeling studies are performed in a particular application domain. Hence, domain-specific knowledge and experience are usually necessary in order to come up with a meaningful problem statement. Unfortunately, many application studies tend to focus on the data-mining technique at the expense of a clear problem statement. In this step, a modeler usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. There may be several hypotheses formulated for a single problem at this stage. The first step requires the combined expertise of an application domain and a data-mining model. In practice, it usually means a close interaction between the data-mining expert and the application expert. In successful data-mining applications, this cooperation does not stop in the initial phase; it continues during the entire data-mining process.
The next step is concerned with how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler): this approach is known as a designed experiment. The second possibility is when the expert cannot influence the data-generation process: this is known as the observational approach. An observational setting, namely, random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after data are collected, or it is partially and implicitly given in the data-collection procedure. It is very important, however, to understand how data collection affects its theoretical distribution, since such a priori knowledge can be very useful for modeling and, later, for the final interpretation of results. Also, it is important to make sure that the data used for estimating a model and the data used later for testing and applying a model come from the same, unknown, sampling distribution. If this is not the case, the estimated model cannot be successfully used in a final application of the results.
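One common way to keep the estimation data and the testing data on the same sampling distribution is to shuffle a single collected sample and split it. The sketch below assumes that a simple random split is appropriate for the data at hand (it is not for, say, time-ordered data); the function name `random_split` and its parameters are illustrative only.

```python
import random


def random_split(records, test_fraction=0.25, seed=0):
    """Shuffle one collected sample and split it into (train, test) parts."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: 100 record identifiers.
train, test = random_split(range(100))
```

Because both parts are drawn from the same shuffled sample, neither is systematically different from the other, which is the condition the paragraph above asks for.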
In the observational setting, data are usually "collected" from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks:
1. Outlier detection (and removal) – Outliers are unusual data values that are not consistent with most observations. Commonly, outliers result from measurement errors, coding and recording errors, and, sometimes, are natural, abnormal values. Such nonrepresentative samples can seriously affect the model produced later. There are two strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.
2. Scaling, encoding, and selecting features – Data preprocessing includes several steps such as variable scaling and different types of encoding. For example, one feature with the range [0, 1] and the other with the range [−100, 1000] will not have the same weights in the applied technique; they will also influence the final data-mining results differently. Therefore, it is recommended to scale them and bring both features to the same weight for further analysis. Also, application-specific encoding methods usually achieve
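The two preprocessing tasks listed above can be sketched as follows. This is a minimal illustration, not a prescribed method: min-max scaling is only one of several scaling schemes, and the z-score rule with a threshold of 2.0 is an assumption made for this small sample. The sample values and function names are hypothetical.

```python
from statistics import mean, stdev


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def flag_outliers(values, threshold=2.0):
    """Mark values lying more than `threshold` sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) > threshold * s for v in values]


# A feature with the range [-100, 1000] brought to the range [0, 1].
scaled = min_max_scale([-100, 0, 450, 1000])

# Hypothetical measurements with one suspicious value at the end.
flags = flag_outliers([10, 11, 9, 10, 12, 10, 11, 9, 10, 500])
```

Note that a single extreme value also inflates the mean and standard deviation, so in small samples a strict threshold can fail to flag it; this is one reason robust methods (strategy b) are sometimes preferred.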
techniques to validate the results. A user does not want hundreds of pages of numeric results, because such output cannot be understood, summarized, interpreted, or used for successful decision making.
A data mining system can be classified according to the disciplines it draws on: database technology, statistics, machine learning, information science, visualization, and other disciplines.
It can also be classified according to the following criteria: the kind of databases mined, the kind of knowledge mined, the kinds of techniques utilized, and the applications adapted.
We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models and types of data, and the data mining system can be classified accordingly. For example, if we classify the database according to data model, we may have a relational, transactional, object-relational, or data warehouse mining system.
We can classify a data mining system according to the kind of knowledge mined, that is, on the basis of functionalities such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.
Data mining query languages and ad hoc data mining. - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results. - Once patterns are discovered, they need to be expressed in high-level languages and visual representations that users can easily understand.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. Patterns must be evaluated because many of them are uninteresting: they either represent common knowledge or lack novelty.
Efficiency and scalability of data mining algorithms. - In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the data again from scratch.
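The partition, process, and merge pattern described in the last item, together with an incremental update, can be sketched as below. This is an illustrative assumption rather than a specific published algorithm: the mining task is reduced to counting item frequencies, and a thread pool stands in for the separate processors or machines a real system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Count, for one partition, how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))   # count each item once per transaction
    return counts


def partitioned_counts(transactions, n_partitions=2):
    """Split the data, count each partition concurrently, then merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    merged = Counter()
    for partial in partial_counts:
        merged.update(partial)            # merging is just adding the counts
    return merged


# Hypothetical transactions already in the database.
existing = [["a", "b"], ["a", "c"], ["b", "c"], ["a"]]
counts = partitioned_counts(existing)

# Incremental step: fold in newly arrived transactions without recounting the old ones.
new_batch = [["a", "d"]]
counts.update(count_items(new_batch))
```

The approach works here because item counts from disjoint partitions can simply be added; mining tasks whose partial results do not combine this easily need a more careful merge step.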
1.8 Knowledge Discovery in Databases (KDD)
Some people treat data mining as synonymous with knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. The steps involved in the knowledge discovery process are:
Data Cleaning - In this step, noise and inconsistent data are removed.
Data Integration - In this step, multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation - In this step, data patterns are evaluated.
Knowledge Presentation - In this step, knowledge is represented.
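The sequence of steps above can be pictured as a chain of small functions, each feeding the next. The sketch below is only a toy illustration under stated assumptions: the records, field names, and function names are hypothetical, and the "mining" step is reduced to finding the best-selling item rather than applying a real mining method.

```python
from collections import Counter

# Hypothetical sales records from two sources; a missing quantity plays the role of noise.
store_a = [{"customer": "c1", "item": "milk", "qty": 2},
           {"customer": "c2", "item": "bread", "qty": None}]
store_b = [{"customer": "c1", "item": "bread", "qty": 1},
           {"customer": "c3", "item": "milk", "qty": 3}]


def clean(records):
    """Data cleaning: drop records with a missing quantity."""
    return [r for r in records if r["qty"] is not None]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, items):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["item"] in items]


def transform(records):
    """Data transformation: aggregate quantities per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals


def mine(totals):
    """Data mining (stand-in): extract the item with the largest total."""
    return totals.most_common(1)[0]


def evaluate(pattern, min_qty):
    """Pattern evaluation: keep the pattern only if it is strong enough."""
    return pattern[1] >= min_qty


def present(pattern):
    """Knowledge presentation: render the pattern for a human reader."""
    return f"Best-selling item: {pattern[0]} ({pattern[1]} units)"


records = integrate(clean(store_a), clean(store_b))
pattern = mine(transform(select(records, {"milk", "bread"})))
report = present(pattern) if evaluate(pattern, min_qty=3) else "No interesting pattern"
```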
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even earlier from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
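The customer address example can be made concrete with two small classes: one that overwrites, as a transaction system would, and one that only appends, as a time-variant, non-volatile warehouse would. The class and method names below are hypothetical and exist only to illustrate the contrast.

```python
from datetime import date


class TransactionSystem:
    """Keeps only the most recent address of each customer."""

    def __init__(self):
        self._address = {}

    def set_address(self, customer, address):
        self._address[customer] = address      # the old value is overwritten

    def address(self, customer):
        return self._address[customer]


class WarehouseHistory:
    """Keeps every address ever loaded; existing rows are never changed."""

    def __init__(self):
        self._rows = []

    def load(self, customer, address, valid_from):
        self._rows.append((customer, address, valid_from))   # append only

    def addresses(self, customer):
        return [(addr, start) for cust, addr, start in self._rows if cust == customer]


oltp = TransactionSystem()
warehouse = WarehouseHistory()
for address, moved_in in [("12 Park Road", date(2020, 1, 1)),
                          ("7 Lake View", date(2023, 6, 1))]:
    oltp.set_address("c1", address)
    warehouse.load("c1", address, moved_in)
```

After both moves, `oltp.address("c1")` returns only the latest address, while `warehouse.addresses("c1")` returns both addresses with their start dates.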
A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.
The top-down approach starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.
The warehouse design process consists of the following steps:
1. Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.
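The design choices above (process, grain, dimensions, measures) can be illustrated with a tiny in-memory fact table. This sketch assumes the business process is sales and the grain is one individual transaction per record; the field names, sample rows, and the helper `total_by` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    """One fact table record at the grain of an individual sales transaction."""
    # Dimensions: the context in which the sale happened.
    day: str
    item: str
    customer: str
    # Measures: numeric, additive quantities.
    dollars_sold: float
    units_sold: int


facts = [
    SalesFact("2024-01-01", "pen", "c1", 3.0, 2),
    SalesFact("2024-01-01", "ink", "c2", 5.0, 1),
    SalesFact("2024-01-02", "pen", "c2", 1.5, 1),
]


def total_by(records, dimension, measure):
    """Sum one measure over all records, grouped by one dimension."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, dimension)] += getattr(record, measure)
    return dict(totals)


dollars_per_item = total_by(facts, "item", "dollars_sold")
units_per_day = total_by(facts, "day", "units_sold")
```

Because the measures are additive, the same records can be summed along any dimension, which is why additive quantities such as dollars sold and units sold are the typical choice.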
supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations, or a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
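To give a feel for how a ROLAP-style server can map a multidimensional operation onto standard relational operations, the sketch below expresses a simple aggregation along one dimension as a SQL `GROUP BY` query. It uses Python's built-in `sqlite3` module purely as a convenient relational engine; the table, column names, and data are hypothetical, and a real OLAP server does considerably more (for example, precomputation and indexing).

```python
import sqlite3

# An in-memory relational table standing in for warehouse data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, quarter TEXT, item TEXT, dollars REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Delhi", "Q1", "pen", 10.0),
        ("Delhi", "Q2", "pen", 20.0),
        ("Pune", "Q1", "ink", 5.0),
        ("Pune", "Q1", "pen", 7.0),
    ],
)

DIMENSIONS = {"city", "quarter", "item"}


def aggregate_by(connection, dimension):
    """Total dollars along one dimension, expressed as a relational GROUP BY."""
    if dimension not in DIMENSIONS:          # guard: the name is spliced into the SQL text
        raise ValueError(f"unknown dimension: {dimension}")
    query = (
        f"SELECT {dimension}, SUM(dollars) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return connection.execute(query).fetchall()


by_city = aggregate_by(conn, "city")
by_quarter = aggregate_by(conn, "quarter")
```

A front-end tool in the top tier would issue requests like "total sales by city", and the middle tier would translate them into queries of this kind against the bottom-tier database.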
1.9.3 Data Warehouse Models:
There are three data warehouse models.
An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.
A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years. However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.