Data Mining Techniques
With the development of information technology, a large number of databases and huge amounts of data have been generated in various areas. Research in databases and information technology has given rise to approaches for storing and manipulating this valuable data for further decision making. Data mining is the process of extracting useful information and patterns from large amounts of data. It is also called the knowledge discovery process, knowledge mining from data, knowledge extraction, or data/pattern analysis.
Data mining is a logical process that searches large amounts of raw data for useful information. The main goal of this technique is to find previously unknown patterns. Once these patterns are found, they can be used to support decisions in machine learning and predictive analysis.
Data mining involves three steps:
A. Exploration: First, the data is cleaned and transformed, the important variables are selected, and the nature of the data is determined based on the problem.
B. Pattern Identification: Once the data has been explored, refined, and defined for the specific variables, the second step is pattern identification: identify and choose the patterns that make the best predictions.
C. Deployment: Finally, the patterns are put into use to achieve the desired outcome.
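The three steps can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed implementation: the records, the field names (age, income, bought) and the helper functions explore, identify_pattern and deploy are all hypothetical, and the "pattern" is assumed to be a single threshold rule.

```python
# Hypothetical records: (age, income, bought); None marks a missing value.
raw = [
    (25, 30000, False),
    (47, None, True),   # incomplete record, dropped during exploration
    (52, 82000, True),
    (33, 41000, False),
    (61, 90000, True),
    (29, 38000, False),
]
COLUMNS = ["age", "income"]


def explore(records):
    """Step A: clean the data by dropping records with missing values."""
    return [r for r in records if None not in r]


def identify_pattern(records):
    """Step B: try threshold rules on each variable and keep the most accurate."""
    best = None
    for col, name in enumerate(COLUMNS):
        for threshold in sorted({r[col] for r in records}):
            # Rule under test: "bought" is True when the value >= threshold.
            hits = sum((r[col] >= threshold) == r[2] for r in records)
            accuracy = hits / len(records)
            if best is None or accuracy > best[2]:
                best = (name, threshold, accuracy)
    return best


def deploy(pattern, record):
    """Step C: apply the chosen pattern to a new, unseen record."""
    name, threshold, _ = pattern
    return record[name] >= threshold


clean = explore(raw)
pattern = identify_pattern(clean)
print(pattern)                                        # ('age', 52, 1.0)
print(deploy(pattern, {"age": 58, "income": 75000}))  # True
```

On this toy data the rule "age >= 52" separates buyers from non-buyers perfectly, so it is the pattern carried forward to deployment.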
Data Mining Algorithms And Techniques
Knowledge is discovered from available databases using different kinds of algorithms and techniques, such as classification, clustering, regression, artificial intelligence, neural networks, association rules, decision trees, genetic algorithms, and the nearest neighbour method.
A. Classification
Classification is a data mining technique that assigns categories to a collection of data in order to aid more accurate prediction and analysis. One of its several methods is the decision tree. The goal is to create a set of classification rules that will answer a question, make a decision, or predict behavior. To start, a set of training data is developed that contains a certain set of attributes as well as the likely outcome. The job of the classification algorithm is to discover how that set of attributes leads to its outcome. Different types of classification models include decision tree induction, neural networks, and support vector machines.
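As a rough illustration of learning classification rules from training data, the sketch below builds one rule per attribute value and keeps the attribute whose rules make the fewest errors (similar in spirit to a one-level decision tree). The weather-style data set and the names training, learn_rules and classify are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical training data: attributes plus the known outcome ("play").
training = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]


def learn_rules(rows, target):
    """Pick the single attribute whose value -> majority-class rules err least."""
    best_attr, best_rules, best_errors = None, None, None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        # Count outcomes for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per attribute value: predict the most common outcome.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[row[attr]] != row[target] for row in rows)
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules


def classify(row, attr, rules):
    """Apply the learned rules to a new row."""
    return rules[row[attr]]


attr, rules = learn_rules(training, "play")
print(attr, rules)
print(classify({"outlook": "rainy", "windy": "no"}, attr, rules))  # no
```

A full decision tree would repeat this attribute selection recursively on each subset of the data; the single level shown here is enough to see how rules are derived from attributes and known outcomes.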
B. Clustering
Clustering is the identification of similar classes of objects. Using clustering techniques, we can identify dense and sparse regions in the object space and discover overall distribution patterns and correlations among data attributes. Classification can also be an effective means of distinguishing groups or classes of objects, but it becomes costly, so clustering can be used as a pre-processing step for attribute subset selection and classification. Examples include forming groups of customers based on purchasing patterns and categorizing genes with similar functionality. The different types of clustering methods are partitioning methods, hierarchical (agglomerative or divisive) methods, density-based methods, grid-based methods, and model-based methods.
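To make the idea of a partitioning method concrete, here is a minimal k-means sketch in plain Python. The customer points (visits per month, average basket size) are invented, and initialising the centroids with the first k points is a simplifying assumption; practical implementations usually use random or smarter initialisation.

```python
import math


def kmeans(points, k, iterations=10):
    """Minimal k-means: assign points to the nearest centroid, then re-average."""
    centroids = list(points[:k])  # naive initialisation (assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (visits per month, average basket size).
customers = [(1, 2), (8, 9), (2, 1), (9, 8), (1, 1), (9, 9)]
centroids, clusters = kmeans(customers, k=2)
print(clusters[0])  # [(1, 2), (2, 1), (1, 1)]
print(clusters[1])  # [(8, 9), (9, 8), (9, 9)]
```

The two groups correspond to a dense region of low-activity customers and a dense region of high-activity customers, with the sparse space between them left empty.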
C. Regression
The regression technique can be adapted for prediction. Regression analysis models the relationship between one or more independent variables and a dependent variable. In data mining, the attributes that are already known are the independent variables, and what we want to predict is the response variable. Unfortunately, many real-world problems are not simple prediction problems. For instance, sales volumes, stock prices, and product failure rates are all very difficult to predict because they may depend on complex interactions of multiple predictor variables. Therefore, more complex techniques (e.g., logistic regression, decision trees, or neural nets) may be necessary to forecast future values. The same model types can often be used for both regression and classification. For example, the CART (Classification and Regression Trees) decision tree algorithm can build both classification trees (to classify categorical response variables) and regression trees (to forecast continuous response variables). Neural networks can also create both classification and regression models.
The different types of regression methods are linear regression, multivariate linear regression, nonlinear regression, and multivariate nonlinear regression.
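The simplest of these, linear regression with one independent variable, can be sketched with the ordinary least-squares formulas. The advertising and sales figures below are made up for illustration, and fit_line and predict are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Forecast the response variable for a new value of the predictor."""
    return slope * x + intercept


# Hypothetical data: advertising spend (independent) vs. sales (response).
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(spend, sales)
print(slope, intercept)              # 2.0 1.0
print(predict(6, slope, intercept))  # 13.0
```

Real data rarely falls exactly on a line as it does here, which is why the more complex techniques mentioned above are often needed.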
D. Association rule
Association and correlation analysis is typically used to find frequent itemsets in large data sets. Such findings help businesses make certain decisions, such as catalogue design, cross-marketing, and customer shopping behavior analysis. Association rule algorithms need to be able to generate rules with confidence values less than one. However, the number of possible association rules for a given data set is generally very large, and a high proportion of the rules are usually of little value.
The different types of association rules are multi-level association rules, multidimensional association rules, and quantitative association rules.
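The two measures usually used to rank association rules, support and confidence, can be computed directly from a list of transactions. The sketch below assumes a tiny invented shopping-basket data set and a minimum support of 0.4; both are illustrative choices, not values from the text.

```python
from itertools import combinations

# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


MIN_SUPPORT = 0.4  # assumed threshold
items = sorted(set().union(*transactions))
rules = []
for a, b in combinations(items, 2):
    pair = {a, b}
    if support(pair) >= MIN_SUPPORT:
        for lhs, rhs in ((a, b), (b, a)):
            rules.append((lhs, rhs, support(pair), confidence({lhs}, {rhs})))

for lhs, rhs, sup, conf in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}")
```

Here the rule bread -> milk has confidence 0.75, an example of a useful rule with confidence below one, while the pair (butter, milk) is filtered out because its support is too low.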
E. Neural networks
A neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so that it can predict the correct class labels of the input tuples. Neural networks have a remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. They are well suited to continuous-valued inputs and outputs. Neural networks are best at identifying patterns or trends in data and are well suited to prediction or forecasting needs.
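Weight adjustment during the learning phase can be illustrated with the smallest possible network, a single perceptron unit. This is a deliberately simplified sketch: the task (learning the logical AND of two inputs), the learning rate, and the number of epochs are assumptions chosen for the example, and real networks use many units and layers.

```python
def train_perceptron(samples, epochs=20, learning_rate=0.1):
    """Train one unit: nudge the weights whenever the predicted label is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            predicted = 1 if activation > 0 else 0
            error = label - predicted  # 0 when correct, +1 or -1 otherwise
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def predict(inputs, weights, bias):
    """Class label produced by the trained unit."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


# Hypothetical training tuples: two binary inputs and their class label (AND).
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
print([predict(x, weights, bias) for x, _ in samples])  # [0, 0, 0, 1]
```

Each wrong prediction shifts the weights slightly toward the correct label, and after a few passes over the data the unit classifies all four training tuples correctly.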
Data mining is an essential process in which intelligent methods are applied to extract data patterns. It plays an important role in finding patterns, forecasting, and discovering knowledge in different fields of information technology. Data mining techniques and algorithms such as classification and clustering help find patterns based on shared characteristics of the data. Data mining has a wide application domain in almost every industry where data is generated. This is why it is considered one of the most important frontiers in database and information systems and one of the most promising interdisciplinary developments in information technology.