Data mining can be used to solve hundreds of business problems. Based on the nature of these problems, we can group them into the following data mining tasks. Trends in data mining follow broader trends in computing and data growth, and any discussion of data mining rests on the concepts of data, information, and knowledge.

Data mining is important to many companies around the world. What is data mining, and why does it matter so much? For many businesses it is a key to success and a source of competitive advantage across a wide range of industries.

## Classification

Classification is one of the most popular data mining tasks. Business problems like churn analysis, risk management, and ad targeting usually involve classification. Classification refers to assigning cases into categories based on a predictable attribute. Each case contains a set of attributes, one of which is the class attribute (the predictable attribute). The task requires finding a model that describes the class attribute as a function of the input attributes. In the College Plans dataset previously described, the class is the College Plans attribute with two states: Yes and No. To train a classification model, you need to know the class value of the input cases in the training dataset, which are usually historical data. Data mining algorithms that require a target to learn against are considered supervised algorithms. Typical classification algorithms include decision trees, neural networks, and Naïve Bayes.
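To make the idea of learning a class attribute from input attributes concrete, here is a minimal Naïve Bayes sketch in plain Python. The attribute names (`parent_encouragement`, `income`) and all training rows are invented for illustration and are not taken from the College Plans dataset; the smoothing also assumes each attribute has exactly two possible values.

```python
from collections import Counter, defaultdict

# Hypothetical training cases: (input attributes, class value).
# Attribute names and values are invented for illustration only.
TRAINING = [
    ({"parent_encouragement": "yes", "income": "high"}, "Yes"),
    ({"parent_encouragement": "yes", "income": "low"}, "Yes"),
    ({"parent_encouragement": "yes", "income": "high"}, "Yes"),
    ({"parent_encouragement": "no", "income": "high"}, "No"),
    ({"parent_encouragement": "no", "income": "low"}, "No"),
    ({"parent_encouragement": "no", "income": "low"}, "No"),
]


def train(cases):
    """Count class frequencies and per-class attribute/value frequencies."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    for attributes, label in cases:
        class_counts[label] += 1
        for name, value in attributes.items():
            value_counts[(label, name)][value] += 1
    return class_counts, value_counts


def predict(model, attributes):
    """Return the class with the highest Naive Bayes score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for name, value in attributes.items():
            seen = value_counts[(label, name)]
            # Add-one smoothing; assumes two possible values per attribute.
            score *= (seen[value] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
prediction = predict(model, {"parent_encouragement": "yes", "income": "low"})
print(prediction)
```

The training step needs the class value of every case, which is what makes this a supervised task.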

## Clustering

Clustering is also called segmentation. It is used to identify natural groupings of cases based on a set of attributes. Cases within the same group have more or less similar attribute values. The figure below displays a simple customer dataset containing two attributes: age and income. The clustering algorithm groups the dataset into three segments based on these two attributes.

Cluster 1 contains the younger population with a low income. Cluster 2 contains middle-aged customers with higher incomes. Cluster 3 is a group of senior individuals with a relatively low income. Clustering is an unsupervised data mining task. No single attribute is used to guide the training process. All input attributes are treated equally. Most clustering algorithms build the model through a number of iterations and stop when the model converges, that is, when the boundaries of these segments have stabilized.
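The iterate-until-convergence behavior can be sketched with k-means, one common clustering algorithm (the text above does not name a specific one, so this is an assumption). The customer values and the hand-picked starting centroids are invented for illustration.

```python
import math

# Hypothetical customers as (age, income in thousands); values are invented.
CUSTOMERS = [
    (22, 18), (25, 22), (27, 20),   # younger, low income
    (41, 80), (45, 90), (48, 85),   # middle-aged, higher income
    (68, 25), (72, 22), (75, 28),   # senior, relatively low income
]


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centroids = list(centroids)  # work on a copy
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # converged: segment boundaries are stable
        assignments = new_assignments
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments, centroids


# Starting centroids are picked by hand here; real tools usually pick them randomly.
labels, centers = k_means(CUSTOMERS, [CUSTOMERS[0], CUSTOMERS[3], CUSTOMERS[6]])
print(labels)
```

Note that no attribute plays the role of a target: age and income both feed the distance calculation equally.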

## Association

Association is another popular data mining task. Association is also called market basket analysis. A typical association business problem is to analyze a sales transaction table and identify those products often sold in the same shopping basket. The common usage of association is to identify common sets of items (frequent itemsets) and rules for the purpose of cross-selling.

In terms of association, each product, or more generally each attribute/value pair, is considered an item. The association task has two goals: to find frequent itemsets and to find association rules.

Most association-type algorithms find frequent itemsets by scanning the dataset multiple times. The frequency threshold (support) is defined by the user before processing the model. For example, support = 2% means that the model analyzes only items that appear in at least 2% of shopping carts. A frequent itemset may look like {Product = “Pepsi”, Product = “Chips”, Product = “Juice”}. Each itemset has a size, which is the number of items that it contains. The size of this particular itemset is 3.
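A brute-force sketch of the support calculation is shown below. The baskets are invented, and the function name `frequent_itemsets` and its `max_size` parameter are hypothetical; production algorithms prune candidates far more aggressively than this.

```python
from itertools import combinations

# Hypothetical shopping baskets; the products are invented for illustration.
BASKETS = [
    {"Pepsi", "Chips", "Juice"},
    {"Pepsi", "Chips", "Juice"},
    {"Pepsi", "Chips"},
    {"Milk", "Cheese"},
    {"Pepsi", "Juice"},
]


def frequent_itemsets(baskets, min_support, max_size=3):
    """Return {itemset: support} for itemsets meeting the support threshold."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            # One pass over the data per candidate; brute force, fine for a sketch.
            count = sum(1 for basket in baskets if set(candidate) <= basket)
            support = count / len(baskets)
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result


itemsets = frequent_itemsets(BASKETS, min_support=0.4)
for itemset, support in sorted(
    itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(itemset), support)
```

With a 40% threshold, Milk and Cheese drop out because each appears in only one of the five baskets.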

Apart from identifying frequent itemsets based on support, most association-type algorithms also find rules. An association rule has the form A, B => C with a probability, where A, B, and C are all frequent itemsets. The probability is also referred to as the confidence in data mining literature.

The probability is a threshold value that the user needs to specify before training an association model. For example, the following is a typical rule: Product = “Pepsi”, Product = “Chips” => Product = “Juice” with an 80% probability. The interpretation of this rule is straightforward. If a customer buys Pepsi and chips, there is an 80% chance that he or she may also buy juice. The figure above displays the product association patterns. Each node in the figure represents a product, and each edge represents a relationship, much as in an entity-relationship diagram. The direction of the edge represents the direction of the prediction. For example, the edge from Milk to Cheese indicates that those who purchase milk might also purchase cheese.
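The 80% figure in a rule like this can be read as a conditional frequency. The following sketch computes it from a handful of invented baskets; the function name `confidence` is hypothetical.

```python
# Hypothetical shopping baskets, invented for illustration.
BASKETS = [
    {"Pepsi", "Chips", "Juice"},
    {"Pepsi", "Chips", "Juice"},
    {"Pepsi", "Chips", "Juice"},
    {"Pepsi", "Chips", "Juice"},
    {"Pepsi", "Chips"},
    {"Milk", "Cheese"},
]


def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) from basket counts."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)


rule_confidence = confidence(BASKETS, {"Pepsi", "Chips"}, {"Juice"})
print(f"Pepsi, Chips => Juice: {rule_confidence:.0%}")
```

Five baskets contain both Pepsi and chips, and four of those also contain juice, which gives the rule its confidence.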

## Regression

The regression task is similar to classification. The main difference is that the predictable attribute is a continuous number. Regression techniques have been widely studied for centuries in the field of statistics. Linear regression and logistic regression are the most popular regression methods. Other regression techniques include regression trees and neural networks.

Regression tasks can solve many business problems. For example, they can be used to predict coupon redemption rates based on the face value, distribution method, and distribution volume, or to predict wind velocities based on temperature, air pressure, and humidity.
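As a minimal illustration of predicting a continuous number, the sketch below fits a straight line by ordinary least squares using a single input attribute. The face values and redemption rates are invented, and a real coupon model would use more inputs than this.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: coupon face value (dollars) and redemption rate (percent).
face_values = [1.0, 2.0, 3.0, 4.0]
redemption_rates = [2.0, 4.0, 6.0, 8.0]

intercept, slope = fit_line(face_values, redemption_rates)
predicted = intercept + slope * 5.0  # predicted rate for a $5 coupon
print(predicted)
```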

## Forecasting

Forecasting is yet another important data mining task. What will the stock value of MSFT be tomorrow? What will the sales amount of Pepsi be next month? Forecasting can help to answer these questions. It usually takes a time series dataset as input, for example a sequence of numbers with an attribute representing time. Time series data typically contains adjacent observations, which are order-dependent. Forecasting techniques deal with general trends, periodicity, and noise filtering. The most popular time series technique is ARIMA, which stands for AutoRegressive Integrated Moving Average.

The figure below contains two curves. The solid line curve is the actual time series data on Microsoft stock value, while the dotted curve is a time series model based on the moving average forecasting technique.
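The moving average idea behind the dotted curve can be sketched in a few lines. The prices below are invented and are not real Microsoft stock data; the function names and the window size of 3 are assumptions made for the example.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


def smoothed(series, window):
    """Moving-average curve: one value per position once the window is full."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical daily closing prices; not real market data.
prices = [25.0, 26.0, 27.0, 26.0, 28.0, 30.0]

curve = smoothed(prices, window=3)
next_value = moving_average_forecast(prices, window=3)
print(curve, next_value)
```

A larger window filters more noise but reacts more slowly to a change in trend.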

## Sequence Analysis

Sequence analysis is used to find patterns in a discrete series. A sequence is composed of a series of discrete values (or states). For example, a DNA sequence is a long series composed of four different states: A, G, C, and T. A Web click sequence contains a series of URLs. Customer purchases can also be modeled as sequence data. For example, a customer first buys a computer, then speakers, and finally a Webcam. Both sequence and time series data contain adjacent observations that are dependent. The difference is that the sequence series contains discrete states, while the time series contains continuous numbers.

Sequence and association data are similar in the sense that each individual case contains a set of items or states. The difference between sequence and association models is that sequence models analyze the state transitions, while the association model considers each item in a shopping cart to be equal and independent. With the sequence model, buying a computer before buying speakers is a different sequence than buying speakers before a computer. With an association algorithm, these are considered to be the same itemset.
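The distinction comes down to whether order is part of the data. A tiny sketch, with invented purchases, compares the same two items as ordered sequences and as unordered itemsets:

```python
# Two hypothetical purchase histories containing the same items in a different order.
purchase_a = ["computer", "speakers"]
purchase_b = ["speakers", "computer"]

# A sequence model keeps the order, so these are different cases.
same_sequence = tuple(purchase_a) == tuple(purchase_b)

# An association model sees an unordered itemset, so these are the same case.
same_itemset = frozenset(purchase_a) == frozenset(purchase_b)

print(same_sequence, same_itemset)
```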

Figure 1.6 displays Web click sequences as state transitions among a set of URL categories, based on Web click data. Each node is a URL category. Each line has a direction, representing a transition between two URLs. Each transition is associated with a weight, representing the probability of the transition between one URL and the other. Sequence analysis is a relatively new data mining task. It is becoming more important mainly due to two types of applications: Web log analysis and DNA analysis. There are several different sequence techniques available today, such as Markov chains. Researchers are actively exploring new algorithms in this field.
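The transition weights described here can be estimated with a first-order Markov chain. In this sketch the click sessions and URL category names are invented, and `transition_probabilities` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Hypothetical click sessions over URL categories, invented for illustration.
SESSIONS = [
    ["Home", "News", "Sports"],
    ["Home", "News", "Weather"],
    ["Home", "Sports"],
    ["News", "Sports"],
]


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities from sequences."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Pair each state with the state that immediately follows it.
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }


transitions = transition_probabilities(SESSIONS)
print(transitions["Home"])
```

Each inner dictionary corresponds to the outgoing edges of one node, and its weights sum to 1.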

## Deviation Analysis

Deviation analysis finds those rare cases that behave very differently from others. It is also called outlier detection, which refers to the detection of significant changes from previously observed behavior. Deviation analysis can be used in many applications. The most common one is credit card fraud detection. Identifying abnormal cases among millions of transactions is a very challenging task. Other applications include network intrusion detection, manufacturing error analysis, and so on.

There is no standard technique for deviation analysis. It is still an actively researched topic. Usually analysts employ some modified versions of decision trees, clustering, or neural network algorithms for this task. In order to generate significant rules, analysts need to oversample the anomaly cases in the training dataset.
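Since there is no single standard technique, the sketch below uses one simple statistical rule, the modified z-score based on the median absolute deviation, rather than the tree, clustering, or neural network approaches mentioned above. The transaction amounts are invented, and the constants 0.6745 and 3.5 are conventional choices for this rule, not values taken from this article.

```python
import statistics


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median-based) exceeds the threshold."""
    median = statistics.median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure deviation against
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Hypothetical card transaction amounts; the last one is deliberately unusual.
amounts = [20.0, 25.0, 22.0, 30.0, 27.0, 24.0, 5000.0]

outliers = find_outliers(amounts)
print(outliers)
```

A median-based score is used here because a single extreme value would inflate the mean and standard deviation and could hide itself.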

Data mining can be implemented using OLE DB for Data Mining, an approach that can make these tasks easier to put into practice. Modern data mining software is built to support such tasks directly.