Minh Viet Le Website

Important Topics for a Data Mining Course

Prepared by: Minh Viet Le

Caulfield School of Information Technology,

Faculty of Information Technology,

Monash University

Date created: May 2006

Last modified date: 15 November 2006

Association Analysis (Basket Analysis Problems)

We assume that we have a set of transactions, each transaction being a list of items (e.g. books). Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears.

The 1% presence of X and Y together is called the support (or prevalence) of the rule, and the 80% is called the confidence (or predictability) of the rule.

The support for X => Y is the probability of both X and Y appearing together, that is P(X & Y).

The confidence of X => Y is the conditional probability of Y appearing, given that X exists, that is P(Y | X).

support(X => Y) = P(X & Y) = (total transactions containing both X and Y) / (total transactions being studied)

confidence(X => Y) = P(Y | X) = (total transactions containing both X and Y) / (total transactions containing X)
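
The two formulas above can be sketched in a few lines of Python. This is only an illustration: it assumes each transaction is stored as a set of item names, and the sample data and the function names `support` and `confidence` are made up for the example.

```python
def support(transactions, x, y):
    """Fraction of all transactions that contain every item of X and of Y."""
    both = x | y
    return sum(1 for t in transactions if both <= t) / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        raise ValueError("X never appears, so the confidence is undefined")
    both = x | y
    return sum(1 for t in with_x if both <= t) / len(with_x)


# Hypothetical basket data, invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"butter"},
]
x, y = {"bread"}, {"milk"}
print(support(transactions, x, y))     # 2 of the 5 transactions contain both
print(confidence(transactions, x, y))  # 2 of the 3 bread transactions contain milk
```

Note that the two measures share a numerator and differ only in the denominator, which is why a rule can have low support and high confidence at the same time.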

Confidence refers to the strength of the association.

Support indicates the frequency of the pattern. A minimum level of support is needed if an association is to be of any business value.

Here is the problem: we want to find all the association rules that have at least p% support with at least q% confidence.

The Apriori algorithm can be employed to solve this problem.
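
A minimal sketch of the level-wise Apriori idea follows, assuming transactions are sets of item names and that the thresholds p and q are given as fractions (`min_support`, `min_confidence`) rather than percentages. The function names and the sample data are hypothetical, and the code favours readability over the efficiency of a real implementation.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    frequent = {}
    # Level 1 candidates: every single item seen in the data.
    level = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while level:
        # Count each candidate and keep those meeting the threshold.
        current = {}
        for candidate in level:
            s = sum(1 for t in transactions if candidate <= t) / n
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {c for c in joined
                 if all(frozenset(sub) in current
                        for sub in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_confidence):
    """List (X, Y, support, confidence) for each rule X => Y that qualifies."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for x in map(frozenset, combinations(itemset, r)):
                conf = s / frequent[x]  # subsets of a frequent set are frequent
                if conf >= min_confidence:
                    found.append((x, itemset - x, s, conf))
    return found


# Hypothetical basket data, invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"butter"},
]
frequent = apriori(transactions, min_support=0.4)
found_rules = rules(frequent, min_confidence=0.6)
for x, y, s, c in found_rules:
    print(sorted(x), "=>", sorted(y), round(s, 2), round(c, 2))
```

The pruning step relies on the fact that an itemset cannot be frequent unless all of its subsets are frequent, which is what keeps the number of candidates manageable.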

Classification Problems (Supervised Learning)

Online Analytical Processing (OLAP) Questions	Data Mining Questions
Which customers defaulted on their mortgages in the last two years?	Which customers are likely to be bad credit risks?
Which customers switched to other phone companies last year?	Which customers are likely to switch to the competition next year?
Which salespersons sold more than their quota during the last four quarters?	Which salespersons are expected to exceed their quotas next year?
Which IT subjects were in demand in the job market in the past two years?	Which IT subjects at university are likely to attract Year 12 students?
Last year, which stores exceeded their total prior-year sales?	For the next two years, which stores are likely to have the best performance?
What were the sales by territory last quarter compared to the targets?	What are the anticipated sales by territory and region for next year?
Who are our top 100 best customers for the last three years?	Which 100 customers offer the best profit potential?
Last year, which were the top five promotions that performed well?	What are the expected returns for next year’s promotions?
Which oil price range is reasonable?	What is likely to reduce oil prices?

Here is the problem: objects must be separated or ordered into classes. Usually the set of classes is pre-defined, and a set of training samples is used to build the class models.
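
As one very simple illustration of building class models from training samples, the sketch below uses a nearest-centroid classifier. This technique is chosen here only because it is short; it is not one of the methods this tutorial prescribes. The feature vectors (income, debt), the class labels and the function names are all invented for the example.

```python
from collections import defaultdict
from math import dist


def train(samples):
    """Build one model per class: the mean (centroid) of its training vectors."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {label: tuple(sum(column) / len(rows) for column in zip(*rows))
            for label, rows in grouped.items()}


def classify(model, features):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical training samples: (income, debt) labelled with a credit class.
training = [
    ((60.0, 5.0), "good"), ((80.0, 10.0), "good"),
    ((20.0, 30.0), "bad"), ((30.0, 40.0), "bad"),
]
model = train(training)
print(classify(model, (70.0, 8.0)))
print(classify(model, (25.0, 35.0)))
```

The classes ("good", "bad") are fixed in advance by the labelled samples, which is what makes this supervised learning.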

Applications:

Applications in the Telecommunications Industry (e.g. service quality analysis, customer profiling)
Applications in Banking and Finance (e.g. credit approval analysis, financial data analysis)
Applications in the Pharmaceutical Industry and Medicine (e.g. drug discovery, disease diagnosis)

Estimation Problems

Estimate the house price ranges that current young couples can afford to buy.
Estimate the time it takes for the reserve bank to change its interest rate.
Estimate the likelihood that 100% of students succeed in completing their degrees within the scheduled time.
Estimate the likelihood that a credit card has been stolen.

Prediction Problems

Prediction problems are essentially the same as classification and estimation problems, but they involve future behaviour. Historical data is used to build a model that explains behaviour (outputs) for known inputs. The model is then applied to current inputs to predict future outputs. Some prediction problems are listed below.

Financial forecasting
Exchange rate forecasting
Futures price forecasting
Stock performance and selection prediction
Sales forecasting
Premium pricing
Predict which customers will respond to a promotion

The following data mining techniques are often used to solve prediction problems.

Linear and multiple regression analysis
Nonlinear regression analysis (e.g. neural networks)
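
The simplest of these techniques, linear regression with a single input, can be sketched as follows. The code fits a straight line to historical data by ordinary least squares and then applies it to a new input. The sales figures and the name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: sales observed in periods 1 to 5.
periods = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

# Build the model from known inputs and outputs.
slope, intercept = fit_line(periods, sales)

# Apply the model to a new input to predict a future output.
forecast = slope * 6 + intercept
print(slope, intercept, forecast)
```

Multiple regression follows the same pattern with several input variables, at which point a numerical library is usually a better choice than hand-written sums.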

Clustering Analysis (Unsupervised Clustering Problems)

Clustering is also sometimes referred to as segmentation (though this has other meanings in other fields). In clustering there are no pre-defined classes. A similarity measure is used to group records. The user must attach meaning to the clusters formed.

Clustering often precedes some other data mining task. For example, once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster.

The following are current clustering analysis methods:

Partitioning methods

K-means algorithm

The K-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data.

K-medoids algorithm

Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
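
The outlier sensitivity of K-means noted in the list above can be seen in a small sketch. It covers only the step that computes a cluster's centre, on one-dimensional data, and is not a full K-means or K-medoids implementation. The values and the function names are invented for the example.

```python
def mean_centre(values):
    """K-means style centre: the arithmetic mean of the cluster."""
    return sum(values) / len(values)


def medoid_centre(values):
    """K-medoids style centre: the member with the smallest total
    distance to all other members of the cluster."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


cluster = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = cluster + [100.0]  # one extremely large value

print(mean_centre(cluster), medoid_centre(cluster))
print(mean_centre(with_outlier), medoid_centre(with_outlier))
```

With the outlier added, the mean moves far outside the original range of 1 to 5, while the medoid, which must be an actual member of the cluster, stays within it.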

Current Applications of Data Mining Techniques

Text databases (e.g. new knowledge discovery, searching for interesting patterns)
Time-series and sequence data (e.g. stock market analysis, share price analysis)
Spatial database mining
Multimedia database mining
Web mining (e.g. customised website interfaces for clients and customers)
Data mining for the retail industry (e.g. association rules, market basket analysis)
Mining social networks
Graph mining
Streaming data mining

A Summary of Current Data Mining Techniques

Apriori algorithm (association rules analysis)
Neural networks (backpropagation and Self Organising Map (SOM))
Bayesian classification
Decision trees
Genetic algorithms
K-Means and K-Medoids algorithms
Support vector machines
Linear regression
Non-linear regression

Some Practical Research Questions

What new knowledge of customers can we discover from 350 million annual transactions handled by the UK’s largest credit card company?
What new knowledge of customers can we discover from 200 million daily long distance phone calls handled by AT&T?
What is a reliable method to measure the interestingness of rules found by association analysis techniques?

References

Gopal Gupta, "Lecture Notes in Data Mining," Faculty of Information Technology, Monash University, Australia, 2004.
Paulraj Ponniah, "Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals," Chichester : Wiley, 2001.
Jiawei Han & Micheline Kamber, "Data Mining: Concepts and Techniques," San Francisco, Calif. : Morgan Kaufmann, 2001.
Jiawei Han & Micheline Kamber, "Data Mining: Concepts and Techniques," 2nd ed, San Francisco, Calif. : Morgan Kaufmann, 2006.
David J. Hand, "Statistics and Data Mining: Intersecting Disciplines," ACM SIGKDD Explorations, vol. 1, no. 1, pp. 16-19, June 1999.
