Prepared by: Minh Viet Le
Caulfield School of Information Technology,
Faculty of Information Technology,
Monash University
Date created: May 2006
Last modified date: 15 November 2006
We assume that we have a set of transactions, each transaction being a list of items (e.g. books). Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears.
The 1% presence of X and Y together is called the support (or prevalence) of the rule, and the 80% is called the confidence (or predictability) of the rule.
The support for X => Y is the probability of both X and Y appearing together, that is P(X & Y).
The confidence of X => Y is the conditional probability of Y appearing, given that X exists, that is P(Y | X).
support(X => Y) = P(X & Y) = (total transactions containing both X and Y) / (total transactions being studied)
confidence(X => Y) = P(Y | X) = (total transactions containing both X and Y) / (total transactions containing X)
Confidence refers to the strength of the association.
Support indicates the frequency of the pattern. A minimum level of support is needed if an association is to be of any business value.
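The two formulas above can be checked on a toy data set. The following is a minimal sketch, not part of the original tutorial: the function names `support` and `confidence` and the five sample transactions are hypothetical and chosen only for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical transactions (illustrative data only).
transactions = [
    {"X", "Y"},
    {"X", "Y", "Z"},
    {"X", "Z"},
    {"Y"},
    {"Z"},
]

# X and Y appear together in 2 of 5 transactions; X appears in 3 of 5.
rule_support = support(transactions, {"X", "Y"})
rule_confidence = confidence(transactions, {"X"}, {"Y"})
print(f"support(X => Y)    = {rule_support:.2f}")
print(f"confidence(X => Y) = {rule_confidence:.2f}")
```

Here the rule X => Y has support 2/5 and confidence 2/3, because two of the three transactions containing X also contain Y.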
Here is the problem: we want to find all the association rules that have at least p% support with at least q% confidence.
The Apriori algorithm can be employed to solve this problem.
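As a rough illustration of how Apriori approaches this problem, the sketch below builds frequent itemsets level by level and then derives rules from them. It is a simplified teaching version under stated assumptions, not an optimised or authoritative implementation: the names `apriori` and `rules`, the thresholds, and the grocery transactions are all hypothetical. The key idea it relies on is that every subset of a frequent itemset must itself be frequent, so candidates with an infrequent subset can be pruned without counting them.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those meeting the support threshold.
        kept = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                kept[c] = s
        return kept

    level = frequent({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent


def rules(freq, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    found = []
    for itemset, supp in freq.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = supp / freq[ante]  # subsets of a frequent set are frequent
                if conf >= min_confidence:
                    found.append((ante, itemset - ante, supp, conf))
    return found


# Hypothetical transactions (illustrative data only).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
freq = apriori(baskets, min_support=0.5)     # p = 50%
found = rules(freq, min_confidence=0.7)      # q = 70%
for ante, cons, supp, conf in found:
    print(sorted(ante), "=>", sorted(cons), f"support={supp:.2f} confidence={conf:.2f}")
```

With these thresholds, butter is dropped at the first level because it appears in only two of five baskets, so no itemset containing butter is ever counted.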
Online Analytical Processing (OLAP) Questions | Data Mining Questions |
Which customers defaulted on their mortgages in the last two years? | Which customers are likely to be bad credit risks? |
Which customers switched to other phone companies last year? | Which customers are likely to switch to the competition next year? |
Which salespersons sold more than their quota during the last four quarters? | Which salespersons are expected to exceed their quotas next year? |
Which IT subjects were in demand in the job market in the past two years? | Which IT subjects at university are likely to attract Year 12 students? |
Last year, which stores exceeded their total sales for the prior year? | For the next two years, which stores are likely to have the best performance? |
What were the sales by territory last quarter compared to the targets? | What are the anticipated sales by territory and region for next year? |
Who are our top 100 customers for the last three years? | Which 100 customers offer the best profit potential? |
Last year, which were the top five best-performing promotions? | What are the expected returns for next year’s promotions? |
Which oil price range is reasonable? | What is likely to reduce oil prices? |
Here is the problem: separating or ordering objects into classes. Usually the set of classes is pre-defined, and a set of training samples is used to build the class models.
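To make the idea of pre-defined classes and training samples concrete, here is a minimal sketch using a nearest-neighbour rule. This technique is chosen here only for illustration and is not taken from the tutorial; the function name `classify`, the two features (income and debt, in thousands) and all sample values are hypothetical.

```python
import math

# Hypothetical training samples: (income, debt) in thousands, with a known class.
training = [
    ((30, 25), "bad risk"),
    ((35, 30), "bad risk"),
    ((80, 10), "good risk"),
    ((90, 5), "good risk"),
]


def classify(sample, training):
    """Assign `sample` the class of its closest training sample (1-nearest neighbour)."""
    _, label = min(training, key=lambda pair: math.dist(sample, pair[0]))
    return label


print(classify((85, 8), training))
print(classify((32, 28), training))
```

The classes ("good risk", "bad risk") are fixed in advance, and the training samples play the role of the class model: a new record is placed in whichever class its most similar known record belongs to.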
Prediction problems are essentially the same as classification and estimation problems, but they involve future behaviour. Historical data is used to build a model that explains behaviour (outputs) for known inputs. The model is then applied to current inputs to predict future outputs. The following are some examples of prediction problems.
The following data mining techniques are often used to solve prediction problems.
Clustering is also sometimes referred to as segmentation (though this has other meanings in other fields). In clustering there are no pre-defined classes. A similarity measure is used to group records. The user must attach meaning to the clusters formed.
Clustering often precedes some other data mining task. For example, once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster.
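As an illustration of grouping records by a similarity measure without pre-defined classes, the sketch below uses k-means with Euclidean distance, where a smaller distance means greater similarity. k-means is one common method picked here for illustration, not necessarily the one the tutorial has in mind. The function name `kmeans`, the choice of the first k points as initial centroids, and the customer data (age, annual spend in thousands) are all assumptions.

```python
import math


def kmeans(points, k, iterations=10):
    """Group `points` into k clusters by repeatedly assigning each point
    to its nearest centroid and recomputing the centroids."""
    # Assumption: the first k points are used as initial centroids so that
    # the example is deterministic; real implementations usually randomise.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customer records: (age, annual spend in thousands).
customers = [(25, 2), (27, 3), (24, 2.5), (60, 9), (62, 10), (58, 8.5)]
centroids, clusters = kmeans(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The algorithm only returns two groups of records. Deciding that one group means "younger, low-spending customers" and the other "older, high-spending customers" is left to the user, as noted above.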
The following are current clustering analysis methods.