Minh Viet Le Website



Important Topics for a Data Mining Course

Prepared by: Minh Viet Le
Caulfield School of Information Technology,
Faculty of Information Technology,
Monash University
Date created: May 2006
Last modified date: 15 November 2006
  1. Association Analysis (Basket Analysis Problems)

    We assume that we have a set of transactions, each transaction being a list of items (e.g. books). Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears.

    The 1% presence of X and Y together is called the support (or prevalence) of the rule, and the 80% is called the confidence (or predictability) of the rule.

    The support for X => Y is the probability of both X and Y appearing together, that is P(X & Y).

    The confidence of X => Y is the conditional probability of Y appearing, given that X exists, that is P(Y | X).

    support(X => Y) = P(X & Y) = (total transactions containing both X and Y) / (total transactions being studied)

    confidence(X => Y) = P(Y | X) = (total transactions containing both X and Y) / (total transactions containing X)
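
The two ratios above can be computed directly from a list of transactions. The following is a minimal sketch, not part of the original notes: the function names and the five toy transactions are assumptions made for illustration only.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Estimate P(Y | X): transactions with both X and Y over those with X."""
    x, y = set(x), set(y)
    with_x = [t for t in transactions if x <= set(t)]
    if not with_x:
        return 0.0  # X never appears, so the rule has no evidence at all
    both = sum(1 for t in with_x if y <= set(t))
    return both / len(with_x)


# Hypothetical toy data: five transactions over the items X, Y and Z.
transactions = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Z"},
    {"Y", "Z"},
    {"Z"},
]

print(support(transactions, {"X", "Y"}))       # 2 of the 5 transactions
print(confidence(transactions, {"X"}, {"Y"}))  # 2 of the 3 transactions with X
```

In this toy data X and Y appear together in 2 of 5 transactions, and Y appears in 2 of the 3 transactions that contain X.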

    Confidence refers to the strength of the association.

    Support indicates the frequency of the pattern. A minimum level of support is needed if an association is to have any business value.

    The problem is to find all association rules that have at least p% support and at least q% confidence.

    The Apriori algorithm can be employed to solve this problem.
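
As a rough illustration of how a level-wise search in the style of Apriori can answer this question, here is a simplified sketch. It is an assumption-laden teaching version rather than an optimised implementation: the function names `apriori` and `rules`, the thresholds, and the toy transactions are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        found = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                found[c] = s
        return found

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([i]) for i in items})
    all_frequent = dict(level)
    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent


def rules(frequent_sets, min_confidence):
    """List (antecedent, consequent, support, confidence) for qualifying rules."""
    out = []
    for itemset, sup in frequent_sets.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                # Subsets of a frequent itemset are frequent, so the lookup succeeds.
                conf = sup / frequent_sets[antecedent]
                if conf >= min_confidence:
                    out.append((antecedent, itemset - antecedent, sup, conf))
    return out


# Hypothetical toy data and thresholds (p = 40% support, q = 60% confidence).
transactions = [{"X", "Y", "Z"}, {"X", "Y"}, {"X", "Z"}, {"Y", "Z"}, {"Z"}]
frequent_sets = apriori(transactions, 0.4)
found = rules(frequent_sets, 0.6)
for antecedent, consequent, sup, conf in found:
    print(sorted(antecedent), "=>", sorted(consequent), sup, round(conf, 2))
```

The prune step is what separates this from brute force: a candidate is counted against the data only if all of its smaller subsets were already found to be frequent.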


  2. Classification Problems (Supervised Learning)

    On-Line Analytical Processing (OLAP) Questions | Data Mining Questions
    Which customers defaulted on their mortgages in the last two years? | Which customers are likely to be bad credit risks?
    Which customers switched to other phone companies last year? | Which customers are likely to switch to the competition next year?
    Which salespersons sold more than their quota during the last four quarters? | Which salespersons are expected to exceed their quotas next year?
    Which IT subjects were in demand in the job market in the past two years? | Which IT subjects at university are likely to attract Year 12 students?
    Last year, which stores exceeded the total prior year sales? | For the next two years, which stores are likely to have the best performance?
    What were the sales by territory last quarter compared to the targets? | What are the anticipated sales by territory and region for next year?
    Who are our top 100 customers for the last three years? | Which 100 customers offer the best profit potential?
    Last year, which were the top five promotions that performed well? | What are the expected returns for next year’s promotions?
    Which oil price range is reasonable? | What is likely to reduce oil prices?

    The problem is the separation or ordering of objects into classes. The set of classes is usually pre-defined, and a set of training samples is used to build the class models.
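
To make the idea of building class models from training samples concrete, here is a minimal sketch using a nearest-centroid classifier. This technique is chosen only for brevity and is not named in the notes. The feature pairs (income, debt) and the labels "good" and "bad" are hypothetical.

```python
from math import dist


def train(samples):
    """Build one model per class: the mean (centroid) of its training points."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(model, features):
    """Assign the pre-defined class whose centroid is closest to `features`."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical training samples: (income, debt) in thousands, with a known class.
training = [
    ((60, 10), "good"),
    ((75, 5), "good"),
    ((20, 40), "bad"),
    ((25, 50), "bad"),
]

model = train(training)
print(classify(model, (70, 8)))
print(classify(model, (22, 42)))
```

The classes are fixed in advance by the labels in the training set, which is what distinguishes this from clustering.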

    Applications:

    1. Applications in the Telecommunications Industry (e.g. service quality analysis, customer profiling)
    2. Applications in Banking and Finance (e.g. credit approval analysis, financial data analysis)
    3. Applications in the Pharmaceutical Industry and Medicine (e.g. drug discovery, disease diagnosis)

  3. Estimation Problems

    Unlike classification problems, which handle categorical attributes, estimation problems require numerical outputs. Some examples of estimation problems follow.
    • Estimate the house price ranges that current young couples can afford to buy.
    • Estimate the time it takes for the reserve bank to change its interest rate.
    • Estimate the likelihood that 100% of students succeed in completing their degrees within the scheduled time.
    • Estimate the likelihood that a credit card has been stolen.
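
One simple way to produce a numerical output of this kind is to average the known values of the most similar past records. The sketch below uses k-nearest-neighbour averaging, which the notes do not name. The records, pairing an annual income with an affordable house price (both in thousands), are invented purely for illustration.

```python
def knn_estimate(records, query, k=3):
    """Average the target value of the k records whose input is closest to `query`."""
    nearest = sorted(records, key=lambda r: abs(r[0] - query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical records: (annual income, affordable house price), in thousands.
records = [(40, 200), (50, 250), (60, 300), (80, 400), (100, 500)]

# Estimate for an income of 55 using the two closest records (50 and 60).
print(knn_estimate(records, 55, k=2))
```

The output is a number on a continuous scale rather than a class label, which is the defining feature of an estimation problem.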

  4. Prediction Problems

    Prediction problems are essentially the same as classification and estimation problems, but they involve future behaviour. Historical data is used to build a model that explains behaviour (outputs) for known inputs. The model is then applied to current inputs to predict future outputs. Some examples of prediction problems follow.

    • Financial forecasting
    • Exchange rate forecasting
    • Futures price forecasting
    • Stock performance and selection prediction
    • Sales forecasting
    • Premium pricing
    • Predicting which customers will respond to a promotion

    The following data mining techniques are often used to solve prediction problems.

    1. Linear and multiple regression analysis
    2. Nonlinear regression analysis (e.g. neural networks)
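
As an illustration of the first technique, the sketch below fits a straight line to historical data by ordinary least squares and then applies it to a future input. The function names and the quarterly sales figures are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data: sales recorded for quarters 1 to 4.
quarters = [1, 2, 3, 4]
sales = [10, 12, 14, 16]

model = fit_line(quarters, sales)
print(model)              # slope and intercept learned from the past
print(predict(model, 5))  # forecast for the next quarter
```

The model is built only from known inputs and outputs, then reused on an input whose output has not yet been observed.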

  5. Clustering Analysis (Unsupervised Clustering Problems)

    Clustering is sometimes also referred to as segmentation (though this term has other meanings in other fields). In clustering there are no pre-defined classes. A similarity measure is used to group records, and the user must attach meaning to the clusters formed.

    Clustering often precedes some other data mining task. For example, once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster.

    Current clustering analysis methods include the following.

    • Partitioning methods
      1. K-means algorithm
        • The K-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data.
      2. K-medoids algorithm
    • Hierarchical methods
    • Density-based methods
    • Grid-based methods
    • Model-based methods
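
The outlier sensitivity of K-means noted above can be seen in a small experiment. The sketch below is a simplified one-dimensional K-means with fixed starting centroids. The data values and the helper `medoid` are assumptions for illustration, and `medoid` shows only the representative a K-medoids method would pick for a single cluster, not the full algorithm.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Simplified K-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


def medoid(cluster):
    """The actual data point with the smallest total distance to the others."""
    return min(cluster, key=lambda m: sum(abs(m - p) for p in cluster))


# Hypothetical data: two obvious groups, then the same data plus one outlier.
clean = [1, 2, 3, 10, 11, 12]
with_outlier = clean + [100]

clean_centroids, _ = kmeans_1d(clean, [1, 12])
outlier_centroids, outlier_clusters = kmeans_1d(with_outlier, [1, 12])
print(clean_centroids)
print(outlier_centroids, outlier_clusters)

# Mean versus medoid of the group that contains the outlier.
group = [10, 11, 12, 100]
print(sum(group) / len(group), medoid(group))
```

With the outlier present, the mean of the second group is pulled so far to the right that the points 10, 11 and 12 end up closer to the first centroid, and the two natural groups are merged. The medoid of the same group stays on a typical data point.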

  6. Current Applications of Data Mining Techniques
    1. Text databases (e.g. new knowledge discovery, searching for interesting patterns)
    2. Time-series and sequence data (e.g. stock market analysis, share price analysis)
    3. Spatial database mining
    4. Multimedia database mining
    5. Web mining (e.g. customised website interfaces for clients and customers)
    6. Data mining for the retail industry (e.g. association rules, market basket analysis)
    7. Mining social networks
    8. Graph mining
    9. Streaming data mining

  7. A Summary of Current Data Mining Techniques
    • Apriori algorithm (association rules analysis)
    • Neural networks (backpropagation and Self-Organising Map (SOM))
    • Bayesian classification
    • Decision trees
    • Genetic algorithms
    • K-Means and K-Medoids algorithms
    • Support vector machines
    • Linear regression
    • Non-linear regression

  8. Some Practical Research Questions
    1. What new knowledge of customers can we discover from the 350 million annual transactions handled by the UK’s largest credit card company?
    2. What new knowledge of customers can we discover from the 200 million daily long-distance phone calls handled by AT&T?
    3. What is a reliable method for measuring the interestingness of rules found by association analysis techniques?

