dm online unit 1 p1 data mining concepts pdf

Upload: indiguy141

Post on 10-Apr-2018


  • 8/8/2019 DM Online UNIT 1 P1 Data Mining Concepts PDF


    CECS 632-50 UNIT 1

    Data Mining Concept

    Mehmed Kantardzic

    Louisville, 2006


    Questions about Data Mining

    What is data mining?
    Why data mining: motivation and benefits?
    What kind of data to mine?
    When to mine the data?
    How to organize the mining process?
    What are challenges in data mining?


    Why Data Mining Now?

    Trends leading to the data flood:

    Bank, telecom, and other business transactions ...
    Scientific data: astronomy, biology, etc.
    Web, text, and e-commerce


    5 million terabytes created in 2002!

    UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data were created in 2002.

    www.sims.berkeley.edu/research/projects/how-much-info-2003/

    The US produces ~40% of new stored data worldwide.


    Largest Databases

    Commercial databases:

    Winter Corp. 2003 Survey: France Telecom has the largest decision-support DB, ~30 TB; AT&T, ~26 TB

    Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session.

    Web:

    Alexa internet archive: 7 years of data, 500 TB
    Google searches 4+ billion pages, many hundreds of TB
    IBM WebFountain, 160 TB (2003)
    Internet Archive (www.archive.org), ~300 TB
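To put the VLBI figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes all 16 telescopes stream continuously for the full 25 days and uses decimal units (1 Gigabit = 10^9 bits), neither of which is stated on the slide.

```python
# Rough data volume for one VLBI observation session.
# Assumptions (not from the slide): continuous streaming, decimal units.
TELESCOPES = 16
BITS_PER_SECOND = 1e9            # 1 Gigabit/second per telescope
SESSION_DAYS = 25
SECONDS_PER_DAY = 24 * 60 * 60


def session_volume_bytes(telescopes, bits_per_second, days):
    """Total bytes produced if every telescope streams for the whole session."""
    total_bits = telescopes * bits_per_second * days * SECONDS_PER_DAY
    return total_bits / 8


volume = session_volume_bytes(TELESCOPES, BITS_PER_SECOND, SESSION_DAYS)
print(f"{volume / 1e12:.0f} TB, or about {volume / 1e15:.2f} PB")
```

Under these assumptions a single session comes to roughly 4,320 TB, which is an order of magnitude above the Web archives listed on the same slide.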


    Do you need more examples?

    MEDLINE text database: 12 million published articles

    Google: 4.2 billion Web pages indexed, 80 million site visitors per day

    CALTRANS loop sensor data: every 30 seconds, thousands of sensors, 2 Gbytes per second

    NASA MODIS satellite: coverage at 250 m resolution, 37 bands, whole earth, every day

    Walmart transaction data: on the order of 100 million transactions per day


    Why Data Mining Now?

    Data explosion causes data wasting:

    Only a small portion (5% - 10%) of the collected data is ever analyzed.

    Data that may never be analyzed continues to be collected at great expense.

    WE ARE DROWNING IN DATA, BUT STARVING FOR KNOWLEDGE!


    Why Data Mining Now?

    Sources of data overload:

    Distributed data sources
    Remote sensing
    Internet
    Multimedia data

    Exponential growth of digital information

    [Figure: number of Internet hosts, growing from about 4 x 10^5 in 1988 to about 2 x 10^7 in 2000]

    Data size and dimensionality are too large for manual analysis and interpretation.

    There exists a gap between data collection and organization capabilities and the ability to analyze large data sets and extract useful information for decision processes.


    Managers Believe

    61% believe that information overload is present in their workplace.
    80% believe the situation will get worse.
    50% ignore large data sets in the current decision process.
    84% store the data for the future without current use or analysis.
    60% believe that the cost of gathering information outweighs its value!


    Why Mine Data Now? Commercial Viewpoint

    Lots of data is being collected and warehoused:
    Web data, e-commerce
    Purchases at department/grocery stores
    Bank/credit card transactions

    Computers have become affordable and more powerful.

    Competitive pressure is strong:
    Provide better, customized services for an edge; information is becoming a product in its own right.

    Data mining may help?


    Why Mine Data Now? Scientific Viewpoint

    Data is collected and stored at enormous speeds (GB/hour):
    remote sensors on a satellite
    telescopes scanning the skies
    microarrays generating gene expression data
    scientific simulations generating terabytes of data

    Traditional techniques are infeasible for raw data.

    Data mining may help scientists with new discoveries:
    in classifying and segmenting data
    in detecting patterns
    in hypothesis formation


    Data Mining Now: Opportunity and Challenges

    [Diagram: four forces converging on data mining (DM)]

    Data rich, knowledge poor (the resource)
    Enabling technology (new sensors, OLAP, parallel computing, Web, etc.)
    Competitive pressure
    Data mining technology mature


    What is data mining?

    The magic phrase used to ...

    put in your resume
    use in a proposal to NSF, NIH, NASA, etc.
    market database software
    sell statistical analysis software
    sell parallel computing hardware
    sell consulting services
    find refuge from the collapse of some AI promises


    Data Mining is NOT

    Brute-force crunching of bulk data
    Blind application of algorithms
    Going to find relationships where none exist
    Presenting data in different ways
    A database-intensive task
    A difficult-to-understand technology requiring an advanced degree in computer science


    Also, Data Mining is NOT

    Data warehousing
    SQL / ad hoc queries / reporting
    Software agents
    Online Analytical Processing (OLAP)
    Data visualization


    What Is Data Mining?

    In many domains there is a shift from classical modeling and analyses based on first principles to developing models and corresponding analyses directly from data.

    [Diagram: Data -> DATA MINING PROCESS -> Model]


    What Is Data Mining?

    Data mining is a process for the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

    Large volumes: 10^6 to 10^12 bytes; we never see the whole data set or put it in the memory of computers.

    Data mining process?

    What knowledge? How to represent and use it?


    What is Data Mining?

    Potential point of confusion:

    The "extracting ore from rock" metaphor does not really apply to the practice of data mining.
    If it did, then standard database queries would fit under the rubric of data mining.

    In practice, DM refers to:
    finding patterns/models across large datasets
    discovering unknown information


    What Is Data Mining?

    "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns/models in data." Fayyad, Piatetsky-Shapiro, Smyth (1996)

    non-trivial process: multiple processes and iterations
    valid: justified patterns/models
    novel: previously unknown
    useful: can be used by the end user
    understandable: by human and machine


    From Data to Knowledge

    ...10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148,712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS

    12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71,59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA

    15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47,63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA

    16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39,2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS...

    Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes

    Annotations on the data: numerical attribute, categorical attribute, missing values, class labels

    IF cell_poly 15 THEN Prediction = VIRUS [87,5%]

    [87,5%]: predictive accuracy
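One common reading of a rule's predictive accuracy is the share of records covered by the rule that actually belong to the predicted class. The sketch below illustrates that reading only. The records are invented toy values, and the `<=` comparison and the threshold are hypothetical, because the comparison operator in the rule on the slide is not legible in this transcript.

```python
# Toy records: (cell_poly, diagnosis). Invented values, for illustration only.
records = [
    (10, "VIRUS"),
    (12, "VIRUS"),
    (14, "BACTERIA"),
    (8, "VIRUS"),
    (40, "BACTERIA"),
    (55, "BACTERIA"),
]


def rule_fires(cell_poly, threshold=15):
    """Hypothetical rule condition; the operator on the slide is unknown."""
    return cell_poly <= threshold


def rule_accuracy(records, predicted_class="VIRUS"):
    """Fraction of covered records whose label matches the rule's prediction."""
    covered = [label for value, label in records if rule_fires(value)]
    if not covered:
        return None  # the rule covers nothing, so accuracy is undefined
    correct = sum(1 for label in covered if label == predicted_class)
    return correct / len(covered)


accuracy = rule_accuracy(records)
print(f"rule covers 4 toy records, accuracy = {accuracy:.1%}")
```

Here the rule covers four toy records and is right on three of them, so its accuracy on this toy set is 75%.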


    Possible Business Discoveries

    Table 1.3 Acme Investors Incorporated

    Customer  Account     Margin   Transaction  Trades/              Favorite    Annual
    ID        Type        Account  Method       Month    Sex  Age    Recreation  Income
    1005      Joint       No       Online       12.5     F    30-39  Tennis      40-59K
    1013      Custodial   No       Broker       0.5      F    50-59  Skiing      80-99K
    1245      Joint       No       Online       3.6      M    20-29  Golf        20-39K
    2110      Individual  Yes      Broker       22.3     M    30-39  Fishing     40-59K
    1001      Individual  Yes      Online       5.0      M    40-49  Golf        60-79K

    Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION)

    What characteristics distinguish between Online and Broker investors? (DISCRIMINATION)

    Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION)
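As a first, very small step toward the DISCRIMINATION question, the five rows of Table 1.3 can be grouped by transaction method and summarized. This is only a sketch: the field names are chosen here for illustration, and five rows are far too few to support any real conclusion.

```python
from collections import defaultdict
from statistics import mean

# Rows from Table 1.3, reduced to the columns used below.
investors = [
    {"id": 1005, "method": "Online", "trades_per_month": 12.5},
    {"id": 1013, "method": "Broker", "trades_per_month": 0.5},
    {"id": 1245, "method": "Online", "trades_per_month": 3.6},
    {"id": 2110, "method": "Broker", "trades_per_month": 22.3},
    {"id": 1001, "method": "Online", "trades_per_month": 5.0},
]

# Group trades/month by transaction method.
by_method = defaultdict(list)
for row in investors:
    by_method[row["method"]].append(row["trades_per_month"])

# Average trades/month per group.
summary = {method: mean(values) for method, values in by_method.items()}

for method, average in sorted(summary.items()):
    print(f"{method}: {average:.2f} trades/month on average")
```

A real study would compare many attributes at once and test whether the differences are larger than chance variation.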


    Data Mining: What's in a Name?

    Information Harvesting
    Knowledge Mining
    Data Mining
    Intelligent Data Analysis
    Knowledge Discovery in Databases
    Data Dredging
    Data Pattern Processing
    Data Archaeology
    Database Mining
    Siftware
    Data Fishing
    Knowledge Extraction


    Data Mining Roots

    Statistics: driven by the notion of a model
    Database Technology: concentration on large amounts of data
    Machine Learning: emphasizes algorithms
    Control Theory: to predict a system's behavior, to explain the interactions
    Artificial Neural Networks
    Pattern Recognition
    Chaos Theory
    Data Visualization

    "It's Mine!!"


    Let the data speak


    Let the data speak

    The data may have quite a lot to say ... but it may just be noise!


    Data Mining Process

    STATE THE PROBLEM

    (COLLECT THE DATA)

    DATA PREPROCESSING

    ESTIMATE THE MODEL

    MODEL INTERPRETATION & CONCLUSIONS


    Data Mining & (or) KDD Process

    Other view: data mining as the core of the knowledge discovery process.

    [Diagram, in order of data flow:]

    Databases
    Data Cleaning
    Data Integration
    Data Warehouse
    Selection
    Task-relevant Data
    Data Mining
    Pattern Evaluation


    Characteristics of Raw Data

    Missing data
    Misrecorded data
    Data may come from different populations (heterogeneous)
    Different structures & formats
    With or without compression
    Redundant
    With implicit temporal & spatial components


    Data Mining Techniques

    Raw Data = Messy Data

    ALGORITHMS for PREPROCESSING:

    - Scaling & Normalization
    - Encoding
    - Outlier Detection & Removal
    - Feature Selection & Composition
    - Data Cleansing & Scrubbing
    - Data Smoothing
    - Missing Data Elimination
    - Sampling
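Three of the preprocessing steps listed above (missing data elimination, outlier removal, and scaling) can be sketched in a few lines. This is a minimal illustration, not the course's method: the z-score cutoff of 2.0, the min-max scaling choice, and the sample readings are all assumptions made here.

```python
from statistics import mean, pstdev


def drop_missing(values):
    """Missing data elimination: discard entries recorded as None."""
    return [v for v in values if v is not None]


def remove_outliers(values, z_max=2.0):
    """Outlier removal: keep values within z_max standard deviations of the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) / sigma <= z_max]


def min_max_scale(values):
    """Scaling: map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy temperature readings with one missing value and one misrecorded value.
raw = [37.0, 38.5, None, 39.3, 38.0, 36.8, 37.4, 38.1, 120.0]

cleaned = remove_outliers(drop_missing(raw))
scaled = min_max_scale(cleaned)
print(cleaned)
print([round(v, 2) for v in scaled])
```

Note that a z-score test is itself sensitive to the outliers it tries to find: on very small samples a single extreme value inflates the standard deviation enough to hide itself, so median-based rules are often preferred in practice.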


    Primary Tasks of Data Mining

    Classification: finding the description of several predefined classes and classifying a data item into one of them.

    Regression: maps a data item to a real-valued prediction variable.

    Clustering: identifying a finite set of categories or clusters to describe the data.

    Summarization: finding a compact description for a subset of data.

    Dependency Modeling: finding a model which describes significant dependencies between variables.

    Deviation and change detection: discovering the most significant changes in the data.
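To make the regression task concrete, the sketch below fits a straight line to a handful of points by ordinary least squares and uses it to map a new data item to a real-valued prediction. The data points are invented for illustration, and a straight line is only one of many possible model forms.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Map a data item x to a real-valued prediction."""
    return slope * x + intercept


# Toy data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction for x = 5: {predict(5.0, slope, intercept):.2f}")
```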


    Data Mining Techniques

    Decision Trees

    Nearest Neighbor Classification

    Neural Networks

    Rule Induction

    K-means Clustering
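As an illustration of the last technique in the list, here is a minimal k-means sketch on one-dimensional data. It assumes the caller supplies the initial centers, and the points are toy values; real implementations work in many dimensions and choose starting centers more carefully.

```python
def assign(points, centers):
    """Put each point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans_1d(points, centers, max_iterations=100):
    """Alternate assignment and center update until the centers stop moving."""
    centers = list(centers)
    for _ in range(max_iterations):
        clusters = assign(points, centers)
        # New center = mean of its cluster; an empty cluster keeps its old center.
        updated = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[1.0, 12.0])
print(centers)
print(clusters)
```

On this toy input the procedure settles on centers 2.0 and 11.0 after one update, splitting the points into the two obvious groups.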


    Data Mining Bubble?

    Is data mining a lot of hammers looking for nails?


    Potential Data Mining Applications

    Business
    - Marketing and sales data analysis
    - Investment analysis
    - Loan approval
    - Fraud detection
    - etc.

    Manufacturing
    - Controlling and scheduling
    - Network management
    - Sensor monitoring
    - etc.

    Science
    - Gene analysis
    - Space image classification
    - Experiment result analysis
    - etc.

    Personal


    Data Mining is Spreading

    FINANCIAL INSTITUTIONS

    RETAIL INDUSTRY

    TELECOMMUNICATION INDUSTRY

    HEALTH INDUSTRY

    SCIENCE & ENGINEERING

    GOVERNMENT

    E-COMMERCE


    Data Mining: When & How?

    SOME HINTS FOR SUCCESS:

    Business or scientific needs are more important than the razzle-dazzle of a technical solution.

    Preparing data is as much as 80% of the mining process.

    Don't rely on a single methodology!

    Keep the end-users informed and involved: this is an interdisciplinary task.

    Data mining is an iterative and interactive process: be prepared to generate a lot of garbage until you hit something that is actionable, meaningful, and useful.