l1-data mining concept

Upload: mahmoud-al-qudah

Post on 12-Feb-2018


TRANSCRIPT

  • 7/23/2019 L1-Data Mining Concept

    1/21

    Data Mining Concept

    INTRODUCTION

    Modern science and engineering are based on first-principle models to describe physical, biological, and social systems.

    Such an approach starts with a basic scientific model, such as Newton's laws of motion or Maxwell's equations in electromagnetism, and then builds upon it various applications in mechanical engineering or electrical engineering.

    In this approach, experimental data are used to verify the underlying first-principle models and to estimate some of the parameters that are difficult or sometimes impossible to measure directly.

    However, in many domains the underlying first principles are unknown, or the systems under study are too complex to be mathematically formalized.

    In the digital age we live in, a lot of current [...] generates too much data.

    Someone may ask how we can utilize these data.

    There is currently a paradigm shift from modeling and analyses based on first principles to developing models and the corresponding analyses directly from data.

    The need to understand large, complex, information-rich data sets is common to virtually all fields of business, science, and engineering.

    The entire process of applying a computer-based methodology, including new techniques, for extracting useful knowledge from data is called data mining.

    Data mining is:

    an iterative process within which progress is defined by discovery, through either automatic or manual methods;

    most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome;

    the search for new, valuable, and nontrivial information in large volumes of data.

    Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.

    Data-Mining Activities

    1. Predictive data mining, which produces the model of the system described by the given data set.

    On the predictive end of the spectrum, the goal of data mining is to produce a model, expressed as executable code, which can be used to perform classification, prediction, estimation, or other similar tasks.

    2. Descriptive data mining, which produces new, nontrivial information based on the available data set.

    On the descriptive end of the spectrum, the goal is to gain an understanding of the analyzed system by uncovering patterns and relationships in large data sets.
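
    The contrast between the two ends of the spectrum can be sketched in a few lines. This is only an illustration: the toy data set, the function names, and the choice of a nearest-centroid classifier are assumptions made here, not part of the lecture. The descriptive function summarizes the data already available, while the predictive function returns an executable model that can be applied to unseen inputs.

```python
from statistics import mean

# Hypothetical toy data set: (hours_studied, passed) pairs.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]


def describe(rows):
    """Descriptive mining: return the mean feature value per class label."""
    labels = sorted({label for _, label in rows})
    return {
        label: mean([x for x, row_label in rows if row_label == label])
        for label in labels
    }


def fit_nearest_centroid(rows):
    """Predictive mining: return a classifier usable on unseen inputs."""
    centroids = describe(rows)

    def predict(x):
        # Assign the label whose class mean is closest to x.
        return min(centroids, key=lambda label: abs(x - centroids[label]))

    return predict


summary = describe(data)            # per-class means of the existing data
model = fit_nearest_centroid(data)  # executable model
prediction = model(5.5)             # label predicted for a new input
```

    Here the predictive model happens to reuse the descriptive summary, which is one reason the two ends are described as a spectrum rather than as separate activities.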

    Primary data-mining tasks:

    1. Classification: Discovery of a predictive learning function that classifies a data item into one of several predefined classes.

    2. Regression: Discovery of a predictive learning function that maps a data item to a real-value prediction variable.

    3. Clustering: A common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data.

    4. Summarization: An additional descriptive task that involves methods for finding a compact description for a set (or subset) of data.

    5. Dependency Modeling: Finding a local model that describes significant dependencies between variables or between the values of a feature in a data set or in a part of a data set.

    6. Change and Deviation Detection: Discovering the most significant changes in the data set.
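
    Two of the tasks listed above, regression and clustering, can be sketched as follows. The lecture does not name specific algorithms, so ordinary least squares and a one-dimensional k-means loop are used here as assumed stand-ins, and all data values and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kmeans_1d(values, centers, iterations=10):
    """Clustering: group 1-D values around the given initial centers."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to the index of its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
centers, groups = kmeans_1d([1.0, 1.5, 2.0, 9.0, 9.5, 10.0], [1.0, 10.0])
```

    The regression function maps a data item to a real-valued prediction, while the clustering function only describes the data it was given by a finite set of groups.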

    DATA-MINING PROCESS

    A sufficiently broad definition of data mining:

    Data mining is a process of discovering various models, summaries, and derived values from a given collection of data.

    There is a misconception about data mining: anyone can collect data and apply a computer-based tool that matches the problem. This is not true because:

    It is not simply a collection of isolated tools.

    It is an iterative process. Only very rarely is a question stated sufficiently precisely that a single and simple application of the method will suffice.

    1. State the problem and formulate the hypothesis.

    Most data-based modeling studies are performed in a particular application domain. Hence, domain-specific knowledge and experience are usually necessary in order to come up with a meaningful problem statement.

    Unfortunately, many application studies tend to focus on the data-mining technique at the expense of a clear problem statement.

    In this step, a modeler usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis.

    There may be several hypotheses formulated for a problem at this stage. The first step requires the expertise of an application domain and a data-mining model.

    2. Collect the data.

    This step is concerned with how the data are generated and collected. There are two distinct possibilities:

      1. Designed experiment: the data-generation process is under the control of an expert (modeler).

      2. Observational approach: the expert cannot influence the data-generation process.

    An observational setting, namely random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after data are collected, or it is partially and implicitly given in the data-collection procedure.

    It is important to make sure that the data used for estimating a model and the data used later for testing and applying a model come from the same unknown sampling distribution.
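
    One common way to approximate this requirement is to draw both the estimation set and the testing set at random from a single collected sample. The sketch below assumes the collected rows are interchangeable (no time ordering or grouping); the function name, the 25% test fraction, and the fixed seed are illustrative choices, not part of the lecture.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Randomly split rows so both parts come from the same collected sample."""
    rng = random.Random(seed)   # fixed seed makes the split reproducible
    shuffled = list(rows)       # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))          # stand-in for 20 collected observations
train, test = split_train_test(rows)
```

    A random split of this kind does not help if the data to which the model is later applied are generated differently from the collected sample; that has to be checked against the data-collection procedure itself.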

    3. Preprocess the data.

    Data preprocessing usually includes at least two common tasks:

      1. Outlier detection (and removal).

    Outliers are unusual data values that are not consistent with most observations. Commonly, outliers result from measurement errors or coding and recording errors, and sometimes they are natural, abnormal values. Such nonrepresentative samples can seriously affect the model produced later.
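
    A minimal sketch of outlier detection, assuming a single numeric feature: values are flagged when they lie far from the median, measured in units of the median absolute deviation (MAD). The lecture does not prescribe a method; the modified z-score rule, the constant 0.6745, and the threshold of 3.5 are a conventional rule of thumb adopted here as assumptions, and the readings are made up.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The median and the median absolute deviation (MAD) are used because
    they are much less distorted by outliers than the mean and the
    standard deviation.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


readings = [10.1, 9.8, 10.0, 10.3, 9.9, 55.0, 10.2]   # hypothetical measurements
outliers = find_outliers(readings)
cleaned = [v for v in readings if v not in outliers]
```

    Whether a flagged value should be removed is a separate decision: a measurement or recording error is a candidate for removal, whereas a natural abnormal value may be exactly what the study is looking for.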

      2. Scaling, encoding, and selecting features.

    For example, one feature with the range [0, 1] and the other with the range [-100, 1000] will not have the same weight in the applied technique; they will also influence the final data-mining results differently.
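
    One simple remedy is to rescale every feature to a common range before analysis. The sketch below uses min-max scaling to map a feature with the range [-100, 1000] onto [0, 1]; the function name and the sample values are illustrative assumptions, and other scalings (for example, standardization to zero mean and unit variance) are equally possible.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so the smallest maps to low and the largest to high."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]  # constant feature: no spread to rescale
    span = vmax - vmin
    return [low + (v - vmin) * (high - low) / span for v in values]


feature_b = [-100.0, 0.0, 450.0, 1000.0]   # hypothetical values in [-100, 1000]
scaled = min_max_scale(feature_b)          # now comparable with a [0, 1] feature
```

    After scaling, a distance-based technique no longer lets the wide-range feature dominate simply because of its units.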