Part I: Introductory Materials Introduction to Data Mining


Post on 25-Feb-2016


DESCRIPTION

Part I: Introductory Materials. Introduction to Data Mining. Dr. Nagiza F. Samatova, Department of Computer Science, North Carolina State University, and Computer Science and Mathematics Division, Oak Ridge National Laboratory. What is common among all of them? Data.

TRANSCRIPT

  • Part I: Introductory Materials. Introduction to Data Mining. Dr. Nagiza F. Samatova, Department of Computer Science, North Carolina State University, and Computer Science and Mathematics Division, Oak Ridge National Laboratory

  • What is common among all of them?

  • Who are the data producers? What data? (Application -> Data)

    Application category: Finance. Producer: Wall Street. Data: stocks, stock prices, stock purchases, ...

    Application category: Academia. Producer: NCSU. Data: student admission data (name, DOB, GRE scores, transcripts, GPA, university/school attended, recommendation letters, personal statement, etc.).

  • Application Categories: Finance (e.g., banks); Entertainment (e.g., games); Science (e.g., weather forecasting); Medicine (e.g., disease diagnostics); Cybersecurity (e.g., terrorists, identity theft); Commerce (e.g., e-Commerce)

  • What questions to ask about the data? (Data -> Questions)

    Academia: NCSU: Admission data

    Is there any correlation between students' GRE scores and their successful completion of a PhD program?

    What are the groups of students that share common academic performance?

    Are there any admitted students who would stand out as an anomaly? What type of anomaly is that?

    If a student majors in Physics, what other major is he/she likely to double-major in?

  • Questions by Types? Correlation, similarity, comparison; association, causality, co-occurrence; grouping, clustering; categorization, classification; frequency or rarity of occurrence; anomalous or normal objects, events, behaviors; forecasting: future classes, future activity, ...

  • What information do we need to answer? (Questions -> Data Objects and Object Features)

    Academia: NCSU: Admission data

    Objects: Students

    Object features = variables = attributes = dimensions, and their types:

    Name: String (e.g., Name=Neil Shah)

    GPA: Numeric (e.g., GPA=5.0)

    Recommendation: Text (e.g., "the top 2% in my career")

    Etc.

  • How to compare two objects? (Data Object -> Object Pairs)

    Academia: NCSU: Admission data

    Objects: Students

    Based on a single feature: similar GPA; the same first letter in the last name.

    Based on a set of features: similar academic records (GPA, GRE, etc.); similar demographic records.

    Can you compute a numerical value for the similarity measure used for comparison? Why or why not?

  • How to represent data mathematically? (Data Object & its Features -> Data Model)

    What mathematical objects have you studied? Scalars; points; vectors; vector spaces; matrices; sets; graphs, networks (maybe); tensors (maybe); time series (maybe); topological manifolds (maybe)

  • Data object as vector with components

    City=(Latitude, Longitude): a 2-dimensional object. Vector components: features, or attributes, or dimensions.

    Raleigh=(35.46, 78.39); Boston=(42.21, 71.5)

    Proximity(Raleigh, Boston)=? Candidates: geodesic distance; Euclidean distance; length of the interstate route.
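
The proximity question on this slide can be made concrete with a minimal sketch. It assumes the slide's pairs can be read as (latitude, longitude) in decimal degrees, approximates the geodesic by the great-circle (haversine) distance on a sphere, and uses a mean Earth radius of 6371 km. None of these choices come from the slides, and the function names are illustrative. Because both cities lie in the same hemispheres, the sign convention for longitude does not change the result.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not from the slides

def euclidean(p, q):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# Coordinates exactly as written on the slide.
raleigh = (35.46, 78.39)
boston = (42.21, 71.5)

print("Euclidean (in degrees):", euclidean(raleigh, boston))
print("Great-circle (in km):  ", haversine_km(raleigh, boston))
```

The two numbers are in different units and answer different questions, which is the point of the slide: the choice of proximity measure is part of the data model.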

  • A set of data objects as vector spaces. Example: a 3-dimensional vector space. Mining such data ~ studying vector spaces

  • Multi-dimensional vectors

    Student=(Name, GPA, Weight, Height, Income in K, ...): a multi-dimensional object. Vector components: features, or attributes, or dimensions.

    S1=(John Smith, 5.0, 180, 6.0, 200); S2=(Jane Doe, 3.0, 140, 5.4, 70)

    Proximity(S1, S2)=? How to compare when vector components are of heterogeneous types or on different scales? How to show the results of the comparison?
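
One possible answer to the scale question, sketched under stated assumptions: leave the string-valued Name out of the distance and divide each numeric difference by an assumed feature range before taking the Euclidean distance. The ranges below are hypothetical values chosen only for illustration, and range scaling is just one of several common normalizations, not a method prescribed by the slides.

```python
import math

# Numeric features only; the Name string is left out of the distance.
FEATURES = ("gpa", "weight", "height", "income_k")

s1 = {"name": "John Smith", "gpa": 5.0, "weight": 180, "height": 6.0, "income_k": 200}
s2 = {"name": "Jane Doe", "gpa": 3.0, "weight": 140, "height": 5.4, "income_k": 70}

# Hypothetical (min, max) ranges per feature, for illustration only.
ASSUMED_RANGES = {
    "gpa": (0.0, 5.0),
    "weight": (80.0, 300.0),
    "height": (4.0, 7.0),
    "income_k": (0.0, 500.0),
}

def raw_distance(a, b):
    """Euclidean distance on unscaled features; large-valued features dominate."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in FEATURES))

def scaled_distance(a, b, ranges=ASSUMED_RANGES):
    """Euclidean distance after dividing each difference by its feature range."""
    total = 0.0
    for f in FEATURES:
        lo, hi = ranges[f]
        total += ((a[f] - b[f]) / (hi - lo)) ** 2
    return math.sqrt(total)

print("raw:   ", raw_distance(s1, s2))
print("scaled:", scaled_distance(s1, s2))
```

In the raw distance the income and weight differences swamp the GPA difference; after scaling, each feature contributes on a comparable footing.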

  • ... as matrices

    Example: a collection of text documents on the Web. Original documents -> parsed documents -> t-d (term-document) matrix. Terms = features = dimensions. Mining such data ~ studying matrices.

    Original documents:

    D1: Child Safety at Home

    D2: Infant & Toddler First Aid

    D3: Your Baby's Health and Safety: From Infant to Toddler

    Parsed documents:

    D1: Child Safety Home

    D2: Infant Toddler

    D3: Bab Health Safety Infant Toddler

    Term-document matrix:

                    D1  D2  D3
    T1: Bab          0   0   1
    T2: Child        1   0   0
    T3: Health       0   0   1
    T4: Home         1   0   0
    T5: Infant       0   1   1
    T6: Safety       1   0   1
    T7: Toddler      0   1   1
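
The step from parsed documents to a term-document matrix can be sketched as follows. This is a minimal illustration that starts from the parsed documents D1 to D3 listed on this slide; the function name is hypothetical, and the parsing itself (stop-word removal, stemming "Baby's" to "Bab") is taken as given rather than reimplemented.

```python
# Parsed documents as listed on the slide.
parsed_docs = {
    "D1": ["Child", "Safety", "Home"],
    "D2": ["Infant", "Toddler"],
    "D3": ["Bab", "Health", "Safety", "Infant", "Toddler"],
}

def term_document_matrix(docs):
    """Return (terms, doc_ids, matrix); matrix[i][j] is 1 if term i occurs in document j."""
    terms = sorted({t for words in docs.values() for t in words})
    doc_ids = list(docs)
    matrix = [[1 if t in docs[d] else 0 for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix

terms, doc_ids, matrix = term_document_matrix(parsed_docs)
for term, row in zip(terms, matrix):
    print(f"{term:8s}", row)
```

Sorting the terms alphabetically reproduces the T1 to T7 ordering used on the slide, so each printed row matches a row of the slide's matrix.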

  • ... or as trees

    t-d term-document matrix -> tree (labels on the figure: document, terms). Term groups shown on the figure:

    president government party election political elected national districts held district independence vice minister parties

    population area climate city miles province land topography total season 1999 square rate

    economy million products 1996 growth copra economic 1997 food scale exports rice fish

    Is D2 similar to D3? What if there are 10,000 terms? Mining such data ~ studying trees
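
One way to put a number on "Is D2 similar to D3?" is cosine similarity between document columns of the term-document matrix. Cosine similarity is a common choice for text vectors but is an assumption here; the slides do not name a measure. The vectors below are the D1 to D3 columns of the matrix given earlier, in the term order Bab, Child, Health, Home, Infant, Safety, Toddler.

```python
import math

# Columns of the binary term-document matrix from the slides.
# Term order: Bab, Child, Health, Home, Infant, Safety, Toddler
docs = {
    "D1": [0, 1, 0, 1, 0, 1, 0],
    "D2": [0, 0, 0, 0, 1, 0, 1],
    "D3": [1, 0, 1, 0, 1, 1, 1],
}

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

for a, b in (("D1", "D2"), ("D1", "D3"), ("D2", "D3")):
    print(a, b, round(cosine(docs[a], docs[b]), 3))
```

Under this measure D2 and D3 come out as the most similar pair (they share Infant and Toddler), while D1 and D2 share no terms at all. With 10,000 terms the same function still works, though a sparse representation would likely be preferable to dense lists.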

  • Or as networks, or graphs w/ nodes & links

    Term groups shown on the figure:

    population area climate city miles province land topography total season 1999 square rate

    president government party election political elected national districts held district independence vice minister parties

    economy million products 1996 growth copra economic 1997 food scale exports rice fish

    Nodes = documents. Links = document similarity (e.g., if a document references another document). Mining such data ~ studying graphs, or graph mining
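
A minimal sketch of the nodes-and-links idea, reusing the three parsed documents from the term-document example. The linking rule (two documents are linked when they share at least `min_shared` terms) is an assumed stand-in for "document similarity", and the function name is hypothetical.

```python
from itertools import combinations

# Parsed documents from the term-document example, as term sets.
parsed_docs = {
    "D1": {"Child", "Safety", "Home"},
    "D2": {"Infant", "Toddler"},
    "D3": {"Bab", "Health", "Safety", "Infant", "Toddler"},
}

def similarity_graph(docs, min_shared=1):
    """Adjacency sets: link two documents if they share at least min_shared terms."""
    graph = {d: set() for d in docs}
    for a, b in combinations(docs, 2):
        if len(docs[a] & docs[b]) >= min_shared:
            graph[a].add(b)
            graph[b].add(a)
    return graph

graph = similarity_graph(parsed_docs)
for node in graph:
    print(node, "->", sorted(graph[node]))
```

Raising `min_shared` to 2 keeps only the D2-D3 link, which shows how the choice of similarity threshold shapes the graph that is later mined.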

  • What apps naturally deal w/ graphs? (Credit: images are from Google Images via keyword search.)

  • What questions to ask about graph data? (Graph Data -> Graph Mining Questions)

    Academia: NCSU: Admission data

    Nodes = students; links = similar academics/demographics

    How many distinct academically performing groups of students are admitted to NCSU?

    Which academic group is the largest?

    Given a new student applicant, can we predict which academic group the student will likely belong to?

    Do groups of students with similar demographics usually share similar academic performance?

    Over the last decade, has the diversity in demographics of accepted student groups increased or decreased?

  • Recap: Data Mining and Graph Mining

    Application -> Data -> Questions -> Data Objects + Features -> Mathematical Data Representation (Data Model)

    Data models: vectors; matrices; graphs; time series; tensors; sets; manifolds

    One hat does not fit all. More than one model is needed. Models are related.

  • How much data? 1 TB (terabyte) = 10^12 bytes; 1 PB (petabyte) = 10^15 bytes. My laptop: 60 GB (1 GB (gigabyte) = 10^9 bytes)

  • It is not just the Size

  • Data Describes Complex Patterns/Phenomena. How to untangle the riddles of the complexity?

  • Connecting the Dots

    Sheer volume of data. Climate: now 20-40 terabytes/year; in 5 years, 5-10 petabytes/year. Fusion: now 100 megabytes/15 min; in 5 years, 1000 megabytes/2 min.

    Advanced math + algorithms. Huge dimensional space; combinatorial challenge; complicated by noisy data; requires high-performance computers.

    Providing predictive understanding. Produce bioenergy; stabilize CO2; clean toxic waste.

    Stages named on the slide: understanding the dots; finding the dots; connecting the dots.

  • Why Would Data Mining Matter? It enables solving many large-scale data problems

  • How to Move and Access the Data?

    Technology trends are a rate-limiting factor. Most of these data will NEVER be touched! (J. W. Toigo, "Avoiding a Data Crunch," Scientific American, May 2000)

    Data are naturally distributed but effectively immovable; streaming/dynamic but not re-computable.

    Data doubles every 9 months; CPU every 18 months.
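
The doubling rates quoted on this slide imply a gap that itself grows exponentially. A small sketch, assuming clean exponential growth at exactly the stated rates (an idealization; the function names are illustrative):

```python
def growth_factor(months, doubling_months):
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_months)

def data_to_cpu_gap(months, data_doubling=9, cpu_doubling=18):
    """How much faster data volume has grown than CPU capability after `months`."""
    return growth_factor(months, data_doubling) / growth_factor(months, cpu_doubling)

for months in (18, 36, 72):
    print(months, "months: data outgrows CPU by a factor of", data_to_cpu_gap(months))
```

After 18 months data has grown 4x while CPU capability has grown 2x, so the gap is a factor of 2, and it doubles again every further 18 months.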

  • How to Make Sense of Data? Know Your Limits & Be Smart

    To see 1 percent of a petabyte at 10 megabytes per second takes 35 8-hour days!

    Not humanly possible to browse a petabyte of data. Analysis must reduce data to quantities of interest.

    Ultrascale computations: must be smart about which probe combinations to see!

    Physical experiments: must be smart about probe placement!
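
The "35 8-hour days" figure can be checked with a few lines of arithmetic. The sketch below uses decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes), consistent with the unit definitions elsewhere in the deck; the function name is illustrative.

```python
PETABYTE = 10 ** 15  # bytes
MEGABYTE = 10 ** 6   # bytes

def browse_days(fraction, total_bytes=PETABYTE,
                rate_bytes_per_s=10 * MEGABYTE, hours_per_day=8):
    """Working days needed to read `fraction` of the data at a fixed rate."""
    seconds = fraction * total_bytes / rate_bytes_per_s
    return seconds / 3600 / hours_per_day

print(round(browse_days(0.01), 1))  # 34.7, i.e. about 35 eight-hour days
```

One percent of a petabyte is 10^13 bytes, which at 10^7 bytes per second takes 10^6 seconds, or roughly 278 hours.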

  • What Analysis Algorithms to Use?

    Even a simple big-O analysis can elucidate simplicity. If n = 10 GB, then what is the running time of an O(n) or O(n^2) algorithm on a teraflop computer? 1 GB = 10^9 bytes; 1 Tflop = 10^12 op/sec
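
A back-of-the-envelope answer to the big-O question, assuming exactly one operation per byte for the linear case, one per pair of bytes for the quadratic case, and a machine that sustains its full 10^12 operations per second. Real constant factors and sustained rates will differ; the variable names are illustrative.

```python
GB = 10 ** 9       # bytes
TFLOP = 10 ** 12   # operations per second
SECONDS_PER_YEAR = 365 * 24 * 3600

n = 10 * GB

linear_seconds = n / TFLOP          # O(n): 10^10 operations
quadratic_seconds = n ** 2 / TFLOP  # O(n^2): 10^20 operations

print("O(n):  ", linear_seconds, "seconds")
print("O(n^2):", quadratic_seconds, "seconds, or about",
      round(quadratic_seconds / SECONDS_PER_YEAR, 2), "years")
```

Under these assumptions the linear algorithm finishes in a hundredth of a second, while the quadratic one needs 10^8 seconds, a little over three years, which is why algorithmic complexity matters more than raw machine speed at this scale.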

    To determine whether any simple rules exist which may be used to describe the low-dimensional behavior of complex systems. (Special issue on complex systems: Science, Vol. 284, No. 5411 (1999))

    "The purpose of computing is insight, not numbers." No single discipline can independently facilitate the process of elucidating simplicity from complexity. Hence, the DOE Office of Science conducts advanced research at the junction of many disciplines, including computer science, mathematics, high-performance computing, and an array of application disciplines such as physics, chemistry, biology, and environmental and climate science. Our mission is to advance multidisciplinary research to enable application scientists from many different domains to find the dots, connect the dots, and harness the intricacy of Nature (or understand the dots) in their data.

    As one program manager put it: "Everyone knows about looking for that proverbial needle in a haystack. But imagine a million, million haystacks. Now imagine a million million planets, each containing a million million haystacks. And they all change over time and may interact in ways you don't understand. And you have no idea what a needle looks like or if there really is one in there."

    For example, biological systems, such as cells, are inherently complex. This complexity arises from the selective and nonlinear interconnections of functionally diverse components to produce coherent behavior. Computational modeling and simulation that reproduce and predict such behavior form the Holy Grail of systems biology. The key challenge is to reveal underlying simplicity from biological complexity. The fundamental rules (simplicity) that quantify the low-dimensional behavior of biological systems are yet to be discovered. A promising approach aims to interrelate emerging disparate and noisy "omics" observations by relying on mathematics, computer science, information technology, and computing. This will require a systematic framework for finding informative features (identifying the dots) and linking them (connecting the dots) to formulate fundamental principles governing a complex behavior (understanding the dots). This framework is essential for progress toward a predictive understanding of biological systems.

    We all understand that it is not humanly possible to sort through and understand a petabyte of data. Researchers need to be smart about taking data and ordering data, but they also need advanced tools to help them make sense of their data.

    Scientists need mechanisms to efficiently and effectively search for such patterns in order to better understand and control the intricate laws of Nature.
