information infrastructure ii (data mining)predrag/classes/2010springi211/week1_w… · late...

29
Information Infrastructure II (Data Mining) I211 Spring 2010

Upload: dangkien

Post on 30-Apr-2018

217 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Information Infrastructure II(Data Mining)

I211

Spring 2010

Page 2: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Basic InformationBasic Information

Class meets:Ti MW 9 30 10 45Time: MW 9:30am – 10:45amPlace: I2 130

Instructor:Instructor:Predrag RadivojacOffice: Informatics 219Email: [email protected] b i f ti i di d / dWeb: www.informatics.indiana.edu/predrag

Office Hours:Time: MW 2:00pm-3:30pmTime: MW 2:00pm 3:30pmPlace: Informatics 219

Course Web Site:h // i f i i di d / d / l /2010 i i211/2010 i i211 hhttp://www.informatics.indiana.edu/predrag/classes/2010springi211/2010springi211.htm

Page 3: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Basic InformationBasic Information

Associate Instructors:

Michael ConoverLab Section: 8180/8181 (Office hours TBA)Email: [email protected]: TBATime: Check his profile at Informatics web site

Rajeswari SwaminathanjLab Section: 8181/8180 (Office hours TBA) Email: [email protected] Office: TBATime: Check her profile at Informatics web site

Page 4: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Textbook InformationTextbook Information

Good readings for programming:g p g g• MATLAB primer - by Timothy Davis and Kermit

Sigmon (freely accessible at IUCAT)

St l d d diStrongly recommended readings:• Introduction to Data Mining - by Pang-Ning

Tan, Michael Steinbach, and Vipin Kumar

Supplementary material will be provided in class!

Page 5: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Also good readings...Also good readings...

• Data Mining: Concepts and Techniques - by Jiawei Han and Micheline KamberJiawei Han and Micheline Kamber

• Data Mining: Practical Machine Learning Tools and Techniques - by Ian Witten and Eibe Frankand Techniques by Ian Witten and Eibe Frank

• Principles of Data Mining - by David Hand, Heikki Manilla, and Padhraic Smyth (freely accessible at IUCAT)accessible at IUCAT)

Page 6: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Lecture SlidesLecture Slides

I d i D Mi i b P Ni• Introduction to Data Mining - by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar

• Data Mining: Concepts and Techniques - by Jiawei Han and Micheline Kamber

• Summary: Our own slides + some mix from the• Summary: Our own slides + some mix from the slides for the books above

Page 7: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Overview of the CourseOverview of the Course

See online syllabus…

introduction to Matlab and Matlab programmingintroduction to data miningintroduction to data miningdata representation and data preprocessingdata visualizationconcepts from linear algebraconcepts from linear algebraconcepts from probability and information theorymining association rulesclassification and regression methodsgclustering techniquescase studies on various types of dataand more... (how much? we’ll see!)

Page 8: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Goal of the Course!!!Goal of the Course!!!

• This course is designed to introduce basic concepts of Data Mining and provide hands-on experience to data analysis, clustering, and prediction by using Matlab.analysis, clustering, and prediction by using Matlab.

• The students will be expected to develop a basic d t di f D t Mi i d d l kill t lunderstanding of Data Mining and develop skills to solve

practical problems.

• Data Mining is a practical discipline that aims to identify interesting new relationships and patterns hidden in numerous databases and real life.numerous databases and real life.

Page 9: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Hidden Goals of the CourseHidden Goals of the Course

• To appreciate fundamental mathematical concepts

T i t b t ti d t t b f id f it• To appreciate abstraction and to not be afraid of it

• To be able to understand how to transfer solutions fromTo be able to understand how to transfer solutions from one set of problems to another set of problems (through abstraction)

• To learn to be a problem solver

• To recruit undergraduate researchers

Page 10: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

What do I assume?

• You have taken I210 and you can program in C, Java, or Python…Python…

• You have basic mathematical skills (e.g. calculus)

• You are patient

What would I like to see?

• You are motivated to learn

• You are motivated to succeed in this class

Page 11: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Grading Policy and AnnouncementsGrading Policy and Announcements

• Midterm exam (1): 20%• Final exam (1): 20%

H k i t (6 7) 40%• Homework assignments (6-7): 40%• Class participation and quizzes (4): 20%____________________________________

• Midterm exam – week 7 (in class), maybe week 8• Final exam – May 5 (8:00am)• Spring break – March 15-19• MLK Jr day – January 18 (no classes!)MLK Jr. day January 18 (no classes!)

Page 12: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Late Assignment Policy and A d i H tAcademic Honesty

Th h k i t d th ifi d d d t• The homework assignments are due on the specified due date through Oncourse

• Late assignments will be accepted (unless there are legitimate i ) i h f ll i lcircumstances) using the following rules– points (on time)– points x 0.9 (1 day late)– points x 0 7 (2 days late)

} recommended!

points x 0.7 (2 days late)– points x 0.5 (3 days late)– points x 0.3 (4 days late)– points x 0.1 (5 days late)

0 ( ft 5 d )

not recommended!

– 0 (after 5 days)

• All assignments are individual!!!

• All the sources used for problem solution must be acknowledged (people, web sites, books, etc.)

Page 13: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Late Assignment Policy and A d i H tAcademic Honesty

G di t f i th l ill A l ill b• Grading: top performers in the class will earn A, class average will be about B

• Distributions of scores will be generated after assignments; regularlyDistributions of scores will be generated after assignments; regularly

• You will know where you stand in the class, if you don’t - ask instructor

D t t l t I’ W’• Do not expect late I’s or W’s

• Read Code of Student Rights, Responsibilities, and Conduct !!!http://www indiana edu/~code/– http://www.indiana.edu/~code/

– Many interesting things there, including that… Students are responsible to “Facilitate the learning environment and the process of learning, including attending class regularly completing class assignments and coming to class prepared”class regularly, completing class assignments, and coming to class prepared .

– Students ≠ Customers (not for instructors, yes for administration)

Page 14: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Some Specifics for I211Some Specifics for I211

D t d i t t ( )• Do not record instructor(s)

• Turn off cell phones smart phones and other similar devices during• Turn off cell phones, smart phones, and other similar devices during class

• Use laptops only if you have to

• Be considerate

Page 15: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

OK Let’s Start!OK, Let s Start!

Page 16: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Why mine data? Commercial viewpoint.

• Lots of data is being collected– web data, e-commerce

h t d t t/– purchases at department/grocery stores

– bank/credit card transactionsdi l d t bi l i l d t– medical data, biological data

– streaming data– agricultural data

• Computers have become cheaper and more powerfulComputers have become cheaper and more powerful

• Competitive pressure is strongCompetitive pressure is strong – provide better, customized services for an edge

Page 17: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Why mine data? Scientific viewpoint.

• Data collected and stored at enormous speeds (GB/hour)– remote sensors on a satellite

– telescopes scanning the skies

microarrays generating gene– microarrays generating gene expression data

– scientific simulations generating terabytes of data

f f• Traditional techniques infeasible for raw data

• Data mining may help scientists• Data mining may help scientists– in classifying and segmenting data– in hypothesis generation

Page 18: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Mining large data sets - Motivation• There is often information “hidden” in the data that is

not readily evident• Human analysts may take weeks to discover useful• Human analysts may take weeks to discover useful

information• Much of the data is never analyzed

3,000,000

3,500,000

4,000,000

The Data Gap

1 500 000

2,000,000

2,500,000The Data Gap

Total new disk space (TB) since 1995

500,000

1,000,000

1,500,000

Number of analysts

01995 1996 1997 1998 1999

y

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

Page 19: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

More on the data explosion...

• We’re drowning in information and starving for knowledge.

Rutherford D Rogers, librarian, Yale

(NY Times 25 Feb 85)

In Data Mining we typically say• We are drowning in data, but starving for knowledge!

We needA t t d l i f i d t t• Automated analysis of massive data sets

Page 20: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

What is Data Mining?What is Data Mining?

• Alternative name: Knowledge Discovery from Data (KDD)g y ( )• Many definitions

N i i l i f i li i i l k d i ll– Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Exploration & analysis by automatic or semi-automatic means of– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

• Why Data Mining name?Why Data Mining name?– Gold mining is looking for gold, correct?– A few interpretations

Page 21: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Some (not so useful) patterns...Some (not so useful) patterns...

• “rules” for American presidents (before 2004 elections)• rules for American presidents (before 2004 elections)

Page 22: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Some (not so useful) patterns...Some (not so useful) patterns...

• “rules” for American presidents (before 2004 elections)• rules for American presidents (before 2004 elections)

– if the Washington Redskins win their last home game before the election the incumbent’s party willgame before the election, the incumbent s party will be re-elected

Page 23: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Some (not so useful) patterns...Some (not so useful) patterns...

• “rules” for American presidents (before 2004 elections)• rules for American presidents (before 2004 elections)

– if the Washington Redskins win their last home game before the election the incumbent’s party willgame before the election, the incumbent s party will be re-elected

– no Republican has ever won a presidential election p pwithout carrying Ohio

Page 24: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Some (not so useful) patterns...Some (not so useful) patterns...

• “rules” for American presidents (before 2004 elections)• rules for American presidents (before 2004 elections)

– if the Washington Redskins win their last home game before the election the incumbent’s party willgame before the election, the incumbent s party will be re-elected

– no Republican has ever won a presidential election p pwithout carrying Ohio

– no incumbent with a four-letter last name has ever been re elected (Polk Taft Ford B sh Sr )been re-elected (Polk, Taft, Ford, Bush Sr.)

Page 25: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Some (not so useful) patterns...Some (not so useful) patterns...

• “rules” for American presidents (before 2004 elections)• rules for American presidents (before 2004 elections)

– if the Washington Redskins win their last home game before the election the incumbent’s party willgame before the election, the incumbent s party will be re-elected

– no Republican has ever won a presidential election p pwithout carrying Ohio

– no incumbent with a four-letter last name has ever been re elected (Polk Taft Ford B sh Sr )been re-elected (Polk, Taft, Ford, Bush Sr.)

– Americans won’t unseat a wartime President

Page 26: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

What is NOT Data Mining?

YES Data Mining– Certain names are more

NOT Data Mining– Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)

– look up phone number in phone directory– query a Web search engine )

– Group together similar documents returned by search engine according to their context

for information about “Amazon”– SQL query processing g g

(e.g. Amazon rainforest, Amazon.com, Amazons – all female warriors in Greek

th l )

– statistical tools

– data visualization?mythology)– data warehousing?

Page 27: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Data Mining: confluence of disciplines

DatabaseDatabase Technology Statistics

Data MiningMachineLearning

VisualizationgLearning

PatternPatternRecognition

AlgorithmsOther

Disciplines

Page 28: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Data Mining and Business Intelligence

Increasing potentialto supportto supportbusiness decisions End User

Decision Making

BusinessAnalyst

D t

Data PresentationVisualization Techniques

Data Mining DataAnalyst

Data MiningInformation Discovery

Data Exploration

DBA

Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data WarehousesDBAData Sources

Paper, Files, Web documents, Scientific experiments, Database Systems

Page 29: Information Infrastructure II (Data Mining)predrag/classes/2010springi211/week1_w… · Late Assignment Policy and Ad iH tAcademic Honesty • Th h k i t d th ifi d d d tThe homework

Data Mining tasksData Mining tasks

• Predictive methods• Predictive methods– Use some variables to predict unknown or future values of

other variables

• Descriptive methodsFind human interpretable patterns that describe the data– Find human-interpretable patterns that describe the data

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996