data mining with sql server 2008
DESCRIPTION
What is Data Mining? Why? Discovering relationships Predict future events Usage scenarios AlgorithmsTRANSCRIPT
SQL Server 2008 for Business IntelligenceUTS Short Course
Specializes in
C# and .NET (Java not anymore)
TestingAutomated tests
Agile, ScrumCertified Scrum Trainer
Technology aficionado • Silverlight• ASP.NET• Windows Forms
Peter Gfader
Attendance You initial sheet
Hands On Lab You get me to initial sheet
Homework
Certificate At end of 5 sessions If I say if you have completed successfully
Admin Stuff
Course Timetable & Materials
http://www.ssw.com.au/ssw/Events/2010UTSSQL/
Resources
http://sharepoint.ssw.com.au/Training/UTSSQL/
Course Website
Course OverviewSession
Date Time Topic
1Tuesday14-09-2010
18:00 - 21:00
SSIS and Creating a Data Warehouse
2Tuesday21-09-2010
18:00 - 21:00
OLAP – Creating Cubes and Cube Issues
3Tuesday28-09-2010
18:00 - 21:00 Reporting Services
4Tuesday05-10-2010
18:00 - 21:00
Alternative Cube Browsers
5 Tuesday12-10-2010
18:00 - 21:00 Data Mining
1. Other cube browsers Microsoft Data Analyzer Proclarity Excel 2003/2007/2010 Excel services Thinslicer Performance Point Power Pivot
Last week(s)
1. Create report on top of Northwind
Top 10 customers (Table)Top 10 products (Table)Top 10 employees (Table)1 chart that shows the top 10 customers1 usage of the gauge control (surprise me)
Homework
The plan
1. Create Data Warehouse
2. Copy data to data warehouse
3. Create OLAP Cubes
4. Create Reports
5. Browse the cube
6. Do some Data Mining Discovering relationships Predict future events
Step by step to BI
1. What is Data Mining?
2. Why?
3. Uses
4. Algorithms
5. Demo
6. Hands on Lab
Agenda
What is Data Mining?
“Data mining is the use of powerful software tools
to discover significant traits or relationships, from databases or data warehouses and
often used to predict future events”
What is Data Mining?
It exploits statistical algorithms
Once the “knowledge” is extracted it:
Can be used to discover Can be used to predict values of other cases
Marketing Who picks the movie? The kids, the wife, me Who are our Customers and what sort of films do
they hire? Is a 30 year old woman with 2 children going to hire
Arnie’s latest film Validation
Is this data sensible? Terminator 2 and Toy Story Prediction
Sales Next Year
Why Data Mining?
1. Get new information from data, future trends, past trends, outlier, maximums, minimums
2. Analyse data from different perspectives and summarizing it into useful information
3. New information to
increase revenuecuts costsor both :-)
Why? Its all about money
Who are our biggest customers?
What are customers buying with cigars?
What are the customer retention levels of our branches?
Which customers have bought olives, feta cheese but no ciabatta bread?
Which regions have the highest male/female ratio of single 20 somethings?
Which region has lowest customer retention levels and list out lost customers?
Which Questions are Data Mining?
Ad hoc query
Drill through to details
Business Intelligence tool
What’s not data mining
Huge amount of data
Good raw material good data mining
Samples should be representative
Samples "similar" to domain
Not all-seeing crystal ball
Verify and Validate!
Data - Uncover patterns in samples
OLAP Is about fast ad hoc querying Analysis by dimensions and measures Gives precise answers
Data Mining May use RDBMS or OLAP source Is about discovering and predicting Gives imprecise answers
OLAP is not a prerequisite for data mining, but it almost always comes first
OLAP versus Data Mining
(learning to ride a bike before a car)
Classification algorithms
predict one or more discrete variables, based on the other attributes in the dataset
Regression algorithms
predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset
Segmentation algorithms
divide data into groups, or clusters, of items that have similar properties
Association algorithms
find correlations between different attributes in a dataset
Sequence analysis algorithms
summarize frequent sequences or episodes in data, such as a Web path flow
Types of Data Mining Algorithms
Complete Set Of AlgorithmsWays to analyze your data
Decision Trees Clustering Time Series
Neural Network Association Naïve Bayes
Linear Regression Logistic Regression
Sequence Clustering
Split data
Each of branch is like an attribute
Brightness = amount of data
Decision trees
Decision Trees assign (classify) each case to one of a few (discrete) broad categories of selected attribute (variable) and explains the classification with few selected input variables
The process of building is recursive partitioning – splitting data into partitions and then splitting it up more
Initially all cases are in one big box
Decision Trees (1)
The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions data to the purest classes of the searched variable Several measures of purity
Then it repeats splitting for each new class Again testing all possible breaks
Unuseful branches of the tree can be pre-pruned or post-pruned
Decision Trees (2)
Decision trees are used for classification and prediction
Typical questions:
Predict which customers will leave Help in mailing and promotion campaigns Explain reasons for a decision What are the movies young female customers like
to buy?
Decision Trees (3)
Decision Trees – Who Decides
Bayes Formula
Uses statistics to say falls into certain category or not with probability
Spam filtering: score of spam (Bayes)
Testing only a particular attribute
Naïve Bayes
Quickly builds mining models that can be used for classification and prediction
It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute This can later be used to predict an outcome
of the predicted attribute based on the known input attributes
This makes the model a good option for exploring the data
Naïve Bayes
Grouping data into clusters Objects within a cluster have high similarity based on the
attribute values
The class label of each object is not known
Several techniques Partitioning methods Hierarchical methods Density based methods Model based methods And more…
Cluster Analysis (1)
Segments a heterogeneous population into a number of more homogenous subgroups or clusters
Some typical questions:
Discover distinct groups of customers Identification of groups of houses in a city In biology, derive animal and plant taxonomies Find outliers
Cluster Analysis (2)
Clustering
Age
Annual
Income
Time series
Timebased data prediction
Sequence clustering
Numbers orders stronger associations
Direction of association (not necessary the other direction)
If you own certain stocks ' you own maybe other ones as well
Probability = thickness of line
Association
Let system learn how to classify data
Neural Network adapts to the new data
Formulate statement/hypothesis
Outcome is know
(Data / Surveys)
1. 70% data to train network (outcome is known) 2. 30% of data to test network (outcome is known) 3. New data (no survey needed, predict from network)
Other example: OCR
Neural Nets
Conclusion: When To Use What
Task Microsoft algorithms to use
Predicting a discrete attribute. For example, predict whether the recipient of a targeted mailing campaign will buy a product.
Microsoft Decision Trees Algorithm Microsoft Naive Bayes Algorithm Microsoft Clustering Algorithm Microsoft Neural Network Algorithm
Predicting a continuous attribute. For example, forecast next year's sales.
Microsoft Decision Trees Algorithm Microsoft Time Series Algorithm
Predicting a sequence. For example, perform a clickstream analysis of a company's Web site.
Microsoft Sequence Clustering Algorithm
Finding groups of common items in transactions. For example, use market basket analysis to suggest additional products to a customer for purchase.
Microsoft Association Algorithm Microsoft Decision Trees Algorithm
Finding groups of similar items. For example, segment demographic data into groups to better understand the relationships between attributes.
Microsoft Clustering Algorithm Microsoft Sequence Clustering Algorithm
Visual Numerics
3rd party algorithms
http://www.vni.com/company/whitepapers/ MicrosoftBIwithNumericalLibraries.pdf
There is more...
Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007
http://www.microsoft.com/downloads/en/details.aspx?familyid=896A493A-2502-4795-94AE-E00632BA6DE7&displaylang=en
Excel Data Mining
Train station / airport
Who is the bad guy
Farmers
Find the best crops
Supermarket
Find to figure out how to get you to buy more, where the expensive items
Other usages of data miningFind patterns - Profiling
SSIS 2008 - Data profiling task
Get a profile of the data in a table
potential candidate keys length of data values in columns Null percentage of rows distribution of values ....
Tip
Video: Simple data mining model
http://www.sqlservercentral.com/articles/Video/65055/
Video: Data mining and Reporting Services
http://www.sqlservercentral.com/articles/Video/64190/
Data Mining Algorithms
http://msdn.microsoft.com/en-us/library/ms175595.aspx
Resources 1
Jamie MacLennan
http://blogs.msdn.com/b/jamiemac/
Richard Lees on BI
http://richardlees.blogspot.com/
Book Data Mining with Microsoft SQL Server 2008
http://www.amazon.com/gp/product/0470277742?ie=UTF8&tag=sqlserverda09-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0470277742
Resources 2
Why Data Mining? Uses Algorithms Demo Hands on Lab
Summary
Thank You!
Gateway Court Suite 10 81 - 91 Military Road Neutral Bay, Sydney NSW 2089 AUSTRALIA
ABN: 21 069 371 900
Phone: + 61 2 9953 3000 Fax: + 61 2 9953 3105