business intelligence session
DESCRIPTION
What is Business Intelligence and its uses as a Business ManagerTRANSCRIPT
Business Intelligence
What is Business Intelligence (BI)
• Business Intelligence (BI) refers to skills, processes, technologies, applications and practices used to support decision making.
• Systems that provide directed background data and reporting tools to support and improve the decision-making process.
• A popularized, umbrella term used to describe a set of concepts and methods to improve business decision making by using fact-based support systems. The term is sometimes used interchangeably with briefing books and executive information systems.
• Business Intelligence is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help clients make better business decisions.
• A system that collects, integrates, analyses and presents business information to support better business decision making.
• Business Intelligence is an environment in which business users receive information that is reliable, secure, consistent, understandable, easily manipulated and timely...facilitating more informed decision making
What is Business Intelligence (BI)
Improving organizations by providing business insights to all employees leading to better, faster, more relevant decisions
What is Business Intelligence (BI)
Business Intelligence enables the business to make intelligent, fact-based decision
Aggregate Data
Database, Data Mart, Data Warehouse, ETL Tools,
Integration Tools
Present Data
EnrichData
Inform a Decision
Reporting Tools, Dashboards, Static
Reports, Mobile Reporting, OLAP Cubes
Add Context to Create Information, Descriptive Statistics, Benchmarks, Variance to Plan or LY
Decisions are Fact-based and Data-driven
Content
The business determines the “what”, BI enables the “how”
Performance
Minimize report creation and collection times (near zero)
Usability
Delivery Method Push vs Pull
Medium Excel, PDF, Dashboard, Cube, Mobile Device
Enhance Digestion Fewer clicks
Tell a Story Trend, Context, Related Metrics, Multiple Views
CPU – Content, Performance, Usability
Why is Business Intelligence So Important?
Time
With Business Intelligence, we can get data to you in a timely manner.
Making Business Decisions is a Balance
Data Opinion
(aka Best Professional Judgment)
In the absence of data, business decisions are often made by the HiPPO.
Core Capabilities of BI
Why do companies need BI?
Tactical / Strategic BI
What’s the best that can happen?
What will happen next?
What if these trends continue?
Why is this happening?
What actions are needed?
Where exactly is the problem?
How many, how often, where?
What happened?
Sophistication of Intelligence
Operational BI
Optimization
Predictive Modeling
Forecasting/extrapolation
Statistical analysis
Alerts
Query/drill down
Ad hoc reports
Standard reportsCo
mp
etit
ive
Ad
van
tag
e
Benefits of Business Intelligence
• Improve Management Processes– planning, controlling, measuring and/or changing resulting in increased revenues and reduced costs
• Improve Operational Processes– fraud detection, order processing, purchasing.. resulting in increased revenues and reduced costs
• Predict the Future
BI Golden Rules
• Data Quality & Accuracy
• Data Consistency
• Data Timeliness
“Get the right information to the right people at the right time”
Gartner BI Maturity Model
Business Intelligence
Reporting
Business Intelligence
Data Mining
Knowledge Management
Expert Systems
Reporting ToolsSoftware that reads data
Processes data by:FilteringSortingGroupingSimple calculations
Produces formatted summaries of data
OLAPOn-Line Analytical Processing
Consolidation
Drill-Down
Slicing and Dicing
Drill Down
Slice and Dice
OLAP ExampleReport A
Report B
Data WarehouseLarge Database
Subject-Oriented
Integrated
Time-Variant
Nonvolatile
User-Friendly Interface
Data Warehouse System
Oper-ational DB
OtherDB
ExternalDB
DataWare-house
Reporting
Data Mining
KM
Expert
Data MartsMini data warehouses
Hold subsets of data from the data warehouse
Data focuses on a specific aspect of the company
Data Mining ToolsSoftware that searches through data
Uses complex statistical calculations
OutputsTrendsPatternsCorrelationsExceptions
Knowledge Management• Process
• Creating Value from Intellectual Property
• Sharing Knowledge with Others
Expert SystemsKnowledge captured as rules.
BI ChallengesPoor data quality
Context
User resistance
Database Management System (DMBS)
Database Management System (DMBS)
DBMS contains information about a particular enterprise• Collection of interrelated data• Set of programs to access the data • An environment that is both convenient and efficient to use
Database Applications:• Banking: all transactions• Airlines: reservations, schedules• Universities: registration, grades• Sales: customers, products, purchases• Online retailers: order tracking, customized recommendations• Manufacturing: production, inventory, orders, supply chain• Human resources: employee records, salaries, tax deductions
Databases touch all aspects of our lives
Purpose of Database Systems
In the early days, database applications were built directly on top of file systemsDrawbacks of using file systems to store data:
Data redundancy and inconsistencyMultiple file formats, duplication of information in different files
Difficulty in accessing data Need to write a new program to carry out each new task
Data isolation — multiple files and formatsIntegrity problems
Integrity constraints (e.g. account balance > 0) become “buried” in program code rather than being stated explicitlyHard to add new constraints or change existing ones
Purpose of Database Systems (contd.)
Drawbacks of using file systems (cont.) Atomicity of updates
Failures may leave database in an inconsistent state with partial updates carried outExample: Transfer of funds from one account to another should either complete or not happen at all
Concurrent access by multiple usersConcurrent accessed needed for performanceUncontrolled concurrent accesses can lead to inconsistencies
Example: Two people reading a balance and updating it at the same time
Security problemsHard to provide user access to some, but not all, data
Database systems offer solutions to all the above problems
View of DataView of DataAn architecture for a database system
View 1
Physicallevel
Logical level
View 2 View n…
View level
Levels of Abstraction
Physical level: describes how a record (e.g. customer) is stored.Logical level: describes data stored in database, and the relationships among the data.
type customer = record name: string; street: string; city: integer;end;
View level: application programs hide details of data types. Views can also hide information (e.g. salary) for security purposes.
Instances and SchemasSimilar to types and variables in programming languages
Schema – the logical structure of the database Example: The database consists of information about a set of customers and accounts and the relationship between them)Analogous to type information of a variable in a programPhysical schema: database design at the physical levelLogical schema: database design at the logical level
Instance – the actual content of the database at a particular point in time Analogous to the value of a variable
Physical Data Independence – the ability to modify the physical schema without changing the logical schema
Applications depend on the logical schemaIn general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.
Data IndependenceAbility to modify a schema definition in one level without affecting a schema definition in the other levels.The interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.Two levels of data independence
Physical data independenceLogical data independence
Data ModelsA collection of tools for describing:
DataData relationshipsData semanticsData constraints
Object-based logical modelsEntity-relationship modelObject-oriented modelSemantic modelFunctional model
Record-based logical modelsRelational model (e.g., SQL/DS, DB2)Network modelHierarchical model (e.g., IMS)
The Entity-Relationship Model
Models an enterprise as a collection of entities and relationships
Entity: a “thing” or “object” in the enterprise that is distinguishable from other objects
Described by a set of attributesRelationship: an association among several entities
Represented diagrammatically by an entity-relationship diagram:
Relational Model
Example of tabular data in the relational model:
name ssn street city account-numberJohnson 192-83-7465 Alma Palo Alto A-101Smith 019-28-3746 North Rye A-215Johnson 192-83-7465 Alma Palo Alto A-201Jones 321-12-3123 Main Harrison A-217Smith 019-28-3746 North Rye A-201
account-number balanceA-101 500A-201 900A-215 700A-217 750
Data Definition Language (DDL)Specification notation for defining the database schemaDDL compiler generates a set of tables stored in a data dictionaryData dictionary contains metadata (data about data)Data storage and definition language – special type of DDL in which the storage structure and access methods used by the database system are specified
Data Manipulation Language (DML)
Language for accessing and manipulating the data organized by the appropriate data modelTwo classes of languages
Procedural – user specifies what data is required and how to get those dataNonprocedural – user specifies what data is required without specifying how to get those data
Transaction ManagementA transaction is a collection of operations that performs a single logical function in a database application.Transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g. power failures and operating system crashes) and transaction failures.Concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database.
Storage ManagementStorage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.The storage manager is responsible to the following tasks:
Interaction with the file manager Efficient storing, retrieving and updating of data
Issues:Storage accessFile organizationIndexing and hashing
Database AdministratorCoordinates all the activities of the database system; the database administrator has a good understanding of the enterprise’s information resources and needs:
Database administrator’s duties include:Schema definitionStorage structure and access method definitionSchema and physical organization modificationGranting user authority to access the databaseSpecifying integrity constraintsActing as liaison with usersMonitoring performance and responding to changes in requirements
Database Users
Users are differentiated by the way they expect to interact with the system.Application programmers: interact with system through DML calls.Specialized users: write specialized database applications that do not fit into the traditional data processing framework Sophisticated users: form requests in a database query language.Naive users: invoke one of the permanent application programs that have been written previously
Overall System Structure
naïve users application sophisticated database (tellers, agents, etc) programmers users administrator
application interface
Application program
query database scheme
users
application program object code query evaluation
engine
Embedded DML precompiler
DML compiler
DDL interpreter
transaction manager
buffer manager
query processor
storage manager
database- management system
File manager
indices Statistical data
Data files Data dictionary
disk storage
SQL (Structured Query Language)
SQL: widely used non-procedural languageExample: Find the name of the customer with customer-id 192-83-7465
select customer.customer_namefrom customerwhere customer.customer_id = ‘192-83-7465’
Example: Find the balances of all accounts held by the customer with customer-id 192-83-7465
select account.balancefrom depositor, accountwhere depositor.customer_id = ‘192-83-7465’ and
depositor.account_number = account.account_number
Application programs generally access databases through one ofLanguage extensions to allow embedded SQLApplication program interface (e.g., ODBC/JDBC) which allow SQL queries to be sent to a database
Database Design
The process of designing the general structure of the database:
Logical Design – Deciding on the database schema. Database design requires that we find a “good” collection of relation schemas.
Business decision – What attributes should we record in the database?Computer Science decision – What relation schemas should we have and how should the attributes be distributed among the various relation schemas?
Physical Design – Deciding on the physical layout of the database
Data Warehousing
In the Beginning, life was simple…
But…
Our information needs…
Kept growing. (The Spider web)
SOURCE: William H. Inmon
Purpose
To explore and discuss the purpose and principles of
data warehousing.
So What Is a Data Warehouse?
A data warehouse is the data repository of an enterprise. It is generally used for research and decision support.
By comparison: an OLTP (on-line transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.
OLTP systems are usually designed independently of each other and it is difficult for them to share information.
Why Do We Need Data Warehouses?
Consolidation of information resourcesImproved query performanceSeparate research and decision support functions from the operational systemsFoundation for data mining, data visualization, advanced reporting and OLAP tools
What Is a Data Warehouse Used for?
Knowledge discoveryMaking consolidated reportsFinding relationships and correlationsData miningExamples
Banks identifying credit risksInsurance companies searching for fraudMedical research
How Do Data Warehouses Differ From Operational Systems?
Data warehouse Operational systemSubject oriented Transaction oriented
Large (hundreds of GB up to several TB)
Small (MB up to several GB)
Historic data Current data
De-normalized table structure (few tables, many columns per table)
Normalized table structure (many tables, few columns per table)
Batch updates Continuous updates
Usually very complex queries Simple to complex queries
Design Differences
Star Schema
Data WarehouseOperational System
ER Diagram
Supporting a Complete Solution
Operational System-Data Entry
Data Warehouse-Data Retrieval
What Is a Data Warehouse?In many organizations, we want a central “store” of all of our entities, concepts, metadata, and historical information
For doing data validation, complex mining, analysis, prediction, …This is the data warehouse
To this point we’ve focused on scenarios where the data “lives” in the sources – here we may have a “master” version (and archival version) in a central database
For performance reasons, availability reasons, archival reasons, …
Master Data ManagementOne of the “modern” uses of the data warehouse is not only to support analytics but to serve as a reference to all of the entities in the organization
A cleaned, validated repository of what we know… which can be linked to by data sources… which may help with data cleaning
… and which may be the basis of data governance (processes by which data is created and modified in a systematic way, e.g., to comply with gov’t regulations)
There is an emerging field called master data management out the process of creating these
Data Warehouses, Data Marts, and Operational Data Stores
Data Warehouse – The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.Operational Data Store (ODS) – A point of integration for operational systems that developed independent of each other. Since an ODS supports day to day operations, it needs to be continually updated.
Building a Data Warehouse
AnalysisDesign
Import dataInstall front-end tools
Test and deploy
Data Warehouse Lifecycle
Stage 1: Analysis
Identify:Target Questions
Data needsTimeliness of data
Granularity
Create an enterprise-level data dictionaryDimensional analysisIdentify facts and dimensions
Analysis– Design– Import data– Install front-end tools– Test and deploy
Stage 2: Design
Star schemaData TransformationAggregates Pre-calculated ValuesHW/SW Architecture
– Analysis
Design– Import data– Install front-end tools– Test and deploy
Dimensional Modeling
Dimensional Modeling
Fact Table – The primary table in a dimensional model that is meant to contain measurements of the business.Dimension Table – One of a set of companion tables to a fact table. Most dimension tables contain many textual attributes that are the basis for constraining and grouping within data warehouse queries.
Stage 3: Import Data
Identify data sourcesExtract the needed data from existing systems to a data staging areaTransform and Clean the data
Resolve data type conflictsResolve naming and key conflictsRemove, correct, or flag bad dataConform Dimensions
Load the data into the warehouse
– Analysis– Design
Import data– Install front-end tools– Test and deploy
Importing Data Into the Warehouse
OLTP 1
OLTP 2
OLTP 3
Data Staging Area DataWarehouse
Operational Systems(source systems)
Stage 4: Install Front-end Tools
Reporting toolsData mining toolsGISEtc.
– Analysis– Design– Import data
Install front-end tools– Test and deploy
Stage 5: Test and Deploy
Usability testsSoftware installationUser trainingPerformance tweaking based on usage
– Analysis– Design– Import data– Install front-end tools
Test and deploy
On-Line Analytical Processing (OLAP)
Making Decision Support Possible
69
Strengths of OLAP
• It is a powerful visualization paradigm
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and outliers
• Many vendors offer OLAP tools
70
Month1 2 3 4 76 5
Pro
du
ct
Toothpaste
JuiceColaMilk
Cream
Soap
Regio
n
WS
N
Dimensions: Product, Region, TimeHierarchical summarization paths
Product Region TimeIndustry Country Year
Category Region Quarter
Product City Month Week
Office Day
Multi-dimensional Data
• “Hey…I sold $100M worth of goods”
71
Data Cube Lattice• Cube lattice
• ABC AB AC BC A B C none
• Can materialize some groupbys, compute others on demand• Question: which groupbys to materialze?• Question: what indices to create• Question: how to organize data (chunks, etc)
72
Visualizing Neighbors is simpler
1 2 3 4 5 6 7 8AprMayJunJulAugSepOctNovDecJanFebMar
Month Store SalesApr 1Apr 2Apr 3Apr 4Apr 5Apr 6Apr 7Apr 8May 1May 2May 3May 4May 5May 6May 7May 8Jun 1Jun 2
73
A Visual Operation: Pivot (Rotate)
10
47
30
12
Juice
Cola
Milk
Cream
NYLA
SF
3/1 3/2 3/3 3/4
Date
Month
Reg
ion
Product
74
“Slicing and Dicing”
Product
Sales Channel
Regio
ns
Retail Direct Special
Household
Telecomm
Video
Audio IndiaFar East
Europe
The Telecomm Slice
75
Roll-up and Drill Down
• Sales Channel• Region• Country• State • Location Address• Sales Representative
Roll
Up
Higher Level ofAggregation
Low-levelDetails
Drill-D
ow
n
76
Nature of OLAP Analysis• Aggregation -- (total sales,
percent-to-total)• Comparison -- Budget vs. Expenses• Ranking -- Top 10, quartile analysis• Access to detailed and aggregate
data• Complex criteria specification• Visualization
77
Organizationally Structured Data
• Different Departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data
marketing
manufacturing
sales
finance
78
Multidimensional Spreadsheets
• Analysts need spreadsheets that support
• pivot tables (cross-tabs)• drill-down and roll-up• slice and dice• sort• selections• derived attributes
• Popular in retail domain
79
OLAP - Data Cube
• Idea: analysts need to group data in many different ways• eg. Sales(region, product, prodtype, prodstyle, date, saleamount)• saleamount is a measure attribute, rest are dimension attributes• groupby every subset of the other attributes
• materialize (precompute and store) groupbys to give online response• Also: hierarchies on attributes: date -> weekday,
date -> month -> quarter -> year
80
Relational OLAP: 3 Tier DSS
Data Warehouse ROLAP Engine Decision Support Client
Database Layer Application Logic Layer Presentation Layer
Store atomic data in industry standard RDBMS.
Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
Obtain multi-dimensional reports from the DSS Client.
81
MD-OLAP: 2 Tier DSSMDDB Engine MDDB Engine Decision Support Client
Database Layer Application Logic Layer Presentation Layer
Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
Obtain multi-dimensional reports from the DSS Client.
OLAPNeed for More Intensive Decision Support4 Main Characteristics
Multidimensional data analysisAdvanced Database SupportEasy-to-use end-user interfacesSupport Client/Server architecture
Multidimensional Data Analysis Techniques
Advanced Data Presentation Functions3-D graphics, Pivot Tables, Crosstabs, etc.Compatible with Spreadsheets & Statistical packagesAdvanced data aggregations, consolidation and classification across time dimensionsAdvanced computational functionsAdvanced data modeling functions
Advanced Database Support
Advanced Data Access FeaturesAccess to many kinds of DBMS’s, flat files, and internal and external data sourcesAccess to aggregated data warehouse dataAdvanced data navigation (drill-downs and roll-ups)Ability to map end-user requests to the appropriate data sourceSupport for Very Large Databases
Easy-to-Use End-User Interface
Graphical User InterfacesMuch more useful if access is kept simple
Client/Server ArchitectureFramework for the new systems to be designed, developed and implementedDivide the OLAP system into several components that define its architecture
Same ComputerDistributed among several computer
OLAP Architecture3 Main Modules
GUI (Graphical User Interface)Analytical Processing LogicData-processing Logic
OLAP Client/Server Architecture
Relational OLAPRelational Online Analytical Processing
OLAP functionality using relational database and familiar query tools to store and analyze multidimensional data
Multidimensional data schema supportData access language & query performance for multidimensional dataSupport for Very Large Databases
Multidimensional Data Schema Support
Decision Support Data tends to beNonnormalizedDuplicatedPreaggregated
Star SchemaSpecial Design technique for multidimensional data representationsOptimize data query operations instead of data update operations
Star SchemasData Modeling Technique to map multidimensional decision support data into a relational databaseCurrent Relational modeling techniques do not serve the needs of advanced data requirements
Data Warehouse Architecture
At the top – a centralized database
Generally configured for queries and appends – not transactionsMany indices, materialized views, etc.
Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools
Data Warehouse
ETL ETL ETL ETL
RDBMS1 RDBMS2
HTML1 XML1
ETL pipelineoutputs
ETL
ETL ToolsETL tools are the equivalent of schema mappings in virtual integration, but are more powerfulArbitrary pieces of code to take data from a source, convert it into data for the warehouse:
import filters – read and convert from data sourcesdata transformations – join, aggregate, filter, convert datade-duplication – finds multiple records referring to the same entity, merges themprofiling – builds tables, histograms, etc. to summarize dataquality management – test against master values, known business rules, constraints, etc.
Example ETL Tool Chain
• This is an example for e-commerce loading
• Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load
Invoice line items
SplitDate-time
Filterinvalid
JoinFilterinvalid
Invalid dates/ times
Invaliditems
Itemrecords
Filternon -
match
Invalidcustomers
Group by customer
Customerbalance
Customerrecords
An Introduction to Text Mining
Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events.
— Gareth Herschel, Research Director, Gartner Group
Predictive Analytics: Defined
Predictive analysis:•Leverages an organization’s business knowledge by applying sophisticated analytic techniques to enterprise data•It turns that data into insights that lead to the development of programs to increase revenues, reduce costs, improve processes, and prevent criminal or fraudulent activities•It encourage actions that demonstrably change how people behave as your customers, employees, patients, students, and citizensBottom line: it turns data into effective actions that positively impact your bottom line
Unstructured Data Management
Text Mining is a subset of Unstructured Data Management.
UDM can be broken down into:Content and Document ManagementSearch and Retrieval XML database and toolsCategorization, Classification, and Visualization
Data Warehouse vs. Document Warehouse
Data warehouseWho, what, when, where, how muchInternally focusedOperational information Rarely include external information
Document warehouseWhyMay not be internally focusedMay contain a range of informationOften integrate external information
Document Warehouse Features
There is no single document structure or document typeDocuments are drawn from multiple sourcesEssential features of documents are automatically extracted and explicitly stored in the document warehouseDocument warehouses are designed to integrate semantically related documents
100
Building the Document Warehouse
IdentifySources
RetrieveDocument
TextAnalysis
Pre-processDocument
CompileMetadata
101
Predict, Impact, Deploy
Customer
Data
Attitudes
Actions
Attributes
Business User
Grow
Retain
Fraud
Outcomes
Attract
Data Collection
Text
Surveys
WebChannel
OperationalSystems
Text
Bu
sin
ess
UI
Expert UIExpert UI
Concepts
Concept Maps
Clustering
Categoriza-tion
Trending
Information Extraction
Prediction
NLP
The Building Blocks of Language
MorphologySyntaxSemanticsPhonologyPragmatics
MorphologyUnderstanding words
StemsAffixes
PrefixSuffix
Inflectional elements
Reducing complexity of
analysis
Reduces complexity of
representation
Supports text mining
Noun
PrefixNoun Stem
Suffix
- abledisputein -
SyntaxThe Bank of Canada will curb inflation with higher interest rates
Prepositional phrase
Adjective
Sentence
Noun phrase Verb phrase
NounVerbAux
Noun phrase
NounAdjective
Noun
The Bank ofCanada
inflationcurbwill
Interest rateshigher
with
SemanticsThe meaning of it allApproaches to meaning
Semantic networksDeductive logicRule-based systems
Useful for classification
Problems with NLPLimitations of Natural Language Processing
Correctly identifying the role of noun phrasesRepresenting abstract conceptsClassifying synonymsRepresenting the number of concepts
Problems with NLPLimitations of technology
Language specific designs are requiredClassification speedClassifying hybrid words and sentences
Underlying Technology is Based on Linguistics
The Linguistic Approach:Does not treat a document as a bag of words
Removes ambiguity by extracting structured concepts
Concepts are the DNA of text.
Text is unstructured, ambiguous, and language dependent.
From Text to Concepts
Morphology
Syntax
Semantics StatisticsLinguistic
Terminology
Extractor
ScalableAccurate
Customizable Discovery-Oriented
•Compound words•Proper nouns•Figures•Named entities•Domain specifics
•Speed•Multiple formats•Multiple languages
•SPSS dictionaries•User dictionaries•Extraction rules•Extraction patterns
•Known terms•Unknown terms•New terms
•Trends
“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”- The Gartner group.
Data Mining
Why data mining?Data Mining software generally employs modeling algorithms
designed to handle non-linearities and unusual patterns in dataAs opposed to classical linear models (e.g., linear regression) that aren’t as
capable
A related issue is ‘noise’ in the data: where, for example, 2 seemingly similar sets of inputs yield a different output
Use the cross industry standard process for data mining (CRISP-DM)
Based on real-world lessons:Focus on business issues
User-centric & interactiveFull processResults are used
A Data Mining Methodology
Data Mining is not…Keep in mind that data mining is not…
“Blind” application of analysis/modeling algorithmsBrute-force crunching of bulk dataBlack box technologyMagic
114
Back to the Process
Text Mining
UnderstandingBusiness Understanding
Determine objectiveAssess situationDetermine data mining goalsProduce project plan
Data UnderstandingCollect initial dataDescribe dataExplore dataVerify data quality
Data PreparationData
Data setData set descriptionSelect dataClean dataConstruct data set / Integrate dataFormat data
TextConcept extractionConcept combinationConcept assessment
ModelingSelect modeling technique
Universe of techniquesAppropriate techniques
DataText
RequirementsConstraintsSelected tools
Generate test designRun model(s)Assess model(s)
EvaluationResults = Models + Findings
Evaluate resultsReview processDetermine next steps
DeploymentPlan deploymentPlan monitoring and maintenanceFinal reportProject review
Unsupervised methods:Group patients by drugs and demographic information and try to find
unusual patients
Supervised methods:Attempt to predict amount due and find sets of cases where the amount
due is very different from the predicted amount
Data Mining Approaches
Copyright 2003-4, SPSS Inc. 121
What Does Data Mining Do?• Data mining uses existing data to:
• Predict• Category membership• Numeric Value• Ie. Credit risk
• Group• Cluster (group) things together based
on their characteristics• Ie. Different types of TV viewers
• Associate• Find events that occur together, or in
a sequence• Ie. Beer and diapers
• Find outliers• Identify cases that don’t follow
expected behavior• Ie. Fraudulent behaviour
Data Mining vs. Text MiningData Mining
• Process directly• Identify causal
relationship• Structured numeric
transaction data residing in rational data warehouse
• Text Mining• Linguistic processing or
natural language processing (NLP)
• Discover heretofore unknown information[2]
• Applications deal with much more diverse and eclectic collections of systems and formats[4]
ConfusionIs text mining the same as information extraction? No!Information Extraction (IE)
Extract facts about pre-specified entities, events or relationships from unrestricted text sources.No novelty: only information is already present is extracted.
Two Foundations
Information Retrieval (IR)
Artificial Intelligence (AI)
Information RetrievalThe science of searching for
• Information in documents• Documents themselves• Metadata which describe documents• Text, sound, images or data, within database:
relational stand-alone database or hypertext networked databases such as the Internet or intranets.
Information RetrievalGerard Salton
Functional overview of IR
Artificial IntelligenceArtificial intelligence (AI) is a branch of computer science and engineering that deals with intelligent behavior, learning, and adaptation in machines.
Natural Language ProcessingNatural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Statistical natural language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties : e.g. text segmentation, word sense disambiguation
Data Warehousing, OLAP and data mining: what and why?
130
Data Warehouse Architecture
Data Warehouse Engine
Optimized Loader
ExtractionCleansing
AnalyzeQuery
Metadata Repository
RelationalDatabases
LegacyData
Purchased Data
ERPSystems
Data Warehouse for Decision Support & OLAP
Putting Information technology to help the knowledge worker make faster and better decisions
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over last 10 years?
Decision SupportUsed to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgements
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
We want to know ...Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
If I raise the price of my product by Rs. 2, what is the effect on my ROI?
If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
What makes data mining possible?
Advances in the following areas are making data mining deployable:• data warehousing • better and more data (i.e., operational, behavioral, and
demographic) • the emergence of easily deployed data mining tools and • the advent of new data mining techniques.
Why Separate Data Warehouse?
Missing data: Decision support requires historical data, which op dbs do not typically maintain.
Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.
Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.