Gio Wiederhold PDM 1
Profiting from Data Mining
Gio Wiederhold, November 2003
Gio Wiederhold PDM 2
Steps needed to profit
1. Obtaining relevant data
   – Always incomplete
2. Extracting relationships
   – Imputing causality
3. Finding applicability
   – Determining leverage points
4. Inventing candidate actions
   – Assessing likely outcomes and benefits
5. Selecting action to be taken
   – Measuring the outcome
Collecting data for next round
Model based?
Gio Wiederhold PDM 3
Today's Problem: Disjointness
1. Database administrators
   • Focus on data collection, organization, currency
2. Analysts
   • Focus on slicing, dicing, relationships
3. Middle managers
   • Focus on their costs, profits
4. MBAs
   • Focus on business models, planning
5. Executives
   • Must make decisions based on diverse inputs
Gio Wiederhold PDM 4
1. Data Collection
Two choices
1. (rare) Collect data specifically for analysis
   • allows careful design -- model causes and effects
     Purchase = f(price, color, size, customer income, gender, ...)
   • costly
   • often small to make collection manageable
   • imposes delays
2. (common) Use data collected for other purposes
   • take advantage of what is readily available
   • low cost
   • filtering, reformatting, integration
   • incomplete -- rarely covers all causes / effects
   • biased -- missing categories
     only people with phones, cars -- shopping in supermarkets
Gio Wiederhold PDM 5
1a. Data Integration
Needed when sources have inadequate coverage
• in distinct DBs for
  – Prices, Number purchased
  – Customer segments (supermarket, stores, on-line)
implies some expectations
append attributes where keys match: Joe
include semantic match Joe = 012 34 567
append rows where key types match: customer
include semantic match customer = owner
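The two integration operations on this slide can be sketched in plain Python. This is a minimal illustration, not the author's implementation: the record contents, the `key_alias` and type-alias mappings, and both function names are assumptions made up for the example. Only the matches "Joe = 012 34 567" and "customer = owner" come from the slide.

```python
# Hypothetical sources; all attribute names and values are illustrative only.
purchases = {"Joe": {"number_purchased": 3}}            # keyed by customer name
segments = {"012 34 567": {"segment": "on-line"}}       # keyed by customer number

# Semantic match supplied by an expert: both keys denote the same customer.
key_alias = {"Joe": "012 34 567"}


def append_attributes(left, right, alias):
    """Extend each left record with the attributes of the matching right record."""
    merged = {}
    for key, attrs in left.items():
        canonical = alias.get(key, key)  # map the key through the semantic match
        merged[canonical] = {**attrs, **right.get(canonical, {})}
    return merged


def append_rows(tables, type_alias):
    """Stack rows from tables whose key types match after semantic mapping."""
    rows = []
    for key_type, table_rows in tables.items():
        canonical_type = type_alias.get(key_type, key_type)
        rows.extend({"key_type": canonical_type, **row} for row in table_rows)
    return rows


joined = append_attributes(purchases, segments, key_alias)
stacked = append_rows(
    {"customer": [{"name": "Joe"}], "owner": [{"name": "Ann"}]},
    {"owner": "customer"},  # semantic match: customer = owner
)
```

Keeping the semantic matches in explicit mapping tables makes the "expectations" that integration implies visible and reviewable by an expert.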
Gio Wiederhold PDM 6
2. Data analysis
• Find relationships
  – already known -- ignore or adjust in next round
    » requires comparison with expert knowledge
    » now have quantification
  – unknown
    » uninteresting per expert
    » interesting per expert
Gio Wiederhold PDM 7
3. Establish causality
• Already known -- Prior Model
  – But is it complete, i.e., does it explain all effects?
• Analyze relationships
  – use expertise to decide direction
    » often obvious: "common world knowledge"
    » sometimes ambiguous: smoking, cancer, not-smoking
    » often major true cause not captured in data
      example: purchase of Chinese vs other food
      food color 10%, food price 20%, buyer gender 2%, unknown 75%
      guess: ethnicity, income
      invent surrogates: names, ZIP codes, ...
      use temporal information
Gio Wiederhold PDM 8
Establishing causality is risky
1. Is a Volvo a safe car? Mined: Volvos have fewer accidents
2. What causes accidents? Drivers!
3. Who buys Volvos? Careful drivers!
4. Must determine
   • effect of safe drivers
   • percentage of safe drivers overall
   • percentage of safe drivers with Volvos
5. How much of the accident rate is now explained?
   The unexplained difference can be attributed to the car.
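The reasoning on this slide can be worked through with a small calculation. The sketch below assumes a simple two-group model (safe and unsafe drivers, each with a fixed accident rate). All numbers are invented for illustration and are not mined results.

```python
def expected_rate(share_safe, rate_safe, rate_unsafe):
    """Accident rate predicted from the driver mix alone (the car plays no role)."""
    return share_safe * rate_safe + (1.0 - share_safe) * rate_unsafe


# Hypothetical inputs for step 4 of the slide.
rate_safe, rate_unsafe = 0.02, 0.10   # effect of safe drivers (accidents per year)
share_safe_overall = 0.50             # percentage of safe drivers overall
share_safe_volvo = 0.80               # percentage of safe drivers with Volvos
observed_volvo = 0.030                # mined: Volvos have fewer accidents

overall = expected_rate(share_safe_overall, rate_safe, rate_unsafe)
volvo_from_drivers = expected_rate(share_safe_volvo, rate_safe, rate_unsafe)

raw_gap = overall - observed_volvo                   # what the mining result shows
explained = overall - volvo_from_drivers             # part due to who buys Volvos
unexplained = volvo_from_drivers - observed_volvo    # part attributable to the car
share_explained = explained / raw_gap
```

With these made-up inputs the driver mix explains about 80% of the gap, so only the remaining fifth can be credited to the car itself.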
Gio Wiederhold PDM 9
Change cause, create effects
To use results of data mining
• have to understand direction of relationships
[Figure: a Model linking causes (controllable, external, hidden, captured by data) to effects (interesting beneficial effects, side effects)]
Gio Wiederhold PDM 10
4. Causes provide the leverage
Language of analyst / Language of modeling
• Many causes -- independent variables
  – A few may be controllable
  – Some may be controlled by our competition
  – Others are forces-of-nature
• Even more effects -- dependent variables
  – A few may be desired
  – Some may be disastrous
  – Many are poorly understood
• Intermediate effects
  – Provide a means for measuring effectiveness
  – Allow correction of actions taken
Gio Wiederhold PDM 11
5. Planning & Assessment
Analyze Alternatives
• Current Capabilities
• Future Expectations
Process tasks:
• List resources
• Enumerate alternatives
• Prune alternatives
• Compare alternatives
[Figure labels: now; Predict the future]
Gio Wiederhold PDM 13
Simulations predict
1. Back-of-the-envelope
   • Common
   • Adequate if model is simple
   • Assumptions are easily forgotten after some time, not distinguished from data: "Why are we doing this?"
2. Spreadsheets
   • Most common computing tool
   • Specialist modeler can help
   • New, recent data can be pasted in
   • Awkward for the tree of future alternatives
3. Constructed to order
   • Costly, powerful technology
   • Specialist modelers required
   • Expressive simulation languages
   • Requires specialists to set up, run, and rerun with new data
Gio Wiederhold PDM 14
Simulation results: likelihoods
[Figure: tree of alternatives along a time axis starting at now; branches for next-period alternatives and subsequent periods are labeled with likelihoods; uncertainty increases with time]
Gio Wiederhold PDM 15
Simulation services
Wide variety, but common principle
Inputs -> Model -> Output (time, $, place, ...)
1. Spreadsheets
   – Identify independent, controllable, and resulting values
2. Execution specific to query: what-if assessment
   – may require HPC power for adequate response
3. Continuously executing: weather prediction
   – Search for best match (location, time)
4. Past simulation results collected for future use
   – Tables in a design handbook: materials
   – Typically sparse -- the dimension of the futures is too large
   – Perform inter- or extrapolations to match query parameters
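Item 4 above, answering a query from sparse stored results, can be illustrated with linear inter- and extrapolation over a one-dimensional table. This is a minimal sketch: the function name and the table values (a made-up handbook table of material strength versus temperature) are assumptions, and real handbook tables usually have more dimensions.

```python
from bisect import bisect_left


def interpolate(table, x):
    """Linearly inter- or extrapolate y at x from (x, y) pairs sorted by x."""
    if len(table) < 2:
        raise ValueError("need at least two stored results")
    xs = [point[0] for point in table]
    # Pick the segment that contains x; clamp to the first or last segment
    # so that queries outside the table are extrapolated.
    i = min(max(bisect_left(xs, x), 1), len(table) - 1)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical stored simulation results: (temperature, strength).
strength_table = [(0, 400.0), (100, 380.0), (200, 300.0)]

inside = interpolate(strength_table, 150)    # between two stored results
outside = interpolate(strength_table, 250)   # beyond the last stored result
```

Extrapolated answers such as `outside` deserve less trust than interpolated ones, so a service built this way would do well to report which case applied.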
Gio Wiederhold PDM 16
6. Specify Value of Effects
Still needed: Value of alternative outcomes
• Decision maker / owner input
  – Benefits and Costs
  – Potential Profit
  – Correct for risk, and adjust to present value
[Figure: tree over a time axis (past, now, futures) with outcome values 1000, 2000, 5000, 1000, 0, -2000, -6000]
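The step "correct for risk, and adjust to present value" can be sketched as follows. The slide does not say how either correction is made. Standard compound discounting and simple probability weighting are assumed here, and the rate, horizon, and probability are invented.

```python
def present_value(future_value, periods, discount_rate):
    """Discount a value received `periods` from now back to today."""
    return future_value / (1.0 + discount_rate) ** periods


def risk_adjusted(value, probability):
    """Weight a value by the likelihood that the outcome occurs (an assumed risk model)."""
    return probability * value


# Hypothetical outcome: 5000 received two periods from now,
# with a 30% likelihood and a 10% discount rate per period.
pv = present_value(5000.0, periods=2, discount_rate=0.10)
worth_today = risk_adjusted(pv, probability=0.30)
```

A decision maker who is risk-averse rather than risk-neutral would replace `risk_adjusted` with a utility function of their own.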
Gio Wiederhold PDM 17
Having it all together
• Relationships from analyses of past data
• Data representing the current state
• List of actionable alternatives
• Tree of subsequent alternatives
• Probabilities of those alternatives
• Values of the outcomes
• Ability to predict the likelihood of futures
[Figure: the tree of alternatives with likelihoods on its branches, combined with the outcome values 1000, 2000, 5000, 1000, 0, -2000, -6000]
Gio Wiederhold PDM 18
Vision: Putting it all together
Combine results mined from past data, current observations, and predictions into the future.
[Figure: a Decision Maker and support specialists viewing information along a time axis]
Gio Wiederhold PDM 19
Needed: Information Systems that also project seamlessly into the Futures
Support of decision-making requires dealing with the futures as well as the past
• Databases deal well with the past
• Streaming sensors supply current status
• Spreadsheets, simulations deal with the likely futures
Future information systems should combine all these sources
[Figure: time axis: past, now, future]
Gio Wiederhold PDM 20
Connecting it all
Build super systems
• Coherent, consistent
• Expensive
• Unmaintainable
• Too many cooks:
  – Database folk
  – Data miners
  – Analysts
  – Planners
  – Simulation specialists
  – Decision makers
Develop interfaces
• Incremental
• Composable as needed
• Heterogeneous
• Interfaces required: Metadata
  – Database to miners: SQL
  – Mined results to analysts: XML?
  – Analysts to planners: ?
  – Planners to simulations: ? SimQL
  – Decision makers: New tools!
Gio Wiederhold PDM 21
Interfaces enable integration
New: SimQL to access Simulations
[Figure: time axis: past, now, futures]
• Past: Databases and schemas, accessed via SQL or XML
• Now: Message systems, sensors; streaming data
• Futures: Simulations, accessed via SimQL and schema-compliant wrappers
Gio Wiederhold PDM 22
SimQL proof-of-concept Implementation
[Figure: architecture diagram. Components: Parser, Metadata Manager, Query Manager, Schema Manager, wrapped simulations with metadata. Developer (development interaction): schema commands, filing of access specs, help, error reports. Customer (production interaction): query, use of access specs, help. Also shown: initiation and results of simulations.]
Gio Wiederhold PDM 23
Demonstration of SimQL
[Figure: test applications with a simple GUI express requirements in a common language; wrappers connect them to the sources below]
• Business planning spreadsheets
• Weather on the Internet
• Engineering simulation
• Shipping location database
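The idea of putting different simulations behind one kind of wrapper can be illustrated in Python. This sketch is not SimQL and does not reproduce its syntax or the proof-of-concept architecture. The class, its methods, and the toy planning model are all hypothetical, meant only to show a wrapper that publishes a schema and hides the model behind it.

```python
class SimulationWrapper:
    """Hypothetical uniform front end for one simulation (not SimQL itself)."""

    def __init__(self, name, inputs, outputs, run):
        self.name = name
        self.inputs = tuple(inputs)
        self.outputs = tuple(outputs)
        self._run = run  # callable that executes the underlying model

    def schema(self):
        """Describe what a client may set and what it may ask for."""
        return {"name": self.name, "inputs": self.inputs, "outputs": self.outputs}

    def query(self, **params):
        """Run the model with the given inputs and return only declared outputs."""
        unknown = set(params) - set(self.inputs)
        if unknown:
            raise ValueError(f"unknown inputs: {sorted(unknown)}")
        results = self._run(**params)
        return {key: results[key] for key in self.outputs}


def planning_model(units, price, unit_cost):
    """Stand-in for a business planning spreadsheet."""
    revenue = units * price
    return {"revenue": revenue, "profit": revenue - units * unit_cost}


planning = SimulationWrapper(
    "business_planning", ["units", "price", "unit_cost"], ["profit"], planning_model
)
answer = planning.query(units=10, price=5.0, unit_cost=3.0)
```

Because the client sees only the schema, a weather service or an engineering simulation could sit behind the same two methods without the test application changing.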
Gio Wiederhold PDM 24
Information system use of simulation results
Simulation results are mapped to alternative courses of action.
The information system should support the model driving the computation and recomputation of likelihoods.
Likelihoods change as now moves forward and eliminates earlier alternatives.
[Figure: tree of alternative courses of action over a time axis, with a probability on each branch]
Gio Wiederhold PDM 25
The likelihoods multiply out to the end effects; the values of the end effects can then be applied back to earlier nodes
[Figure: tree of next-period and subsequent-period alternatives over a time axis (past, now, future); each branch carries a probability and each node a value, with end-effect values such as 1000, 2000, 5000, 0, -3000, and -6000 rolled back to the earlier nodes]
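The computation this slide describes can be sketched on a small tree. Multiplying branch probabilities along a path gives the likelihood of each end effect, and weighting the end values by those probabilities gives a value for every earlier node. The nested-dictionary tree and all of its probabilities and values are assumptions chosen for the example, not the numbers from the slide's figure.

```python
def leaf_likelihoods(node, path_probability=1.0):
    """Multiply probabilities along each path; return (likelihood, value) per end effect."""
    if "children" not in node:
        return [(path_probability, node["value"])]
    pairs = []
    for probability, child in node["children"]:
        pairs.extend(leaf_likelihoods(child, path_probability * probability))
    return pairs


def expected_value(node):
    """Roll end-effect values back to this node, weighted by branch probability."""
    if "children" not in node:
        return node["value"]
    return sum(p * expected_value(child) for p, child in node["children"])


# Hypothetical two-period tree starting at "now".
action_a = {"children": [(0.5, {"value": 1000}), (0.5, {"value": 2000})]}
action_b = {"children": [(0.3, {"value": 5000}), (0.7, {"value": -2000})]}
now = {"children": [(0.4, action_a), (0.6, action_b)]}

likelihoods = leaf_likelihoods(now)
value_a = expected_value(action_a)
value_b = expected_value(action_b)
value_now = expected_value(now)
```

Here alternative A is worth far more than B even though B contains the single best outcome, which is the kind of comparison the rolled-back node values make possible.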
Gio Wiederhold PDM 26
Recomputation is needed at the next time phase
Re-assess as time marches forward!
[Figure: "A Pruned Bush": the tree of alternatives after now has moved forward, with eliminated branches removed and some rolled-back values now uncertain (marked "?"); inputs come from databases, spreadsheets, other simulations, messages, and sensors; time axis: past, now, future]
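One way to picture the recomputation is as pruning followed by renormalization. Once an alternative has actually occurred, futures that do not pass through it are dropped and the likelihoods of the survivors are rescaled to sum to one. The flat list of futures, the path labels, and the numbers below are hypothetical. A real system would presumably also rerun its simulations with fresh data.

```python
def prune_and_renormalize(futures, observed_prefix):
    """Keep futures consistent with what has happened and rescale their likelihoods."""
    n = len(observed_prefix)
    kept = [f for f in futures if f["path"][:n] == observed_prefix]
    total = sum(f["likelihood"] for f in kept)
    if total == 0:
        raise ValueError("no remaining future matches the observations")
    return [{**f, "likelihood": f["likelihood"] / total} for f in kept]


# Hypothetical futures as seen from the earlier "now".
futures = [
    {"path": ("A", "x"), "likelihood": 0.20, "value": 1000},
    {"path": ("A", "y"), "likelihood": 0.20, "value": 2000},
    {"path": ("B", "x"), "likelihood": 0.18, "value": 5000},
    {"path": ("B", "y"), "likelihood": 0.42, "value": -2000},
]

# Time has moved on: alternative "B" was taken, so the "A" branches are gone.
remaining = prune_and_renormalize(futures, ("B",))
new_value = sum(f["likelihood"] * f["value"] for f in remaining)
```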
Gio Wiederhold PDM 27
Even the present needs SimQL
Not all data are current:
• Is the delivery truck in X?
• Is the right stuff on the truck?
• Will the crew be at X?
• Will the forces be ready to accept delivery?
[Figure: time axis (past, now, future): simple simulations extrapolate data from the last recorded observations to the point-in-time for situational assessment]
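The first question on this slide shows why even "now" needs a small simulation. The last recorded observation of the truck is already old, so its present position has to be extrapolated. The sketch below assumes constant speed along a single route. The function names, units, and numbers are made up for the example.

```python
def extrapolate_position(last_position_km, speed_kmh, observed_at_h, now_h):
    """Estimate the current position from the last recorded observation."""
    elapsed_h = now_h - observed_at_h
    if elapsed_h < 0:
        raise ValueError("observation lies in the future")
    return last_position_km + speed_kmh * elapsed_h


def is_at(position_km, target_km, tolerance_km=1.0):
    """Decide whether an estimated position counts as being at the target."""
    return abs(position_km - target_km) <= tolerance_km


# Hypothetical data: truck last seen at km 100 at 08:00, travelling at 60 km/h.
# X lies at km 190 and the situational assessment is made for 09:30.
estimate_km = extrapolate_position(100.0, 60.0, observed_at_h=8.0, now_h=9.5)
truck_in_x = is_at(estimate_km, target_km=190.0)
```

The older the last observation, the less this estimate can be trusted, which ties the question back to reporting uncertainty along with the answer.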
Gio Wiederhold PDM 28
Integrative information systems: research questions
• What human interfaces can support the decision maker?
• How to move seamlessly from the past to the future?
• What system interfaces are good now and stay adaptable?
• How can multiple futures be managed (indexed)?
• How can multiple futures be compared, selected?
• How should joint uncertainty be computed?
• How can the NOW point be moved automatically?
Gio Wiederhold PDM 29
SimQL research questions
• How little of the model needs to be exposed?
• How can defaults be set rationally?
• How should expected execution cost be reported?
• How should uncertainty be reported?
• Are there differences among application areas that require different language structures?
• Are there differences among application areas that require different language features?
• How will the language interface support effective partitioning and distribution?
Gio Wiederhold PDM 30
Moving to a Service Paradigm
Interfaces define service potentials
• Server is an independent contractor, defines service
• Client selects service, and specifies parameters
• Server’s success depends on value provided
• Some form of payment is due for services
Databases are a current example. Simulations have the same potential.
Gio Wiederhold PDM 31
Summary of SimQL
A new service for Decision Making:
• follows database paradigm
  – (by about 25 years)
• coherence in prediction
  – displacement of ad-hoc practices
• seamless information integration
  – single paradigm for decision makers
• simulation industry infrastructure
  – investment has a potential market
  – should follow database industry model:
    Interfaces promote new industries
Gio Wiederhold PDM 32
Summary: Today decision-making support is disjoint; each community improves its area and ignores others
[Figure: separate areas (Databases, Simulation, Planning Science, Distribution) that do not interoperate; extensions for network support are also disjoint]
Gio Wiederhold PDM 33
The decision maker has few tools
• Spreadsheets
• Planning of allocations
• Other simulations
various point assessments
[Figure: time axis (past, now, future) contrasting organized support (Databases; Data integration: distributed, heterogeneous) with disjointed support (the tools listed above, plus Intuition)]
Gio Wiederhold PDM 34
Coda: Put relevant work together and move on
Support integration of results mined from past data, current observations, and predictions about the futures.
[Figure: "Real Information Systems": a Decision Maker uses human interfaces; service interfaces connect Databases, Data Mining, Modeling tools, and Simulation Support Services]