![Page 1: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/1.jpg)
TEAM MEMBERS: HAO ZHOU, YUJIA TIAN, PENGFEI GENG, YANQING ZHOU, YA LIU, KEXIN LIU
DIRECTOR: PROF. ELKE A. RUNDENSTEINER, PROF. IAN H. WITTEN
Data Mining with Weka: Putting it all together
![Page 2: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/2.jpg)
Outline
5.1 The data mining process By Hao
5.2 Pitfalls and pratfalls By Yujia, Pengfei
5.3 Data mining and ethics By Yanqing
5.4 Summary By Ya, Kexin
![Page 3: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/3.jpg)
5.1 The data mining process By Hao
![Page 4: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/4.jpg)
5.1 The data mining process
Feeling lucky:
- Weka is not everything I need to talk about in my part (know how, rather than why, to use Weka)
Maybe not so lucky:
- Talking about Weka is time-consuming. =)
![Page 5: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/5.jpg)
From Weka to real life
When we use Weka in the MOOC, we never have to worry about the dataset, because it has already been collected.
![Page 6: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/6.jpg)
Procedures in real life
Why do we do data mining in real life?
- for course projects (this is my current situation)
- for solving real-life problems
- for fun
- for …
Once we have specified our “question”[1], the next step is to gather the data[2] we need.
![Page 7: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/7.jpg)
Real life project
This summer vacation, I worked as a volunteer programmer (unpaid) for a start-up whose objective is to provide article recommendations for developers[1].
For this, we must maintain our own database (MongoDB), which indexes all the up-to-date articles we gather from across the Internet.
We gather articles in many ways, and I focused on just one of them: getting article links from influencers’ tweets through APIs.
![Page 8: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/8.jpg)
Procedures in real life
Do all the links I gathered work? - Never, even though I wish they did.
1. Due to algorithm issues, some of the links I got are badly formatted.
2. Even when the links are correct, I cannot get articles from all of them, as some do not point to articles.
[3. More problems arise after getting the articles from the links]
-- After gathering our data, we must do some clean-up[3] to make better use of it.
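The clean-up step for gathered links can be sketched roughly as below. This is only an illustration, not the start-up's actual code: the function names `is_well_formed` and `clean_links` and the sample links are hypothetical, and the check covers only the "bad format" problem, not whether a link really leads to an article.

```python
from urllib.parse import urlparse


def is_well_formed(link: str) -> bool:
    """Return True if the link has an http(s) scheme and a host name."""
    parts = urlparse(link.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_links(links):
    """Drop badly formatted links and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for link in links:
        link = link.strip()
        if is_well_formed(link) and link not in seen:
            seen.add(link)
            cleaned.append(link)
    return cleaned


# Hypothetical raw links, as they might come out of tweets.
raw_links = [
    "https://example.com/post/1",
    "htp:/broken",                   # bad scheme, no host
    "https://example.com/post/1",    # duplicate
    "example.com/no-scheme",         # missing scheme
    " http://example.org/article ",  # stray whitespace
]

print(clean_links(raw_links))
```

Checking whether a well-formed link actually points to an article would need a further step (fetching the page and inspecting it), which is left out here.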
![Page 9: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/9.jpg)
Procedures in real life
OK, assume that we now have all the [raw] data (articles, here) we need.
Now come the most important jobs. One of them is how to rank articles for different keywords [and how to define the keyword collection]. (This is more a mathematics issue than a computer science one, and I did not participate in this part.)
-- Define new features
![Page 10: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/10.jpg)
Procedures in real life
After the new features are defined, the last step is to build a web app, so that users can enjoy “our” work.
This last step of the project is still under construction, which means “we” still need more time to “deploy the result”.
We will go to section 5.2 now -->
![Page 11: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/11.jpg)
5.2 Pitfalls and pratfalls By Yujia, Pengfei
![Page 12: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/12.jpg)
5.2 Pitfalls and pratfalls
Pitfall: A hidden or unsuspected danger or difficulty
Pratfall: A stupid and humiliating action
Tricky parts and how to deal with them
![Page 13: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/13.jpg)
Be skeptical
In data mining, it’s very easy to cheat, whether consciously or unconsciously
For reliable tests, use a completely fresh sample of data that has never been seen before
![Page 14: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/14.jpg)
Overfitting has many faces
Don’t test on the training set (of course!)
Data that has been used for development (in any way) is tainted
Leave some evaluation data aside for the very end
Key: always test on completely fresh data.
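One way to keep some evaluation data aside for the very end is a three-way split, sketched below. The function name `three_way_split`, the 60/20/20 proportions, and the seed are assumptions chosen for illustration; the point is only that the final test portion is separated first and not touched during development.

```python
import random


def three_way_split(instances, dev_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then carve out a final test set, a development set,
    and a training set that do not overlap."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]                # set aside, used once at the very end
    dev = shuffled[n_test:n_test + n_dev]   # used while tuning, so it becomes "tainted"
    train = shuffled[n_test + n_dev:]       # used to build models
    return train, dev, test


# Instance ids stand in for real data rows here.
train, dev, test = three_way_split(range(100))
print(len(train), len(dev), len(test))
```

Any decision made by looking at `dev` (choosing a filter, a parameter, an algorithm) uses up its freshness, which is why `test` is kept separate.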
![Page 15: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/15.jpg)
Missing values
“Missing” means what … Unknown? Unrecorded? Irrelevant?
- Omit instances where the attribute value is missing?
- or treat “missing” as a separate possible value?
Is there significance in the fact that a value is missing?
Most learning algorithms deal with missing values – but they may make different assumptions about them
![Page 16: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/16.jpg)
An Example
OneR and J48 deal with missing values in different ways
Load weather‐nominal.arff
OneR gets 43%, J48 gets 50% (using 10‐fold cross‐validation)
Change the outlook value to unknown on the first four “no” instances
OneR gets 93%, J48 still gets 50%
Look at OneR’s rules: it uses “?” as a fourth value for outlook
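The effect of treating “?” as a value of its own can be sketched in a few lines. This is a simplified one-rule learner on the outlook attribute only, not Weka's OneR implementation, and it measures accuracy on the training data rather than by 10-fold cross-validation, so its numbers are not directly comparable to the percentages on the slide. The (outlook, play) pairs are taken from the 14-instance nominal weather data; the helper names are hypothetical.

```python
from collections import Counter, defaultdict

MISSING = "?"

# (outlook, play) pairs from the 14-instance nominal weather data.
weather = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]


def blank_first_no(rows, how_many=4):
    """Replace outlook with '?' on the first `how_many` instances of class 'no'."""
    out, remaining = [], how_many
    for outlook, play in rows:
        if play == "no" and remaining > 0:
            out.append((MISSING, play))
            remaining -= 1
        else:
            out.append((outlook, play))
    return out


def one_rule(rows):
    """Map each attribute value (including '?') to its majority class."""
    counts = defaultdict(Counter)
    for value, label in rows:
        counts[value][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def training_accuracy(rule, rows):
    """Fraction of rows whose class matches the rule's prediction."""
    return sum(1 for value, label in rows if rule[value] == label) / len(rows)


modified = blank_first_no(weather)
rule = one_rule(modified)          # strategy 1: '?' is a fourth outlook value
kept = [row for row in modified if row[0] != MISSING]  # strategy 2: omit those instances

print(rule)
print(training_accuracy(one_rule(weather), weather), training_accuracy(rule, modified))
```

Because the values are missing only on “no” instances, the fact that a value is missing becomes a strong predictor in itself, which is exactly the kind of significance the previous slide asks about.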
![Page 17: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/17.jpg)
An Example
![Page 18: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/18.jpg)
5.2 Pitfalls and pratfalls Part 2 By Pengfei
![Page 19: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/19.jpg)
No “universal” best algorithm, No free lunch
2‐class problem with 100 binary attributes
Say you know a million instances, and their classes (training set)
You don’t know the classes of 99.9999…% of the data set
How could you possibly figure them out?
Example
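The arithmetic behind this example can be checked directly: 100 binary attributes give 2^100 distinct possible instances, and a million known instances are a vanishing fraction of that. The variable names below are chosen only for illustration.

```python
from fractions import Fraction

n_attributes = 100        # binary attributes
n_known = 10**6           # instances whose class we know (the training set)

n_possible = 2**n_attributes                  # distinct possible instances
known_fraction = Fraction(n_known, n_possible)

print(f"possible instances: {n_possible:.3e}")
print(f"fraction with known class: {float(known_fraction):.3e}")
```

The known fraction is below one part in 10^24, so the data alone cannot determine the classes of the remaining instances without further assumptions.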
![Page 20: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/20.jpg)
No “universal” best algorithm, No free lunch
In order to generalize, every learner must embody some knowledge or assumptions beyond the data it’s given
Delete less useful attributes
Find better filter
Data mining is an experimental science
![Page 21: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/21.jpg)
5.3 Data mining and ethics By Yanqing
![Page 22: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/22.jpg)
5.3 Data mining and ethics
Information privacy laws
Anonymization
The purpose of data mining
Correlation and causation
Source: www.mum.edu
Source: www.ediscoveryreadingroom.com
Source: www.zerohedge.com
Source: www.johnmyleswhite.com
![Page 23: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/23.jpg)
Information privacy laws
In Europe
- Purpose; Keep secret; Accurately update; Provider can review; Deleted asap; Un-transmittable (if less protection)
- No sensitive data (sexual orientation, religion)
In US
- Not highly legislated or regulated
- Computer Security, Privacy and Criminal Law
- But hard to be anonymous...
Be aware of ethical issues and laws
AOL (2006)
- 650,000 users (3 days on the public web)
- at least $5,000 for the identifiable person
Source: www.livingwithgod.org
Source: blog.brainhost.com
![Page 24: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/24.jpg)
Anonymization
It is much harder than you think.
Story: Massachusetts released medical records (mid‐1990s)
- No name, address, social security number
- Re-identification technique
Public records:
- City, birthday, gender: 50% of the US population can be identified
- One more attribute, ZIP code: 85% can be identified
Netflix
- Use movie rating system to identify people
- 99% with 6 movies
- 70% with 2 movies
Source: eofdreams.com
www.resteasypestcontrol.com
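Why a few innocuous attributes are enough to re-identify people can be illustrated by counting how many records are unique on a given combination of attributes. The six records below are entirely made up, and `unique_fraction` is a hypothetical helper; the real figures quoted above come from population-scale public records, not from anything like this toy table.

```python
from collections import Counter

# Made-up "anonymized" records: no names, addresses, or social security numbers.
records = [
    {"city": "Springfield", "birth_date": "1970-01-01", "sex": "F", "zip": "01101"},
    {"city": "Springfield", "birth_date": "1970-01-01", "sex": "F", "zip": "01102"},
    {"city": "Springfield", "birth_date": "1982-05-17", "sex": "M", "zip": "01101"},
    {"city": "Shelbyville", "birth_date": "1982-05-17", "sex": "M", "zip": "02201"},
    {"city": "Shelbyville", "birth_date": "1990-09-30", "sex": "F", "zip": "02201"},
    {"city": "Shelbyville", "birth_date": "1990-09-30", "sex": "F", "zip": "02202"},
]


def unique_fraction(rows, columns):
    """Fraction of rows that are the only row with their combination of values."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


print(unique_fraction(records, ["city", "birth_date", "sex"]))
print(unique_fraction(records, ["city", "birth_date", "sex", "zip"]))
```

Any record that is unique on such a combination can be matched against a public list that carries the same attributes together with a name.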
![Page 25: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/25.jpg)
The purpose of data mining
The purpose of data mining is to discriminate …
- who gets the loan
- who gets the special offer
Certain kinds of discrimination are unethical, and illegal
- racial, sexual, religious, …
But it depends on the context
- sexual discrimination is usually illegal
- … except for doctors, who are expected to take gender into account
… and information that appears innocuous may not be
- ZIP code correlates with race
- membership of certain organizations correlates with gender
![Page 26: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/26.jpg)
Correlation and Causation
Correlation does not imply causation
- As ice cream sales increase, so does the rate of drownings. Therefore ice cream consumption causes drowning???
Data mining reveals correlation, not causation
- but really, we want to predict the effects of our actions
Source: commons.wikimedia.org
Source: www.thevisualeverything.com
Source: www.michaelnielsen.org
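The ice cream example can be reproduced with a small simulation in which a third variable, temperature, drives both quantities and neither affects the other. All numbers here (temperature range, slopes, noise levels, seed) are invented for illustration and are not real sales or drowning data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)
days = []
for _ in range(20000):
    temperature = rng.uniform(0.0, 35.0)
    # Both depend on temperature only; ice cream never influences drownings.
    ice_cream = 10.0 * temperature + rng.gauss(0.0, 20.0)
    drownings = 0.5 * temperature + rng.gauss(0.0, 2.0)
    days.append((temperature, ice_cream, drownings))

r_all = pearson([d[1] for d in days], [d[2] for d in days])

# Compare only days with nearly the same temperature.
similar = [d for d in days if 20.0 <= d[0] <= 22.0]
r_similar = pearson([d[1] for d in similar], [d[2] for d in similar])

print(f"all days: r = {r_all:.2f}")
print(f"similar temperature only: r = {r_similar:.2f}")
```

Across all days the two series are strongly correlated, but among days of similar temperature the correlation largely disappears, which is what one expects when a common cause explains the association.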
![Page 27: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/27.jpg)
5.3 Summary
Privacy of personal information
Anonymization is harder than you think
Reidentification from supposedly anonymized data
Data mining and discrimination
Correlation does not imply causation
![Page 28: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/28.jpg)
5.4 Summary By Ya, Kexin
![Page 29: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/29.jpg)
5.4 SUMMARY
There’s no magic in data mining
– Instead, a huge array of alternative techniques
There’s no single universal “best method”
– It’s an experimental science!
– What works best on your problem?
![Page 30: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/30.jpg)
5.4 SUMMARY
Produce comprehensive models
When attributes contribute equally and independently to the decision
Simply stores the training data without processing it
Calculate a linear decision boundary
Avoids overfitting, even with large numbers of attributes
Determines the baseline performance
![Page 31: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/31.jpg)
5.4 SUMMARY
Weka makes it easy
– ... maybe too easy?
There are many pitfalls
– You need to understand what you’re doing!
filters Attribute selection
Data visualization classifiers clusters
![Page 32: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/32.jpg)
5.4 SUMMARY
Focus on evaluation ... and significance
– Different algorithms differ in performance – but is it significant?
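A simple way to ask “is it significant?” is a paired t-test on per-fold accuracies, sketched below. The accuracy values are hypothetical, and this is the plain paired t-test, not the corrected variant that the Weka Experimenter offers; 2.262 is the two-sided 5% critical value of the t distribution with 9 degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """t statistic for the mean of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical accuracies of two classifiers on the same 10 folds.
acc_a = [0.80, 0.82, 0.78, 0.85, 0.81, 0.79, 0.83, 0.80, 0.84, 0.82]
acc_b = [0.79, 0.83, 0.77, 0.84, 0.82, 0.78, 0.82, 0.81, 0.83, 0.81]

T_CRITICAL = 2.262  # two-sided, 5% level, 9 degrees of freedom

t = paired_t(acc_a, acc_b)
print(f"mean A = {mean(acc_a):.3f}, mean B = {mean(acc_b):.3f}, t = {t:.2f}")
print("significant" if abs(t) > T_CRITICAL else "not significant at the 5% level")
```

Here classifier A has the higher mean accuracy, yet the difference is small relative to the fold-to-fold variation, so the test does not call it significant.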
![Page 33: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/33.jpg)
![Page 34: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/34.jpg)
Advanced Datamining with Weka
Some missing parts in the lectures
- Filtered Classifier
- Cost-sensitive evaluation and classification
- Attribute selection
- Clustering
- Association rules
- Text classification
- Weka Experimenter
![Page 35: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/35.jpg)
Filtered Classifier
The filter is built from the training data only; the test data is passed through it but never used to configure it
Why do we need Filtered Classifier?
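The idea can be sketched with a standardization filter whose parameters are estimated from the training data alone and then reused unchanged on the test data. This is a plain Python illustration, not Weka's API; the `Standardize` class and the numbers are hypothetical.

```python
from statistics import mean, pstdev


class Standardize:
    """A tiny filter: learns mean and standard deviation, then rescales values."""

    def fit(self, values):
        self.mu = mean(values)
        self.sigma = pstdev(values) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mu) / self.sigma for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 12.0]

# The filter's parameters come from the training data only ...
scaler = Standardize().fit(train)
train_z = scaler.transform(train)
# ... and the same, already-fitted filter is applied to the test data.
test_z = scaler.transform(test)

print(train_z)
print(test_z)
```

If the filter were fitted on training and test data together, information about the test set would leak into the model, and the evaluation would no longer be on fresh data.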
![Page 36: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/36.jpg)
Cost-sensitive evaluation and classification
Costs of different decisions and different kinds of errors
Costs in data mining
- Misclassification costs
- Test costs
- Costs of cases
- Computation costs
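Misclassification costs can be made concrete by weighting a confusion matrix with a cost matrix. The two confusion matrices and the 5-to-1 cost ratio below are hypothetical, chosen so that the classifier with more errors turns out to be the cheaper one.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over all (actual, predicted) cells."""
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


def n_errors(confusion):
    """Number of off-diagonal (misclassified) instances."""
    return sum(
        confusion[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
        if i != j
    )


# Rows are actual classes, columns are predicted classes.
confusion_a = [[50, 10],
               [5, 35]]
confusion_b = [[40, 20],
               [1, 39]]

# Assumed costs: misclassifying an actual class-1 instance is 5 times as bad.
cost = [[0, 1],
        [5, 0]]

print(n_errors(confusion_a), total_cost(confusion_a, cost))
print(n_errors(confusion_b), total_cost(confusion_b, cost))
```

Classifier A makes fewer errors overall, but classifier B makes fewer of the expensive kind, so it has the lower total cost under this cost matrix.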
![Page 37: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/37.jpg)
Attribute Selection
Useful parts of Attribute Selection
- Select relevant attributes
- Remove irrelevant attributes
Reasons for Attribute Selection
- Simpler model
- More transparent and easier to understand
- Shorter training time
- Knowing which attributes are important
![Page 38: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/38.jpg)
Clustering
Cluster the instances according to their attribute values
Clustering methods: k-means, k-means++
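A minimal sketch of k-means is given below: assign each instance to its nearest centroid, move each centroid to the mean of its cluster, and repeat until nothing changes. It uses plain random seeding of the initial centroids, not the k-means++ seeding scheme, and the six 2-D points are made up for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm with randomly chosen initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On data with less clearly separated groups the result can depend on the initial centroids, which is the problem that better seeding schemes such as k-means++ aim to reduce.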
![Page 39: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/39.jpg)
Experimenter
![Page 40: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/40.jpg)
Experimenter
![Page 41: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/41.jpg)
Experimenter
![Page 42: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/42.jpg)
Acknowledgement
Thanks to Prof. Ian H. Witten for his direction and his Weka MOOC.
![Page 43: Data Mining with Weka Putting it all together](https://reader036.vdocuments.us/reader036/viewer/2022081505/568163e1550346895dd53cbf/html5/thumbnails/43.jpg)
Thank you for your kind attention