oscon miller 2011
Post on 16-May-2015
Mike Miller, _milleratmit, July 25, 2011
Bayes on your (Big)Couch
Mike Miller, Oscon 2011
I want my app to do _this_
CouchDB in a slide
• Schema-free document database management system
  - Documents are JSON objects
  - Able to store binary attachments
• RESTful API
  - http://wiki.apache.org/couchdb/reference
• Views: custom, persistent representations of your data
  - Incremental MapReduce with results persisted to disk
  - Fast querying by primary key (views stored in a B-tree)
• Bi-directional replication
  - Master-slave and multi-master topologies supported
  - Optional 'filters' to replicate a subset of the data
  - Edge devices (mobile phones, sensors, etc.)
BigCouch = Couch + Scaling
• Open Source, Apache License
• Horizontal Scalability
  - Easily add storage capacity by adding more servers
  - Computing power (views, compaction, etc.) scales with more servers
• No SPOF
  - Any node can handle any request
  - Individual nodes can come and go
• Transparent to the Application
  - All clustering operations take place "behind the curtain"
  - Looks (mostly) like a single-server instance of CouchDB
...back to making my app smart
Sample Data
[Figure: "Height vs. Weight" scatter plot. x-axis: Weight [lbs], 80 to 220; y-axis: Height [in], 35 to 80; series: Girls, Boys]
Naive Bayes Classifier
[Figure: Gaussian curve ("gaus"), x from -3 to 3, y from 0 to 0.4]
[Formula annotations: height, male, mean male height, male height variance]
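The annotations on this slide (height, male, mean male height, male height variance) suggest that it showed the Gaussian likelihood a naive Bayes classifier uses for a continuous field. The formula itself did not survive extraction, so what follows is a reconstruction under that assumption. The symbols are chosen here: h is the height, w the weight, and μ and σ² are the per-class mean and variance.

```latex
P(h \mid \text{male}) =
  \frac{1}{\sqrt{2\pi\,\sigma^{2}_{\text{male}}}}
  \exp\!\left( -\frac{(h - \mu_{\text{male}})^{2}}{2\,\sigma^{2}_{\text{male}}} \right)

P(\text{male} \mid h, w) \;\propto\; P(\text{male})\, P(h \mid \text{male})\, P(w \mid \text{male})
```

The second line is the usual naive Bayes step: fields are treated as independent given the class, so their likelihoods multiply.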
Implementation Plan
[Figure: "Height vs. Weight" scatter plot. x-axis: Weight [lbs]; y-axis: Height [in]; series: Girls, Boys]
Model people as documents in CouchDB
Calculate means/variances with MapReduce
Run the classifier in CouchDB as a post-MapReduce hook ("_list")
• Note:
  - Do not need to specify fields to use in classification
  - Multi-class implementation
  - Continuous, incremental training! Results improve as training data trickles in.
3 ways to follow along
• couchapp: Python tool to push/pull from other CouchDBs
  > sudo easy_install -U couchapp
  > couchapp clone 'http://millertime.cloudant.com/bitb'
  Create an account at cloudant.com
  > curl -X PUT 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
  > couchapp push 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
• GitHub
  > git clone git@github.com:mlmiller/bayes.git
• CouchDB replication to your Cloudant account
  - Bonus: brings along the data, too!
The Code
Classifier (probability calculator)
View code to calculate means and variances
Post-MapReduce hook ("_list" method)
You can ignore everything else
Client-side test via node.js
Data Model
'class' field present => document is training data
Arbitrary number of numerical fields allowed
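The slide's example document is not in the transcript. As an illustration only, documents under this model might look like the following. The field names height, weight and class come from the talk; the _id values, the label "male" and the numbers are hypothetical, and numericFields is a helper invented for this sketch.

```javascript
// Hypothetical training document: the 'class' field carries the known label.
var trainingDoc = { _id: "person-0001", class: "male", height: 70.2, weight: 180.5 };

// Hypothetical document to classify: same numeric fields, no 'class' field.
var unknownDoc = { _id: "person-0002", height: 65.65, weight: 168.61 };

// Numeric fields are discovered at run time, so no field list has to be configured.
function numericFields(doc) {
  return Object.keys(doc).filter(function (key) {
    return typeof doc[key] === "number";
  });
}

console.log(numericFields(trainingDoc)); // [ 'height', 'weight' ]
console.log(numericFields(unknownDoc));  // [ 'height', 'weight' ]
```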
Training via MapReduce
'class' field present => document is training data
Calculate mean/variance for all numerical fields in a document
Map emits: ([<class>, <field>], <value>)
Reduce: _stats (Erlang builtin)
views/training/map.js
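The contents of views/training/map.js are not in the transcript. A minimal sketch of a map function that follows the emit pattern above could look like this. It is an assumption about the shape of the code, not the author's file. The local emit is a stand-in for CouchDB's built-in emit so the sketch runs under plain node, and the two sample documents are hypothetical.

```javascript
// Stand-in for CouchDB's built-in emit(), collecting rows for inspection.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function sketch: only documents with a 'class' field count as training data.
function map(doc) {
  if (!doc.class) return;
  for (var field in doc) {
    // _id, _rev and 'class' are strings, so the type check skips them.
    if (typeof doc[field] === "number") {
      emit([doc.class, field], doc[field]);
    }
  }
}

map({ _id: "p1", class: "male", height: 70.2, weight: 180.5 }); // emits two rows
map({ _id: "p2", height: 64.0, weight: 120.0 });                // no class: emits nothing
console.log(emitted);
```

Pairing this map with the _stats reduce and querying with group=true would give one statistics row per [class, field] key.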
Bayes: Trained State
pre-reduce output
Bayes: Trained State
Count, Min, Max, Mean, Variance
Automatically Updated as new training Data Arrives
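CouchDB's built-in _stats reduce returns sum, count, min, max and sumsqr for each key, so the mean and variance listed above have to be derived from those values. A short sketch follows. The helper name meanAndVariance is hypothetical, and it uses the population variance (divide by count), which is an assumption about the convention.

```javascript
// Derive mean and (population) variance from a _stats reduce value.
function meanAndVariance(stats) {
  var mean = stats.sum / stats.count;
  var variance = stats.sumsqr / stats.count - mean * mean;
  return { mean: mean, variance: variance };
}

// Example _stats value for the three numbers 2, 4 and 6.
var example = { sum: 12, count: 3, min: 2, max: 6, sumsqr: 56 };
console.log(meanAndVariance(example)); // mean is 4, variance is about 2.67
```

Because sum, count and sumsqr are all additive, the reduce can fold in new documents without revisiting old ones, which is what makes the trained state update incrementally.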
Bayes Classifier
Load state from DB
No assumptions on Field Names
Calculate prob. for all possible hypotheses
lib/bayes_classifier.js
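The source of lib/bayes_classifier.js is not reproduced in the transcript. The following is a sketch of a Gaussian naive Bayes classifier consistent with the three points above, not the author's code. The layout of the state object, the function names, and the use of class counts as priors are all assumptions.

```javascript
// Log of the Gaussian density at x for a given mean and variance.
function logGaussian(x, mean, variance) {
  return -0.5 * Math.log(2 * Math.PI * variance) -
    ((x - mean) * (x - mean)) / (2 * variance);
}

// state (hypothetical layout):
//   { <class>: { count: <n>, fields: { <field>: { mean: <m>, variance: <v> } } } }
// sample: { <field>: <number>, ... } with whatever field names the data uses.
function classify(state, sample) {
  var classes = Object.keys(state);
  var total = classes.reduce(function (n, c) { return n + state[c].count; }, 0);

  // Log prior plus summed log likelihoods, for every possible hypothesis.
  var logScores = classes.map(function (c) {
    var score = Math.log(state[c].count / total);
    Object.keys(sample).forEach(function (field) {
      var s = state[c].fields[field];
      if (s && s.variance > 0) {
        score += logGaussian(sample[field], s.mean, s.variance);
      }
    });
    return score;
  });

  // Normalize to probabilities, subtracting the maximum for numerical stability.
  var max = Math.max.apply(null, logScores);
  var weights = logScores.map(function (s) { return Math.exp(s - max); });
  var norm = weights.reduce(function (a, b) { return a + b; }, 0);
  var probabilities = {};
  classes.forEach(function (c, i) { probabilities[c] = weights[i] / norm; });
  return probabilities;
}

// Hypothetical trained state and query, for illustration only.
var demoState = {
  male:   { count: 10, fields: { height: { mean: 70, variance: 9 }, weight: { mean: 180, variance: 400 } } },
  female: { count: 10, fields: { height: { mean: 64, variance: 9 }, weight: { mean: 130, variance: 400 } } }
};
console.log(classify(demoState, { height: 71, weight: 185 }));
```

Working in log space avoids underflow when many fields are multiplied together, and iterating over the sample's own keys keeps the classifier free of hard-coded field names.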
A brief aside...
• Let's test our classifier
  - Select 2000 documents for test
  - Randomly choose 1000 documents for training sample
  - Remaining documents used for validation
• Simulate continuous training
  - Add documents one at a time
  - After each document addition, test on all 1000 of our validation sample
  - Record and plot fraction of validation sample properly classified
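The procedure above can be sketched as a small loop. This is an illustration under simplifying assumptions, not the author's test harness: it uses only the height field, keeps the same running quantities that _stats maintains, falls back to a variance of 1 until a class has seen two distinct values, and runs on a tiny hand-made data set rather than the 2000 documents from the talk. All function names are hypothetical.

```javascript
// Fold one training document into the running per-class accumulators.
function addDocument(acc, doc) {
  var a = acc[doc.class] || (acc[doc.class] = { count: 0, sum: 0, sumsqr: 0 });
  a.count += 1;
  a.sum += doc.height;
  a.sumsqr += doc.height * doc.height;
}

// Predict the class whose Gaussian gives the height the highest log likelihood.
function predict(acc, height) {
  var best = null;
  var bestScore = -Infinity;
  Object.keys(acc).forEach(function (c) {
    var mean = acc[c].sum / acc[c].count;
    var variance = acc[c].sumsqr / acc[c].count - mean * mean;
    if (!(variance > 0)) variance = 1; // assumed fallback for very small samples
    var score = -0.5 * Math.log(2 * Math.PI * variance) -
      ((height - mean) * (height - mean)) / (2 * variance);
    if (score > bestScore) { bestScore = score; best = c; }
  });
  return best;
}

// Add training documents one at a time; after each, record validation accuracy.
function learningCurve(training, validation) {
  var acc = {};
  return training.map(function (doc) {
    addDocument(acc, doc);
    var correct = validation.filter(function (v) {
      return predict(acc, v.height) === v.class;
    }).length;
    return correct / validation.length;
  });
}

// Tiny hand-made example, for illustration only.
var training = [
  { class: "boy", height: 70 }, { class: "girl", height: 62 },
  { class: "boy", height: 72 }, { class: "girl", height: 64 }
];
var validation = [{ class: "boy", height: 71 }, { class: "girl", height: 63 }];
console.log(learningCurve(training, validation)); // [ 0.5, 1, 1, 1 ]
```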
A brief aside...
[Figure: x-axis: Number of documents in the training set]
Dramatic improvement with additional training data
... and back to the code
Test it yourself
• Client-side test via node.js
  > ./test.js height=<some number> weight=<some number>
  - Classifier runs server side, configured in line 6 of test.js
  - Can point this to your DB
Running as CouchApp
http://millertime.cloudant.com/bitb/_design/bayes/index.html
• Create a database (e.g., 'bitb') at cloudant.com
• Add data
• Then push your code
  > couchapp push 'http://<user>:<pwd>@<user>.cloudant.com/bitb'
• HTML & CSS served directly from BigCouch to the browser
• Heavy lifting of classification done server side
Running as API (_list)
> curl 'http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training?height=65.65&weight=168.61&format=json&group=true'
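In the URL above, index is the list function and training is the view it post-processes. The list function's source is not in the transcript; the following sketch shows how such a function could be written. CouchDB supplies getRow() and send() at run time and passes the query string in req.query. The function name bayesList and the shape of the JSON response are hypothetical, and the sketch compares log likelihoods only, which amounts to assuming equal class priors.

```javascript
// Sketch of a CouchDB _list function. In a design document it would be stored
// as an anonymous function; getRow() and send() are provided by CouchDB.
function bayesList(head, req) {
  // Rebuild the trained state from the grouped _stats rows, keyed by [class, field].
  var state = {};
  var row;
  while ((row = getRow())) {
    var cls = row.key[0];
    var field = row.key[1];
    var mean = row.value.sum / row.value.count;
    var variance = row.value.sumsqr / row.value.count - mean * mean;
    state[cls] = state[cls] || {};
    state[cls][field] = { mean: mean, variance: variance };
  }

  // Score each class using the numeric query parameters (e.g. height, weight).
  // Non-numeric parameters such as format=json and group=true are skipped.
  var scores = {};
  var best = null;
  Object.keys(state).forEach(function (c) {
    var score = 0;
    Object.keys(req.query).forEach(function (name) {
      var x = parseFloat(req.query[name]);
      var s = state[c][name];
      if (isNaN(x) || !s || !(s.variance > 0)) return;
      score += -0.5 * Math.log(2 * Math.PI * s.variance) -
        ((x - s.mean) * (x - s.mean)) / (2 * s.variance);
    });
    scores[c] = score;
    if (best === null || score > scores[best]) best = c;
  });

  send(JSON.stringify({ best: best, logLikelihood: scores }));
}
```

Because the view rows are read at request time, every call classifies against whatever training data has arrived so far.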
Wrapping Up: Bayes on BigCouch
• Simple code, powerful results
  - Light requirements on data model
    - Can be relaxed with more complex view code
  - Continuous learning is very powerful
    - e.g., time-based learning (automatically adapt to changing conditions)
  - Classification can be performed client- or server-side
    - Push documents into DB and they are auto-tagged!
  - More sophisticated classifiers easily implemented
    - e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.
  - View Engine allows simple deployment of sophisticated domain libraries in mass parallel
    - e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.
Give it a spin
Hosting, Management, Support for CouchDB and BigCouch
http://cloudant.com
http://github.com/cloudant/bigcouch