Doing data science with Clojure @sbelak [email protected] Curry On Rome, 2016

Upload: simon-belak

Post on 25-Jan-2017


TRANSCRIPT

Page 1: Doing data science with Clojure

Doing data science with Clojure

@sbelak [email protected]

Curry On Rome, 2016

Page 2: Doing data science with Clojure
Page 3: Doing data science with Clojure

↳ Design constraints

↳ The environment

↳ notebooks vs. REPL

↳ programmable environments

↳ The tools

↳ design decisions behind Huri (my data science library)

↳ data frame considered harmful

↳ encoding computation into structure

↳ composability

↳ feedback loops

↳ Expanding the ecosystem with mini compilers (to ggplot, scipy, …)

Page 4: Doing data science with Clojure

Design constraints

Page 5: Doing data science with Clojure
Page 6: Doing data science with Clojure

Divide and conquer complexity

Page 7: Doing data science with Clojure

Kafka

PostgreSQL

ElasticSearch

frontend actions, orderbook changes, monitoring, telemetry, flight changes, Intercom, …

s3

Intercom

Page 8: Doing data science with Clojure

Automatic views

• Event & attribute ontology

• Manual

• Inferred

• Seasonality detection

Page 9: Doing data science with Clojure

Data science: the process

(aka it’s about communication, stupid!)

Page 10: Doing data science with Clojure

The analytics chasm

Ideal. Almost real-time, can be done during brainstorming without disrupting flow

< 2min < 20min project

squeeze in somewhere in the day

fail

roadmap ahoy!

Page 11: Doing data science with Clojure

Think in distributions, not numbers
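
As a rough illustration of this point (not code from the talk), the sketch below summarises a sample with a few quantiles next to the mean. The `quantile` and `summary` helpers and the `response-times` data are made up for this example, and the quantile uses a simple nearest-rank rule.

```clojure
(defn quantile
  "Nearest-rank quantile of a non-empty collection of numbers, for 0 <= q <= 1."
  [q xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        ;; nearest rank is ceil(q * n), converted to a 0-based index and clamped
        idx    (-> (Math/ceil (* (double q) n)) long dec (max 0) (min (dec n)))]
    (nth sorted idx)))

(defn summary
  "Describes a sample by several points of its distribution, not one number."
  [xs]
  {:mean (/ (reduce + xs) (double (count xs)))
   :p10  (quantile 0.1 xs)
   :p50  (quantile 0.5 xs)
   :p90  (quantile 0.9 xs)})

;; Hypothetical data with a single outlier.
(def response-times [12 15 14 13 16 15 240])

(summary response-times)
```

With this sample the mean (about 46) is pulled far above the median (15) by one outlier, which a single number would hide.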

Page 12: Doing data science with Clojure

No throwaways

Page 13: Doing data science with Clojure

Sharing results

• Have one canonical version that is always current.

• Concentrate discussion in one place and make it searchable and persistent.

• Include methodology (=code).

Page 14: Doing data science with Clojure

The environment

Page 15: Doing data science with Clojure

REPL vs. notebook

Page 16: Doing data science with Clojure

REPL vs. notebook

+ [Ephemeral] [Spatial grouping]

Page 17: Doing data science with Clojure

(hacked) gorilla-repl.org +

auto-refresh +

hypothes.is

Page 18: Doing data science with Clojure

#alderaan #sales #growth

Page 19: Doing data science with Clojure

Code hidden, but can be expanded

Questions, comments, & annotations

Shareable

Periodically re-run to keep it fresh

#alderaan #sales #growth

discoverability

Page 20: Doing data science with Clojure

Notebooks as dashboards

Page 21: Doing data science with Clojure

The power of sharing runtime

Page 22: Doing data science with Clojure

Wishlist/TODO

• Better editor (shaunlebron.github.io/parinfer/ ?)

• Embedded REPL

• Better exception reporting

• Browsable data structures

Page 23: Doing data science with Clojure

The tools

Page 24: Doing data science with Clojure
Page 25: Doing data science with Clojure

Data frame considered harmful

• Data frame (=table) conflates representation and abstraction

• Clojure excels in structure manipulation/encoding

Page 26: Doing data science with Clojure

github.com/sbelak/huri

• No data structures, just functions over collections

• Composable (even DSLs — no macros!)

• Reasonably fast (transducers <3)

• Do-what-I-mean (auto-sort, liberal with inputs, …)

• Minimal buy-in
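
To make "just functions over collections" concrete, here is a minimal sketch that uses only `clojure.core` on a plain vector of maps. It is not Huri's API: the `orders` data and the `total-large` and `sum-by` helpers are hypothetical names chosen for this example.

```clojure
;; Hypothetical sample data: a plain vector of maps, no special table type.
(def orders
  [{:customer "a" :country "SI" :amount 10}
   {:customer "b" :country "IT" :amount 25}
   {:customer "c" :country "SI" :amount 40}
   {:customer "d" :country "IT" :amount 5}])

;; A transducer pipeline: filter rows and extract a column in a single pass,
;; without building intermediate sequences.
(def large-amounts
  (comp (filter #(>= (:amount %) 10))
        (map :amount)))

(defn total-large
  "Sum of all amounts of at least 10."
  [rows]
  (transduce large-amounts + 0 rows))

(defn sum-by
  "Sums (valf row) per group (keyf row), using nothing but reduce."
  [keyf valf rows]
  (reduce (fn [acc row]
            (update acc (keyf row) (fnil + 0) (valf row)))
          {}
          rows))

(total-large orders)             ; => 75
(sum-by :country :amount orders) ; => {"SI" 50, "IT" 30}
```

Because the rows are ordinary maps, every existing Clojure function works on them unchanged, which is one way to read "minimal buy-in".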

Page 27: Doing data science with Clojure

composable data structure based DSLs

->> and partial friendly

Support reaching into nested structures everywhere

vanilla vector of maps

interoperability

Provide curried versions where possible

Page 28: Doing data science with Clojure

Composability is key to quick iterating

• Curried versions where possible

• ->> and partial friendly

• Side benefit: consistent API

• Generalised accessors (reaching into complex structures everywhere via comp)

function

map key

“virtual” structure
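
The ideas on this slide can be sketched roughly as follows. This is an illustration rather than Huri's actual API: `where`, `destination`, `load-factor`, and the `flights` data are hypothetical, and reading a "virtual" structure as a value derived from the row is an assumption.

```clojure
(defn where
  "Keeps rows for which (pred (accessor row)) is truthy.
  The accessor can be any function of a row: a map key, a plain function,
  or a composition built with comp. Called without rows, it returns a
  function of rows (a curried version), so it works with comp and ->>."
  ([accessor pred]
   (fn [rows] (where accessor pred rows)))
  ([accessor pred rows]
   (filter (comp pred accessor) rows)))

;; Hypothetical sample data.
(def flights
  [{:id 1 :route {:from "LJU" :to "FCO"} :seats 120 :sold 90}
   {:id 2 :route {:from "LJU" :to "FCO"} :seats 100 :sold 40}
   {:id 3 :route {:from "FCO" :to "LJU"} :seats 150 :sold 150}])

;; Accessor reaching into a nested structure via comp.
(def destination (comp :to :route))

;; Accessor for a derived value that is not stored in the row.
(defn load-factor [{:keys [sold seats]}] (/ sold seats))

;; Rows come last, so steps chain with ->>.
(def busy-ids
  (->> flights
       (where destination #{"FCO"})
       (where load-factor #(>= % 3/4))
       (map :id)))

;; Curried versions compose into a reusable query (comp applies right to left).
(def busy-flights-to-fco
  (comp (where load-factor #(>= % 3/4))
        (where destination #{"FCO"})))
```

Keeping the collection as the last argument and treating every accessor as a plain function is what gives the consistent API the slide mentions as a side benefit.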

Page 29: Doing data science with Clojure

“This is possibly Clojure’s most important property: the syntax expresses the code’s semantic layers. An experienced reader of Clojure can skip over most of the code and have a lossless understanding of its high-level intent.”

— Z. Tellman, Elements of Clojure

Page 30: Doing data science with Clojure

On feedback

Page 31: Doing data science with Clojure

Catching errors early ⇒ more context ⇒ easier debugging ⇒ faster iterating

Page 32: Doing data science with Clojure

clojure.spec

=>

Should have been a keyword->fn map
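
The annotation above points at a case where a keyword->fn map was expected. A minimal sketch of that kind of early check with `clojure.spec.alpha` might look like this. The `::aggregations` spec and the `rollup` function are hypothetical and not taken from the talk or from Huri.

```clojure
(require '[clojure.spec.alpha :as s])

;; Assumed shape for this example: a map from keyword to function.
(s/def ::aggregations (s/map-of keyword? ifn?))

(defn rollup
  "Hypothetical helper: applies each aggregation fn to rows, keyed by name.
  Validates its input up front, so a mistake is reported at the call site
  instead of surfacing deep inside the computation."
  [aggregations rows]
  (when-not (s/valid? ::aggregations aggregations)
    (throw (ex-info (str "Invalid aggregations: "
                         (s/explain-str ::aggregations aggregations))
                    {:aggregations aggregations})))
  (into {} (map (fn [[k f]] [k (f rows)])) aggregations))

;; Hypothetical sample data.
(def rows [{:amount 10} {:amount 20}])

;; A well-formed call.
(rollup {:n count :total #(reduce + (map :amount %))} rows)

;; A malformed call: a vector of pairs instead of a map fails the spec
;; and throws an ExceptionInfo that carries the spec explanation.
(try
  (rollup [[:n count]] rows)
  (catch clojure.lang.ExceptionInfo e
    (println (ex-message e))))
```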

Page 33: Doing data science with Clojure

<3 Bret Victor

Page 34: Doing data science with Clojure

What about machine learning?

farm it out to sklearn

Page 35: Doing data science with Clojure

Mini compilers for DSLs targeting a specific library in another language

Page 36: Doing data science with Clojure

huri.plot

• DSL that compiles to ggplot2

• Targets Gorilla REPL

• Follows the rest of Huri’s design philosophy

• bar chart, scatter plot, line chart, box & violin plot, heatmap, histogram
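
To show what a mini compiler of this kind could look like, the sketch below turns a small data-structure DSL into a string of R code that calls ggplot2. It is not how huri.plot is implemented: the vector-based DSL and the `emit` and `plot` functions are assumptions made for this example, and only the emitted ggplot2 function names (`ggplot`, `aes`, `geom_line`, `labs`) are real.

```clojure
(require '[clojure.string :as str])

(defn emit
  "Compiles a nested Clojure form into a string of R source code.
  Vectors are calls: [:fn arg ...]. Maps are named arguments.
  Keywords become bare R names, strings are quoted, numbers pass through."
  [form]
  (cond
    (vector? form)  (let [[f & args] form]
                      (str (name f) "(" (str/join ", " (map emit args)) ")"))
    (map? form)     (str/join ", " (for [[k v] form]
                                      (str (name k) " = " (emit v))))
    (keyword? form) (name form)
    (string? form)  (pr-str form)
    :else           (str form)))

(defn plot
  "Joins the compiled layers with ggplot2's + operator."
  [& layers]
  (str/join " + " (map emit layers)))

;; Hypothetical plot description: df, day, and revenue are made-up names.
(def r-code
  (plot [:ggplot :df [:aes {:x :day :y :revenue}]]
        [:geom_line]
        [:labs {:title "Revenue"}]))

(println r-code)
;; prints: ggplot(df, aes(x = day, y = revenue)) + geom_line() + labs(title = "Revenue")
```

Because the plot description is plain data, it can be built and transformed with the same functions as the rest of the analysis before being handed to R.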

Page 37: Doing data science with Clojure
Page 38: Doing data science with Clojure

Takeaways

• Speed-of-answer matters

• Data science is about communication

• We don’t have to reinvent every wheel in Clojure

• Clojure is fantastic at structure manipulation, play to its strengths

• Blurring the line between environment and work is a powerful idea