Transcript
Page 1: Open Source Software for Data Scientists -- BigConf 2014

Open Source Software for Data Scientists

Charlie Greenbacker, Director of Data Science 28 Mar 2014


Altamira Technologies Corporation 2014

Agenda

■  What is a Data Scientist?
■  Why use Open Source Software?
■  Survey of Open Source Software Tools:
    ¤ Statistical Analysis
    ¤ Data Mining
    ¤ Machine Learning
    ¤ Natural Language Processing
    ¤ Social Network Analysis
    ¤ Data Visualization


About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable

photo: Columbia Pictures


Best reason for not finishing PhD


@ExploreAltamira


What is a Data Scientist?


credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)


http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/

Paul Cooper, ITProPortal.com

“A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”


Computer Programming

Mathematics & Analytic Methodology

Distributed Computing & Big Data

Data Science

Statistical Analysis

Data Mining

Machine Learning

Natural Language Processing

Social Network Analysis

Data Visualization

Domain Knowledge & Communication Skills

etc.


Why use Open Source Software?


photo: Karen (https://flic.kr/p/5njby2)

THERE ARE NO SILVER BULLETS.


photo: Paul Inkles (https://flic.kr/p/e2QMS5)

IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT.


photo: Valugi (http://bit.ly/1jrvVBC)

BUDGETS DON’T SCALE.


Survey of OSS Tools


Statistical Analysis

■  Name: R
■  Creator: Gentleman, Ihaka, et al.
■  License: GPL Version 2
■  Website: r-project.org
■  Source: cran.us.r-project.org/src/base/
■  Features:
    ¤ Language & environment for statistical computing & viz
    ¤ Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
    ¤ 5000+ packages available in CRAN repository


Data Mining

■  Name: Pandas
■  Creator: Wes McKinney, et al.
■  License: BSD 3-Clause License
■  Website: pandas.pydata.org
■  Source: github.com/pydata/pandas
■  Features:
    ¤ Data analysis workflow in Python
    ¤ DataFrame object for fast manipulation & indexing
    ¤ Tools for reading & writing data between formats
    ¤ Label-based slicing, indexing, and subsetting of data


Data Mining

■  Name: Impala
■  Creator: Cloudera
■  License: Apache License 2.0
■  Website: impala.io
■  Source: github.com/cloudera/impala
■  Features:
    ¤ MPP query engine implemented on Hadoop
    ¤ Low latency, high concurrency SQL & BI queries
    ¤ Same interfaces as Apache Hive, but ~24x faster
    ¤ Written in C++; does not use MapReduce


Machine Learning

■  Name: Mahout
■  Creator: ASF
■  License: Apache License 2.0
■  Website: mahout.apache.org
■  Source: svn.apache.org/viewvc/mahout
■  Features:
    ¤ Distributed/scalable ML library for Hadoop
    ¤ Classification, Clustering, Collaborative filtering
    ¤ Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
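
Mahout's contribution is running algorithms like these across a Hadoop cluster, but the core ideas are compact. As a rough single-machine illustration of k-means, one of the algorithms named on this slide, here is a plain-Python sketch. It does not use Mahout's API; the function name `kmeans`, the first-k seeding, and the sample points are assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster points (tuples of floats) with Lloyd's algorithm.

    Returns (centroids, labels). Illustrative only, not Mahout code.
    """
    # Assumption: seed the centroids with the first k points so the
    # example is deterministic; real libraries use smarter seeding.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = kmeans(points, k=2)
```

The two steps inside the loop are independent per point and per cluster, which is what makes the algorithm a natural fit for a distributed library.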


Machine Learning

■  Name: Scikit-learn
■  Creator: Cournapeau, et al.
■  License: BSD 3-Clause License
■  Website: scikit-learn.org
■  Source: github.com/scikit-learn/scikit-learn
■  Features:
    ¤ ML library for Python built on NumPy, SciPy, matplotlib
    ¤ Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
    ¤ SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
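
To make one item from the list above concrete, here is a minimal sketch of k-NN classification in plain Python. It is not scikit-learn's implementation or API (the library provides its own `KNeighborsClassifier`); the helper name `knn_predict` and the toy training data are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_tuple, label) pairs. Illustrative only.
    """
    # Sort training points by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels of those neighbors and return the most frequent one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups labeled "a" and "b".
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
prediction = knn_predict(train, (1.1, 1.0))
```

A library version adds what this sketch omits: efficient neighbor search, distance weighting, and integration with cross-validation and preprocessing.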


Machine Learning + NLP

■  Name: Mallet
■  Creator: UMass (McCallum, et al.)
■  License: Common Public License 1.0
■  Website: mallet.cs.umass.edu
■  Source: hg-iesl.cs.umass.edu/hg/mallet
■  Features:
    ¤ Java-based “Machine Learning for Language Toolkit”
    ¤ Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
    ¤ Efficient implementation of LDA for topic modeling


Natural Language Processing

■  Name: NLTK
■  Creator: Bird, Loper, et al.
■  License: Apache License 2.0
■  Website: nltk.org
■  Source: github.com/nltk/nltk
■  Features:
    ¤ Natural Language Toolkit for Python
    ¤ Built-in support for dozens of corpora & trained models
    ¤ Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
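
Tokenization is the first step in most of the pipelines listed above. The sketch below shows the idea with a simple regular expression from the Python standard library. It is a deliberately naive stand-in, not NLTK's tokenizer; the function name `tokenize` and the sample sentence are assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    A token here is any run of letters, digits, or apostrophes.
    Illustrative only; a real tokenizer handles many more cases.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

text = "There are no silver bullets. Budgets don't scale, but open source tools scale."
tokens = tokenize(text)

# Word frequencies are a common next step once text is tokenized.
counts = Counter(tokens)
```

A toolkit earns its keep on the cases this regex gets wrong, such as abbreviations, hyphenated words, and languages that do not separate words with spaces.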


Natural Language Processing

■  Name: Stanford CoreNLP
■  Creator: Stanford NLP Group
■  License: GPL Version 2
■  Website: nlp.stanford.edu/software/corenlp.shtml
■  Source: github.com/stanfordnlp/CoreNLP
■  Features:
    ¤ Suite of high-quality, Java-based NLP tools
    ¤ Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
    ¤ Includes models for English, Chinese, Arabic, German


NLP + Geospatial Analysis

■  Name: CLAVIN
■  Creator: Berico Technologies
■  License: Apache License 2.0
■  Website: clavin.io
■  Source: github.com/Berico-Technologies/CLAVIN
■  Features:
    ¤ Extracts location names from text, resolves to gazetteer
    ¤ Employs context-based geospatial entity resolution
    ¤ ~75% accuracy, processes 1M documents per hour
    ¤ Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org


Social Network Analysis

■  Name: Gephi
■  Creator: UTC France
■  License: GPL Version 3
■  Website: gephi.org
■  Source: github.com/gephi/gephi
■  Features:
    ¤ Network analysis and visualization package for Java
    ¤ Dynamic network analysis with temporal filtering
    ¤ Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
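
For a sense of what one of these metrics computes, here is a small power-iteration sketch of PageRank in plain Python. It is not Gephi code; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-node example graph are assumptions for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `graph` maps each node to a list of the nodes it links to.
    Assumption: every link target also appears as a key in `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical graph: "c" receives links from "a", "b", and "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this graph "c" ends up with the highest score because three nodes link to it, while "d", which nothing links to, keeps only the teleport share.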


Data Visualization

■  Name: D3.js
■  Creator: Mike Bostock
■  License: BSD 3-Clause License
■  Website: d3js.org
■  Source: github.com/mbostock/d3
■  Features:
    ¤ JavaScript library based on HTML, SVG, and CSS
    ¤ Binds data to DOM & enables transformations
    ¤ ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.


Fusion, Analysis, and Visualization

■  Name: Lumify
■  Creator: Altamira
■  License: Apache License 2.0
■  Website: lumify.io
■  Source: github.com/altamiracorp/lumify
■  Features:
    ¤ Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
    ¤ Integrates structured data, text, images, video
    ¤ Cell-level security & access controls
    ¤ Live, shared collaborative workspaces


Final Thought…

Save your $$$ for:
¨  People
    ¤ salaries, training, etc.
¨  Resources
    ¤ hardware, AWS, etc.
¨  Proprietary software
    ¤ if no viable OSS alternative exists

photo: Brett Weinstein (http://bit.ly/1dHXvqJ)

Springer’s FINAL THOUGHT


open source software for data scientists

oss4ds.com


Charlie Greenbacker | @greenbacker www.oss4ds.com

