Data Science Primer Documentation

Team

Mar 19, 2019

Technical

1 Bash
2 Big Data
3 Databases
4 Data Engineering
5 Data Wrangling
6 Data Visualization
7 Deep Learning
8 Machine Learning
9 Python
10 Statistics
11 SQL
12 Glossary
13 Business
14 Ethics
15 Mastering The Data Science Interview
16 Learning How To Learn
17 Communication
18 Product
19 Stakeholder Management
20 Datasets
21 Libraries
22 Papers
23 Other Content
24 Contribute

The primary purpose of this primer is to give a cursory overview of both technical and non-technical topics associated with Data Science. Typically, educational resources focus on technical topics. However, in reality, mastering the non-technical topics can lead to even greater dividends in your professional capacity as a data scientist.

Since the task of a Data Scientist at any company can be vastly different (Inference, Analytics, and Algorithms, according to Airbnb), there's no real roadmap to using this primer. Feel free to jump around to topics that interest you and which may align closer with your career trajectory.

Ultimately, this resource is for anyone who would like to get a quick overview of a topic that a data scientist may be expected to have proficiency in. It's not meant to serve as a replacement for fundamental courses that contribute to a Data Science education, nor as a way to gain mastery in a topic.

Warning: This document is under early stage development. If you find errors, please raise an issue or contribute a better definition.


CHAPTER 1

Bash

• Introduction

• Subj_1

1.1 Introduction

xx

1.2 Subj_1

Code

Some code

print("hello")

CHAPTER 2

Big Data

• Introduction

• Subj_1

https://www.quora.com/What-is-a-Hadoop-ecosystem

2.1 Introduction

The tooling and methodologies for dealing with the phenomenon of big data were originally created in response to the exponential growth in data generation (TODO INSERT quote about data generation). When talking about big data, we mean (insert what we mean).

We'll be focusing exclusively on the Apache Hadoop Ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims.

Why It's Important

While a data scientist may not be directly involved in managing big data and performing Extract, Transform, and Load (ETL) tasks, they'll likely be responsible, at a minimum, for querying the sources where this data is stored. Ideally, a data scientist should be familiar with how this data is stored, how to query it, and how to transform it (e.g. Spark, Hadoop, and Hive are three key technologies to be familiar with).

2.2 Subj_1

Code

Some code

print("hello")

CHAPTER 3

Databases

• Introduction

• Relational

  – Types

  – Advantages

  – Disadvantages

• Non-relational/NoSQL

  – Types

  – Advantages

  – Disadvantages

• Terminology

3.1 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing, and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds are a variety of database sub-types, which the following sections explore in detail.

Why It's Important

Oftentimes a data scientist will have to work with a database in order to retrieve the data they need to build their models or to do data analysis. Thus, it's important that a data scientist familiarize themselves with the different types of databases and how to query them.

3.2 Relational

A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes) and rows that represent the data within the table. You can think of a column as the header indicating the name of the data being stored, and a row as the actual data points stored in the database. Oftentimes each row has a unique identifier associated with it, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.

To get data out of a relational database you would use SQL, with different relational databases using slightly different versions of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a Transaction. Transactions in relational databases represent a unit of work performed against the database.
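To make primary keys, foreign keys, and joins concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are made up for illustration; other relational databases would use slightly different SQL.

```python
import sqlite3

# In-memory database; all table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    )
    """
)

conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5)],
)
conn.commit()

# The foreign key customer_id joins each order to its customer row.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
    """
).fetchall()
print(totals)
```

Each order row stores only the customer's id, so the customer's name lives in exactly one place and the JOIN brings the two tables back together at query time.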

An important concept with transactions in relational databases is maintaining ACID, which stands for:

• Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails, then the whole transaction fails, leaving the database unchanged.

• Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.

• Isolation: Since transactions are often executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.

• Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
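Atomicity is the easiest of the four properties to see in a few lines. The sketch below uses Python's built-in sqlite3 module; the accounts table, the transfer helper, and the balances are hypothetical and only serve to show a failed transaction leaving the database unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move `amount` between accounts as one transaction; return True on success."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if any statement raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates succeed
failed = transfer(conn, "alice", "bob", 500)  # second update violates the CHECK
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the first UPDATE has already credited bob when the second one fails, yet the rollback undoes it, so neither balance changes.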

3.2.1 Types

While there are many different kinds of relational databases, the most popular are:

• PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.

• Oracle: A multi-model database management system that is produced and marketed by Oracle.

• MySQL: An open-source relational database management system with a proprietary paid version available for additional functionality.

• Microsoft SQL Server: A relational database management system developed by Microsoft.

• MariaDB: A MySQL-compatible database engine forked from MySQL.

3.2.2 Advantages

• SQL standards are well defined and commonly accepted.

• Easy to categorize and store data that can later be queried.

• Simple to understand, since the table structure and relationships are intuitive to most users.

• Data integrity through strong data typing and validity checks to make sure that data falls within an acceptable range.

3.2.3 Disadvantages

• Poor performance with unstructured data types due to schema and type constraints.

• Can be slower and harder to scale than NoSQL.

• Unable to map certain kinds of data, such as graphs.

3.3 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Scalability, multi-structured data, and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

3.3.1 Types

• Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to store user session data; examples include Redis and Amazon Dynamo.

• Document Store: Similar to a key-value store; however, in this case the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.

• Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. Column stores can be used in content management systems; examples include Cassandra and Apache HBase.

• Graph Store: These are based on the Entity-Attribute-Value model. Entities will have associated attributes and subsequent values when data is inserted. Nodes store data about each entity along with the relationships between nodes. Graph stores can be used in applications such as social networks; examples include Neo4j and ArangoDB.
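The difference between the first two store types can be sketched with plain Python dictionaries. This is only an in-memory stand-in and not the API of Redis, MongoDB, or any other real store; the keys, fields, and the find_by_tag helper are made up for illustration.

```python
import json

# Key-value store: the value is an opaque blob, so lookups are by key only.
session_store = {}
session_store["session:42"] = json.dumps({"user_id": 7, "cart": ["book"]})
session = json.loads(session_store["session:42"])

# Document store: values are structured documents, so the store can
# filter on the fields inside them.
posts = {
    "post:1": {"title": "Intro to SQL", "tags": ["sql", "databases"]},
    "post:2": {"title": "Why NoSQL", "tags": ["nosql", "databases"]},
}


def find_by_tag(documents, tag):
    """Return the ids of documents whose `tags` field contains `tag`."""
    return sorted(doc_id for doc_id, doc in documents.items() if tag in doc["tags"])


print(session["user_id"], find_by_tag(posts, "nosql"))
```

The session example mirrors the user-session use case, and the posts example mirrors the blogging platform use case mentioned in the list.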

3.3.2 Advantages

• High availability

• Schema-free or schema-on-read options

• Ability to rapidly prototype applications

• Elastic scalability

• Can store massive amounts of data

3.3.3 Disadvantages

• Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync.

• Less support and maturity in the NoSQL ecosystem.

3.4 Terminology

Query: A query can be thought of as a single action that is taken on a database.

Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as data stored. The two main types are vertical scalability, which is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc., and horizontal scalability, which has to do with adding more machines and splitting the work amongst them.

Normalization: This is a technique of organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique of organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
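The trade-off between the last two terms can be sketched without a database at all. In the rough illustration below, plain Python lists and dictionaries stand in for tables; the order and customer fields are hypothetical.

```python
# Denormalized: the customer's city is repeated on every order row.
orders_flat = [
    {"order_id": 1, "customer": "Ada", "city": "London", "amount": 20.0},
    {"order_id": 2, "customer": "Ada", "city": "London", "amount": 5.0},
    {"order_id": 3, "customer": "Grace", "city": "New York", "amount": 12.5},
]

# Normalized: each customer's city is stored once, and orders refer to
# the customer by key.
customers = {}
orders = []
for row in orders_flat:
    customers[row["customer"]] = {"city": row["city"]}
    orders.append(
        {"order_id": row["order_id"], "customer": row["customer"], "amount": row["amount"]}
    )

# Rebuilding the flat view needs a join-like lookup for every order.
rejoined = [{**order, "city": customers[order["customer"]]["city"]} for order in orders]
print(len(customers), rejoined == orders_flat)
```

The normalized form has one place to update when a customer moves city, while the denormalized form answers "which city did this order ship to" without any lookup.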



CHAPTER 4

Data Engineering

• Introduction

• Subj_1

4.1 Introduction

xxx

4.2 Subj_1

Code

Some code

print("hello")

CHAPTER 5

Data Wrangling

• Introduction

• Subj_1

5.1 Introduction

xxx

5.2 Subj_1

Code

Some code

print("hello")

CHAPTER 6

Data Visualization

• Introduction

• Subj_1

6.1 Introduction

xxx

6.2 Subj_1

Code

Some code

print("hello")

CHAPTER 7

Deep Learning

• Introduction

• Subj_1

7.1 Introduction

xxx

7.2 Subj_1

Code

Some code

print("hello")

CHAPTER 8

Machine Learning

• Introduction

• Subj_1

8.1 Introduction

xxx

8.2 Subj_1

Code

Some code

print("hello")

CHAPTER 9

Python

• Introduction

• Subj_1

9.1 Introduction

xxx

9.2 Subj_1

Code

Some code

print("hello")

CHAPTER 10

Statistics

Basic concepts in statistics for machine learning


CHAPTER 11

SQL

• Introduction

• Subj_1

11.1 Introduction

xxx

11.2 Subj_1

Code

Some code

print("hello")

CHAPTER 12

Glossary

Definitions of common machine learning terms.

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

• Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in a low average difference.

• High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold: The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise return False.

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories:

• True Positives: we correctly predicted they do have diabetes

• True Negatives: we correctly predicted they don't have diabetes

• False Positives: we incorrectly predicted they do have diabetes (Type I error)

• False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
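The four categories can be counted directly from paired labels. This is a minimal sketch; the confusion_counts helper and the six sample labels are made up, with True standing for "has diabetes".

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, TN, FP, and FN for binary labels (True = has diabetes)."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1
        elif not a and not p:
            counts["TN"] += 1
        elif p:
            counts["FP"] += 1  # predicted positive, actually negative (Type I)
        else:
            counts["FN"] += 1  # predicted negative, actually positive (Type II)
    return counts


actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
matrix = confusion_counts(actual, predicted)
print(dict(matrix))
```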

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep learning grew out of a machine learning algorithm called the perceptron, or multilayer perceptron. It is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning is decades old, but today we have more data and cheap computing power, which make it powerful enough to achieve state-of-the-art accuracy. The algorithm is also known as an artificial neural network. Deep learning is much more than a traditional artificial neural network, but it was highly influenced by machine learning's neural networks and perceptron networks.

Dimension: Dimension means something different in machine learning and data science than it does in physics. Here, the dimension of data means how many features you have in your dataset. E.g. in an object detection application, the flattened image size and color channels (e.g. 28 × 28 × 3) make up the features of the input set. In house price prediction, if house size is the only feature in the dataset, we call it 1-dimensional data.

Epoch: One epoch is one complete pass of the algorithm over the entire dataset. The number of epochs is the number of times the algorithm sees the entire dataset.

Extrapolation: Making predictions outside the range of a dataset. E.g. My dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. "Color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection: Feature selection is the process of selecting relevant features from a dataset for creating a machine learning model.

Feature Vector: A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters: Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of a model. The depth of trees in a decision tree or the number of hidden layers in a neural network are examples of hyperparameters.

Induction: A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. We keep observing X, so we infer that Y must be True.

Instance: A data point, row, or sample in a dataset. Another term for observation.

Learning Rate: The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
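A toy example makes the learning rate trade-off visible. The sketch below assumes the simplest possible "hill", f(x) = x², whose gradient is 2x; the function, starting point, and step count are illustrative choices, not part of any particular library.

```python
def gradient_descent(learning_rate, steps=20, start=10.0):
    """Minimize f(x) = x**2, whose gradient is 2*x, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # step against the gradient
    return x


small = gradient_descent(0.01)   # moves toward 0, but slowly
good = gradient_descent(0.4)     # gets very close to 0 within 20 steps
too_big = gradient_descent(1.1)  # every step overshoots, so x grows
print(small, good, too_big)
```

With the same 20 steps, the low rate is still far from the minimum at 0, the moderate rate is essentially there, and the high rate ends up further away than where it started.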

Loss: Loss = true value (from the dataset) - predicted value (from the ML model). The lower the loss, the better the model (unless the model has overfitted to the training data). The loss is calculated on the training and validation sets, and its interpretation is how well the model is doing on these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation sets.

Machine Learning: Contribute a definition.

Model: A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks: Contribute a definition.

Normalization: Contribute a definition.

Null Accuracy: Baseline accuracy that can be achieved by always predicting the most frequent class ("B has the highest frequency, so let's guess B every time").
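Null accuracy takes only a couple of lines to compute. In this sketch the null_accuracy helper and the label list are made up for illustration.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


labels = ["B", "A", "B", "B", "C", "B", "A", "B"]
baseline = null_accuracy(labels)
print(baseline)
```

A model whose accuracy on these labels is not clearly above this baseline has learned little beyond the class frequencies.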

Observation: A data point, row, or sample in a dataset. Another term for instance.

Overfitting: Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set, but poorly on your test set (or new real-world data).

Parameters: Be the first to contribute.

Precision: In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{TruePositives}{TruePositives + FalsePositives}$$

Recall: Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

$$R = \frac{TruePositives}{TruePositives + FalseNegatives}$$

Recall vs Precision: Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the % of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren't helpful, however, because we missed 10 brain tumors. We were super precise when we tried, but we didn't try hard enough.

• Recall, or Sensitivity, provides another lens with which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
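The brain scan numbers can be checked directly. The sketch below plugs the second scenario (100 images, 10 with tumors, 1 tumor correctly flagged and nothing else flagged) into the precision and recall formulas; the function and variable names are illustrative.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were caught."""
    return tp / (tp + fn)


# 1 tumor flagged correctly, no healthy scan flagged, 9 tumors missed.
tp, fp, fn = 1, 0, 9
p = precision(tp, fp)  # 1.0, i.e. 100%
r = recall(tp, fn)     # 0.1, i.e. 10%
print(p, r)
```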

Regression: Predicting a continuous output (e.g. price, sales).

Regularization: Contribute a definition.

Reinforcement Learning: Training a model to maximize a reward via iterative trial and error.

Segmentation: Contribute a definition.

Specificity: In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{TrueNegatives}{TrueNegatives + FalsePositives}$$

Supervised Learning: Training a model using a labeled dataset.

Test Set: A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set: A set of observations used to generate machine learning models.

Transfer Learning: Contribute a definition.

Type 1 Error: False Positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and the company hires them, but they are actually bad.

Type 2 Error: False Negatives. The candidate was great, but the company passed on them.

Underfitting: Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem: A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning: Training a model to find patterns in an unlabeled dataset (e.g. clustering).

Validation Set: A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.
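One common way to act on that signal is early stopping. The sketch below is a simplified version under stated assumptions: the validation errors are a made-up list, one value per epoch, and the patience parameter (how many non-improving epochs to tolerate) is a hypothetical knob, not something the definition above prescribes.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return the index of the best epoch, stopping the scan once the
    validation error has not improved for `patience` epochs in a row."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error keeps rising: likely overfitting
    return best_epoch


val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.57, 0.63]
stop_at = early_stopping_epoch(val_errors)
print(stop_at)
```

In practice you would keep the model parameters saved at the returned epoch rather than the ones from the last epoch trained.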

Variance: How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.

CHAPTER 13

Business

• Introduction

• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

CHAPTER 14

Ethics

• Introduction

• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

CHAPTER 15

Mastering The Data Science Interview

• Introduction

• The Coding Challenge

• The HR Screen

• The Technical Call

• The Take Home Project

• The Onsite

• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that Data Science will be the sexiest job of the 21st Century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is that most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can vary hugely. This is partly because, as an industry, there still isn't an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it's important to determine what flavor of data science appeals to you. Based on your response to that, what you study and what questions you'll be asked will vary. Despite the differences in the types, generally speaking they'll follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple FizzBuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.
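For reference, the simple end of that range looks something like this. The sketch assumes the most common FizzBuzz wording (multiples of 3 become "Fizz", multiples of 5 become "Buzz", multiples of both become "FizzBuzz"); an actual challenge may phrase it differently.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


first_fifteen = [fizzbuzz(i) for i in range(1, 16)]
print(" ".join(first_fifteen))
```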

More often than not, you'll be provided with written test cases that will tell you if you've passed or failed a question. These will typically consider both correctness and complexity (i.e. how long did it take to run your code). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the 'correct' solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally when, even after passing all the test cases, you don't get to the next stage in the interview process.

Preparation

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first; when you hit a snag, move to a simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5–10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving Dynamic Programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
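The brute-force-then-optimize flow might look like the following. The problem here (does any pair of numbers in a list sum to a target?) is a made-up stand-in for whatever the interviewer asks, and both function names are hypothetical.

```python
def has_pair_brute_force(numbers, target):
    """First pass, O(n^2): check every pair of positions."""
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if numbers[i] + numbers[j] == target:
                return True
    return False


def has_pair_optimized(numbers, target):
    """Second pass, O(n): remember values seen so far and look up the complement."""
    seen = set()
    for value in numbers:
        if target - value in seen:
            return True
        seen.add(value)
    return False


numbers = [8, 3, 11, 5, 2]
print(has_pair_brute_force(numbers, 13), has_pair_optimized(numbers, 13))
```

Keeping the brute-force version around is useful even after optimizing, since it gives you a reference implementation to check the faster one against.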

If you have a few minutes at the end of the interview take advantage of the fact that yoursquore speaking to a technicalmember of the team Ask them about coding standards and processes how the team handles work and what their dayto day looks like

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking The Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied to what you'll be doing once you start working. They can either occur after the first HR screen and before a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g., "Here's a dataset, find some insights, and pitch them to business stakeholders") or focus on a more concrete deliverable (e.g., "Here's some data, build a classifier").

When possible, ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis (EDA). Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g., a power law distribution in customer revenue).
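As a rough illustration of checking a column for peculiarities, the sketch below summarizes a list of customer revenues and measures how concentrated they are. It assumes the revenue figures are already loaded as a plain Python list; the function name `revenue_summary` and the `top_fraction` parameter are made up for this example. A mean far above the median, together with a large share held by the top customers, hints at a heavy tail. That would be consistent with a power-law pattern, although it does not prove one.

```python
from statistics import mean, median
from typing import Dict, List


def revenue_summary(revenues: List[float], top_fraction: float = 0.1) -> Dict[str, float]:
    """Summarize a revenue column and the share of revenue held by the top customers."""
    total = sum(revenues)
    if not revenues or total <= 0:
        raise ValueError("revenues must be non-empty with a positive total")
    ordered = sorted(revenues, reverse=True)
    # Always look at at least one customer, even for very short lists.
    top_n = max(1, int(len(ordered) * top_fraction))
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "top_share": sum(ordered[:top_n]) / total,
    }


if __name__ == "__main__":
    # Hypothetical data: one customer dominates total revenue.
    sample = [1000, 50, 40, 30, 20, 20, 10, 10, 10, 10]
    print(revenue_summary(sample))
```

In a real take-home you would likely follow this up with a histogram or a log-log plot before drawing any conclusions.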

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or work through using the answers (without the questions) on this GitHub repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset a poor performance at an onsite. The rationale is that, despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more LeetCode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which typically evaluates your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioural interview questions such as "Tell me about yourself", "Why this company?", and "Tell me about a time you had to deal with a difficult colleague".

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desk working on their assigned projects, so try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500 company, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially for those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point, it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the option to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, the signing/relocation bonus is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401(k) matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.
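To make the "higher multiplier" argument concrete, here is a back-of-the-envelope sketch comparing a one-time signing bonus with an increase in base salary. Every number in it (the raise rate, the match rate, the dollar amounts) is a hypothetical assumption, as is the function name. It treats the employer 401(k) match as a flat percentage of salary and ignores taxes, match caps, and stock.

```python
def cumulative_gain_from_base_increase(
    base_increase: float,
    years: int,
    annual_raise: float = 0.03,
    match_rate: float = 0.04,
) -> float:
    """Extra salary plus extra employer match earned over `years` from a higher base."""
    total = 0.0
    for year in range(years):
        # The negotiated increase itself grows with each percentage raise.
        extra_salary = base_increase * (1 + annual_raise) ** year
        # Assumption: the employer match is a flat percentage of salary.
        total += extra_salary * (1 + match_rate)
    return total


if __name__ == "__main__":
    signing_bonus = 15_000  # hypothetical one-time payment
    gain = cumulative_gain_from_base_increase(base_increase=5_000, years=4)
    print(f"One-time bonus: {signing_bonus}, four-year gain from higher base: {gain:.2f}")
```

Under these made-up assumptions, a 5,000 base increase overtakes a 15,000 signing bonus within four years. With a shorter expected tenure the bonus wins, which is why the right lever depends on your situation.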

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight into what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.

CHAPTER 16

Learning How To Learn

bull Introduction

bull Upgrading Your Learning Toolbox

bull Common Learning Traps

bull Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall: This strategy consists of quizzing yourself on topics you've been learning, most commonly with flashcards. One study showed a retention difference of 46 between students who used active recall as a learning strategy and a control group that used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended schedule is to review after 1 day, 3 days, 7 days, and 21 days to offset the forgetting curve.
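The 1, 3, 7, and 21 day schedule above is easy to turn into concrete calendar dates. The sketch below assumes the intervals are counted from the day you first studied the material (the text does not say whether they are measured from the first session or from the previous review). The names `REVIEW_OFFSETS_DAYS` and `review_dates` are illustrative only.

```python
from datetime import date, timedelta
from typing import List, Sequence

# Offsets in days after the first study session (assumed interpretation).
REVIEW_OFFSETS_DAYS = (1, 3, 7, 21)


def review_dates(first_studied: date, offsets: Sequence[int] = REVIEW_OFFSETS_DAYS) -> List[date]:
    """Return the calendar dates on which a flashcard should be reviewed."""
    return [first_studied + timedelta(days=days) for days in offsets]


if __name__ == "__main__":
    for review_day in review_dates(date(2019, 3, 1)):
        print(review_day.isoformat())
```

Flashcard tools typically adapt the intervals to how well you answer, so treat a fixed schedule like this as a starting point.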

Chunking: This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, the computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving: To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type left students unable to tell the difference between problems later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about machine learning, I combined technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example chemistry and biology, where you'll be able to note both similarities and differences.

Deliberate Practice: This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that deliberate practice is easier in some subjects than in others, the gold standard of deliberate practice generally consists of the following:

bull A feedback loop in the form of a competition or a test

bull A teacher that can guide you

bull Uniquely tailored practice regimen

bull Outside of your comfort zone

bull Has a concrete goal or performance

bull When practicing you need to concentrate fully

bull Builds or modifies skills

bull Has a known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher: An experienced teacher can mean the difference between breaking through and breaking down. As such, it's important to consider the following when looking for a teacher:

bull They are someone who can keep you up to date on specific progress

bull Provides practice exercises

bull Directs attention to what aspects of learning you should be paying attention to

bull Helps develop correct mental representations

bull Provides feedback on what errors yoursquore making

bull Gives you an aspirational model for what good performance looks like

Dynamic Testing: As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your gaps are. The easiest way to do this is with flashcards, but other ways include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro Technique: This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes, during which you focus on one single thing. Once the timer is up, you take a 5- or 10-minute break before starting again.
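If it helps to see a session laid out in advance, the focus and break cycle can be planned with a few lines of code. This is only a sketch: the function name `pomodoro_schedule` and its parameters are invented here, with defaults matching the 25-minute focus and 5-minute break mentioned above.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def pomodoro_schedule(
    start: datetime,
    rounds: int,
    focus_minutes: int = 25,
    break_minutes: int = 5,
) -> List[Tuple[str, datetime, datetime]]:
    """Return (label, start, end) blocks that alternate focus and break periods."""
    blocks = []
    cursor = start
    for _ in range(rounds):
        focus_end = cursor + timedelta(minutes=focus_minutes)
        blocks.append(("focus", cursor, focus_end))
        break_end = focus_end + timedelta(minutes=break_minutes)
        blocks.append(("break", focus_end, break_end))
        cursor = break_end
    return blocks


if __name__ == "__main__":
    for label, begin, end in pomodoro_schedule(datetime(2019, 3, 19, 9, 0), rounds=2):
        print(f"{label:5s} {begin:%H:%M} to {end:%H:%M}")
```

Passing `focus_minutes=50` and `break_minutes=10` gives the longer variant.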

16.3 Common Learning Traps

Authority Trap: The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key to forming the right mental models.

Confirmation Bias: Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect: Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung: Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion: Learning something complex and thinking you understand it. For example, you may read about computing integrals and think you understand them, but when tested you fail to get any of the answers correct. Having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking: Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy, agitated, or upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

bull https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7

bull https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn

bull https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

bull https://www.coursera.org/learn/learning-how-to-learn

CHAPTER 17

Communication

bull Introduction

bull Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 18

Product

bull Introduction

bull Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References

CHAPTER 19

Stakeholder Management

bull Introduction

bull Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

bull Agriculture

bull Art

bull Biology

bull Chemistry/Materials Science

bull Climate/Weather

bull Complex Networks

bull Computer Networks

bull Data Challenges

bull Earth Science

bull Economics

bull Education

bull Energy

bull Finance

bull GIS

bull Government

bull Healthcare

bull Image Processing

bull Machine Learning

bull Museums

bull Music

bull Natural Language

bull Neuroscience

bull Physics

bull Psychology/Cognition

bull Public Domains

bull Search Engines

bull Social Networks

bull Social Sciences

bull Software

bull Sports

bull Time Series

bull Transportation

20.1 Agriculture

bull US Department of Agriculture's PLANTS Database

bull US Department of Agriculture's Nutrient Database

20.2 Art

bull Google's Quick Draw Sketch Dataset

20.3 Biology

bull 1000 Genomes

bull American Gut (Microbiome Project)

bull Broad Bioimage Benchmark Collection (BBBC)

bull Broad Cancer Cell Line Encyclopedia (CCLE)

bull Cell Image Library

bull Complete Genomics Public Data

bull EBI ArrayExpress

bull EBI Protein Data Bank in Europe

bull Electron Microscopy Pilot Image Archive (EMPIAR)

bull ENCODE project

bull Ensembl Genomes

bull Gene Expression Omnibus (GEO)

bull Gene Ontology (GO)

bull Global Biotic Interactions (GloBI)

bull Harvard Medical School (HMS) LINCS Project

bull Human Genome Diversity Project

bull Human Microbiome Project (HMP)

bull ICOS PSP Benchmark

bull International HapMap Project

bull Journal of Cell Biology DataViewer

bull MIT Cancer Genomics Data

bull NCBI Proteins

bull NCBI Taxonomy

bull NCI Genomic Data Commons

bull NIH Microarray data or FTP (see FTP link on RAW)

bull OpenSNP genotypes data

bull Pathguide - Protein-Protein Interactions Catalog

bull Protein Data Bank

bull Psychiatric Genomics Consortium

bull PubChem Project

bull PubGene (now Coremine Medical)

bull Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

bull Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

bull Sequence Read Archive (SRA)

bull Stanford Microarray Data

bull Stowers Institute Original Data Repository

bull Systems Science of Biological Dynamics (SSBD) Database

bull The Cancer Genome Atlas (TCGA) available via Broad GDAC

bull The Catalogue of Life

bull The Personal Genome Project or PGP

bull UCSC Public Data

bull UniGene

bull Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

bull NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

bull Open Quantum Materials Database

bull Citrination Public Datasets

bull Khazana Project

20.5 Climate/Weather

bull Actuaries Climate Index

bull Australian Weather

bull Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system

bull Brazilian Weather - Historical data (In Portuguese)

bull Canadian Meteorological Centre

bull Climate Data from UEA (updated monthly)

bull European Climate Assessment & Dataset

bull Global Climate Data Since 1929

bull NASA Global Imagery Browse Services

bull NOAA Bering Sea Climate

bull NOAA Climate Datasets

bull NOAA Realtime Weather Models

bull NOAA SURFRAD Meteorology and Radiation Datasets

bull The World Bank Open Data Resources for Climate Change

bull UEA Climatic Research Unit

bull WorldClim - Global Climate Data

bull WU Historical Weather Worldwide

20.6 Complex Networks

bull AMiner Citation Network Dataset

bull CrossRef DOI URLs

bull DBLP Citation dataset

bull DIMACS Road Networks Collection

bull NBER Patent Citations

bull Network Repository with Interactive Exploratory Analysis Tools

bull NIST complex networks data collection

bull Protein-protein interaction network

bull PyPI and Maven Dependency Network

bull Scopus Citation Database

bull Small Network Data

bull Stanford GraphBase (Steven Skiena)

bull Stanford Large Network Dataset Collection

bull Stanford Longitudinal Network Data Sources

bull The Koblenz Network Collection

bull The Laboratory for Web Algorithmics (UNIMI)

bull The Nexus Network Repository

bull UCI Network Data Repository

bull UFL sparse matrix collection

bull WSU Graph Database

20.7 Computer Networks

bull 3.5B Web Pages from CommonCrawl 2012

bull 53.5B Web clicks of 100K users in Indiana Univ.

bull CAIDA Internet Datasets

bull ClueWeb09 - 1B web pages

bull ClueWeb12 - 733M web pages

bull CommonCrawl Web Data over 7 years

bull CRAWDAD Wireless datasets from Dartmouth Univ

bull Criteo click-through data

bull OONI Open Observatory of Network Interference - Internet censorship data

bull Open Mobile Data by MobiPerf

bull Rapid7 Sonar Internet Scans

bull UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

bull Bruteforce Database

bull Challenges in Machine Learning

bull CrowdANALYTIX dataX

bull D4D Challenge of Orange

bull DrivenData Competitions for Social Good

bull ICWSM Data Challenge (since 2009)

bull Kaggle Competition Data

bull KDD Cup by Tencent 2012

bull Localytics Data Visualization Challenge

bull Netflix Prize

bull Space Apps Challenge

bull Telecom Italia Big Data Challenge

bull TravisTorrent Dataset - MSR'2017 Mining Challenge

bull Yelp Dataset Challenge

20.9 Earth Science

bull AQUASTAT - Global water resources and uses

bull BODC - marine data of ~22K vars

bull Earth Models

bull EOSDIS - NASA's earth observing system data

bull Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3

bull Marinexplore - Open Oceanographic Data

bull Smithsonian Institution Global Volcano and Eruption Database

bull USGS Earthquake Archives

20.10 Economics

bull American Economic Association (AEA)

bull EconData from UMD

bull Economic Freedom of the World Data

bull Historical MacroEconomic Statistics

bull International Economics Database and various data tools

bull International Trade Statistics

bull Internet Product Code Database

bull Joint External Debt Data Hub

bull Jon Haveman International Trade Data Links

bull OpenCorporates Database of Companies in the World

bull Our World in Data

bull SciencesPo World Trade Gravity Datasets

bull The Atlas of Economic Complexity

bull The Center for International Data

bull The Observatory of Economic Complexity

bull UN Commodity Trade Statistics

bull UN Human Development Reports

20.11 Education

bull College Scorecard Data

bull Student Data from Free Code Camp

20.12 Energy

bull AMPds

bull BLUEd

bull COMBED

bull Dataport

bull DRED

bull ECO

bull EIA

bull HES - Household Electricity Study UK

bull HFED

bull iAWE

bull PLAID - the Plug Load Appliance Identification Dataset

bull REDD

bull Tracebase

bull UK-DALE - UK Domestic Appliance-Level Electricity

bull WHITED

20.13 Finance

bull CBOE Futures Exchange

bull Google Finance

bull Google Trends

bull NASDAQ

bull NYSE Market Data (see FTP link on RAW)

bull OANDA

bull OSU Financial data

bull Quandl

bull St. Louis Federal

bull Yahoo Finance

20.14 GIS

bull ArcGIS Open Data portal

bull Cambridge, MA, US, GIS data on GitHub

bull Factual Global Location Data

bull Geo Spatial Data from ASU

bull Geo Wiki Project - Citizen-driven Environmental Monitoring

bull GeoFabrik - OSM data extracted to a variety of formats and areas

bull GeoNames Worldwide

bull Global Administrative Areas Database (GADM)

bull Homeland Infrastructure Foundation-Level Data

bull Landsat 8 on AWS

bull List of all countries in all languages

bull National Weather Service GIS Data Portal

bull Natural Earth - vectors and rasters of the world

bull OpenAddresses

bull OpenStreetMap (OSM)

bull Pleiades - Gazetteer and graph of ancient places

bull Reverse Geocoder using OSM data & additional high-resolution data files

bull TIGER/Line - US boundaries and roads

bull TwoFishes - Foursquare's coarse geocoder

bull TZ Timezones shapefiles

bull UN Environmental Data

bull World boundaries from the US Department of State

bull World countries in multiple formats

20.15 Government

bull A list of cities and countries contributed by community

bull Open Data for Africa

bull OpenDataSoft's list of 1600 open data

20.16 Healthcare

bull EHDP Large Health Data Sets

bull Gapminder World demographic databases

bull Medicare Coverage Database (MCD) US

bull Medicare Data Engine of medicare.gov Data

bull Medicare Data File

bull MeSH, the vocabulary thesaurus used for indexing articles for PubMed

bull Number of Ebola Cases and Deaths in Affected Countries (2014)

bull Open-ODS (structure of the UK NHS)

bull OpenPaymentsData Healthcare financial relationship data

bull The Cancer Genome Atlas project (TCGA) and BigQuery table

bull World Health Organization Global Health Observatory

20.17 Image Processing

bull 10k US Adult Faces Database

bull 2GB of Photos of Cats or Archive version

bull Adience, Unfiltered faces for gender and age classification

bull Affective Image Classification

bull Animals with attributes

bull Caltech Pedestrian Detection Benchmark

bull Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)

bull Face Recognition Benchmark

bull GDXray, X-ray images for X-ray testing and Computer Vision

bull ImageNet (in WordNet hierarchy)

bull Indoor Scene Recognition

bull International Affective Picture System UFL

bull Massive Visual Memory Stimuli MIT

bull MNIST database of handwritten digits, 70,000 examples

bull Several Shape-from-Silhouette Datasets

bull Stanford Dogs Dataset

bull SUN database MIT

bull The Action Similarity Labeling (ASLAN) Challenge

bull The Oxford-IIIT Pet Dataset

bull Violent-Flows - Crowd Violence Non-violence Database and benchmark

bull Visual genome

bull YouTube Faces Database

20.18 Machine Learning

bull Context-aware data sets from five domains

bull Delve Datasets for classification and regression (Univ of Toronto)

bull Discogs Monthly Data

bull eBay Online Auctions (2012)

bull IMDb Database

bull Keel Repository for classification, regression, and time series

bull Labeled Faces in the Wild (LFW)

bull Lending Club Loan Data

bull Machine Learning Data Set Repository

bull Million Song Dataset

bull More Song Datasets

bull MovieLens Data Sets

bull New Yorker caption contest ratings

bull RDataMining - "R and Data Mining" ebook data

bull Registered Meteorites on Earth

bull Restaurants Health Score Data in San Francisco

bull UCI Machine Learning Repository

bull Yahoo Ratings and Classification Data

bull Youtube 8m

20.19 Museums

bull Canada Science and Technology Museums Corporation's Open Data

bull Cooper-Hewitt's Collection Database

bull Minneapolis Institute of Arts metadata

bull Natural History Museum (London) Data Portal

bull Rijksmuseum Historical Art Collection

bull Tate Collection metadata

bull The Getty vocabularies

20.20 Music

bull Nottingham Folk Songs

bull Bach 10

20.21 Natural Language

bull Automatic Keyphrase Extraction

bull Blogger Corpus

bull CLiPS Stylometry Investigation Corpus

bull ClueWeb09 FACC

bull ClueWeb12 FACC

bull DBpedia - 4.58M things with 583M facts

bull Flickr Personal Taxonomies

bull Freebase.com of people, places, and things

bull Google Books Ngrams (2.2TB)

bull Google MC-AFP generated based on the public available Gigaword dataset using Paragraph Vectors

bull Google Web 5gram (1TB 2006)

bull Gutenberg eBooks List

bull Hansards text chunks of Canadian Parliament

bull Machine Comprehension Test (MCTest) of text from Microsoft Research

bull Machine Translation of European languages

bull Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

bull Multi-Domain Sentiment Dataset (version 2.0)

bull Open Multilingual Wordnet

bull Personae Corpus

bull SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

bull SMS Spam Collection in English

bull Universal Dependencies

bull USENET postings corpus of 2005~2011

bull Webhose - News/Blogs in multiple languages

bull Wikidata - Wikipedia databases

bull Wikipedia Links data - 40 Million Entities in Context

bull WordNet databases and tools

20.22 Neuroscience

bull Allen Institute Datasets

bull Brain Catalogue

bull Brainomics

bull CodeNeuro Datasets

bull Collaborative Research in Computational Neuroscience (CRCNS)

bull FCP-INDI

bull Human Connectome Project

bull NDAR

bull NeuroData

bull Neuroelectro

bull NIMH Data Archive

bull OASIS

bull OpenfMRI

bull Study Forrest

20.23 Physics

bull CERN Open Data Portal

bull Crystallography Open Database

bull NASA Exoplanet Archive

bull NSSDC (NASA) data of 550 spacecraft

bull Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

bull OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

bull Amazon

bull Archive-it from Internet Archive

bull Archive.org Datasets

bull CMU JASA data archive

bull CMU StatLab collections

bull Data.World

bull Data360

bull Datamob.org

bull Google

bull Infochimps

bull KDNuggets Data Collections

bull Microsoft Azure Data Market Free DataSets

bull Microsoft Data Science for Research

bull Numbrary

bull Open Library Data Dumps

bull Reddit Datasets

bull RevolutionAnalytics Collection

bull Sample R data sets

bull Stats4Stem R data sets

bull StatSci.org

bull The Washington Post List

bull UCLA SOCR data collection

bull UFO Reports

bull Wikileaks 9/11 pager intercepts

bull Yahoo Webscope

20.26 Search Engines

bull Academic Torrents of data sharing from UMB

bull Datahub.io

bull DataMarket (Qlik)

bull Harvard Dataverse Network of scientific data

bull ICPSR (UMICH)

bull Institute of Education Sciences

bull National Technical Reports Library

bull Open Data Certificates (beta)

bull OpenDataNetwork - A search engine of all Socrata powered data portals

bull Statista.com - Statistics and Studies

bull Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

bull 72 hours gamergate Twitter Scrape

bull Ancestry.com Forum Dataset over 10 years

bull Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

bull CMU Enron Email of 150 users

bull EDRM Enron EMail of 151 users hosted on S3

bull Facebook Data Scrape (2005)

bull Facebook Social Networks from LAW (since 2007)

bull Foursquare from UMN/Sarwat (2013)

bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull Youtube Video Social Graph in 2007/2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive

bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (ND-GAIN)

bull Open Crime and Policing Data in England, Wales and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1 from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database

bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks, port calls and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport, airline and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome-machine-learning.

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python

bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF)

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit. HTK is a portable toolkit for building and manipulating hidden Markov models

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph

bull CUDA - This is a fast C++/CUDA implementation of convolutional [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework with less than 1000 lines of core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state of the art machine learning easy to work with and integrate into existing applications

bull Distributed Machine learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that change for every training instance. Written in C++ with bindings in Python

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. Library provides algorithmic building blocks for all stages of data analytics and allows processing data in batch, online and distributed modes

bull LightGBM framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization and storage

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ for segmenting/labeling sequential data & other Natural Language Processing tasks

bull CRFsuite for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit. GRT is a cross-platform, open-source, C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Infections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based, R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir
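For readers unfamiliar with the technique behind this entry, the following is a minimal Python sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is not the Simple Bayes API; the function names `train` and `classify` and the toy training sentences are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count documents per class and words per class.

    `samples` is an iterable of (label, text) pairs (a hypothetical format).
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in samples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        # Start from the log prior of the class.
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from producing log(0).
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for this sketch.
model = train([
    ("spam", "win money now"),
    ("spam", "free money offer"),
    ("ham", "meeting at noon"),
    ("ham", "lunch at noon tomorrow"),
])
print(classify(model, "free money"))
print(classify(model, "noon meeting"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.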

Natural Language Processing

bull Stemmer stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang
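As a rough illustration of the Map Reduce model this entry refers to, here is a single-process Python sketch of a word count. It does not use Disco itself; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the quick fox", "the lazy dog", "the fox"])
print(counts["the"], counts["fox"])  # prints: 3 2
```

Because the map and reduce functions only see one document or one key at a time, the same logic can be parallelized without changing them.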

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression
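To make the idea of an in-memory n-gram index concrete, here is a small Python sketch. It is not the go-ngram API and omits compression entirely; the `NGramIndex` class, the `$` padding character, and the sample words are assumptions chosen for illustration.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lower-cased string."""
    padded = "$" * (n - 1) + text.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class NGramIndex:
    """Hypothetical index mapping each n-gram to the strings that contain it."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, text):
        """Register every n-gram of `text` in the postings table."""
        for gram in ngrams(text, self.n):
            self.postings[gram].add(text)

    def search(self, query):
        """Rank indexed strings by the number of n-grams shared with the query."""
        scores = defaultdict(int)
        for gram in ngrams(query, self.n):
            for text in self.postings.get(gram, ()):
                scores[text] += 1
        # Highest overlap first; ties broken alphabetically.
        return sorted(scores, key=lambda t: (-scores[t], t))


index = NGramIndex()
for word in ["stemming", "stemmer", "streaming"]:
    index.add(word)
print(index.search("stem"))  # ['stemmer', 'stemming', 'streaming']
```

Ranking by shared n-grams gives approximate string matching: a query can retrieve entries that merely overlap with it rather than match exactly.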

General-Purpose Machine Learning

bull gago - Multi-population, flexible, parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with a pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Cortical.io as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System

bull Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK components in Java and is built on top of Apache UIMA

bull Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license

bull CogcompNLP developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug-in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems; it offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application

Speech Recognition

bull CMU Sphinx - Open Source Toolkit For Speech Recognition, a purely Java-based speech recognition library

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in javascript and coffeescript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3d plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - library written on Vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnetjs - ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means and agglomerative clustering

bull Node-fann bindings for Node.js

bull Kmeans.js - Simple Javascript implementation of the k-means algorithm, for node.js and the browser

bull LDA.js - LDA topic modeling for node.js

bull Learning.js - Javascript implementation of logistic regression/c4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration; just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has an advantage on large datasets and multi-threaded training

bull kalman - Kalman filter for Javascript

bull shaman - node.js library with support for both simple and multiple linear regression

bull ml.js - Machine learning and numerical analysis tools for Node.js and the Browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

bull simple-statistics as well as in node.js

bull regression-js - A javascript library containing a collection of least squares fitting methods for finding a trend in a set of data

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance
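The great circle distance mentioned in the last entry can be computed with the haversine formula. The Python sketch below is one common way to do it and is not taken from the GreatCircle library; the function name `great_circle_km` and the mean Earth radius of 6371 km are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the equator: from (0, 0) to (0, 90 degrees east).
print(round(great_circle_km(0.0, 0.0, 0.0, 90.0), 1))  # 10007.5
```

The formula treats the Earth as a sphere, so results differ slightly from ellipsoidal models used in surveying.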

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression, so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models mixed-effects models

bull Simple MCMC - basic mcmc sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVM's for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata, SAS, and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version

bull graph - Graph package for Torch

bull randomkit - Numpy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - framework for torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua, and then optimizing them with OpenGM

bull sphagetti module for torch7 by MichaelMathieu

bull LuaSHKit - A lua wrapper around the Locality sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images.

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos.

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms.

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules.

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets, including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, and NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform, designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - A technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning with PDFs, lectures, and code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB.

Data Analysis / Data Visualization

bull matlab_gbl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross-platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT, and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing, and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models, and general algorithms such as Ransac, Cross-validation, and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.

bull DiffSharp - An automatic differentiation library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation, or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering, and everyday use. Supports .Net 4.0, .Net 3.5, and Mono on Windows, Linux, and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1, and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.

bull Sho - An interactive environment for data analysis and scientific computing that enables fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples with a trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift.

bull BPN-NeuralNetwork - A back-propagation neural network. This network can be used in product recommendation, user behavior analysis, data mining, and data analysis.

bull Multi-Perceptron-NeuralNetwork - A multi-perceptron neural network designed with unlimited hidden layers

bull KRHebbian-Algorithm - A Hebbian learning algorithm for neural networks

bull KRKmeans-Algorithm - Implements K-Means, the clustering and classification algorithm. It can be used in data mining and image compression.

bull KRFuzzyCMeans-Algorithm - Implements Fuzzy C-Means, the fuzzy clustering and classification algorithm. It can be used in data mining and image compression.
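Several entries above implement K-Means clustering. As a rough illustration of what such a library does internally, here is a minimal sketch of the standard Lloyd iteration in plain Python. It is not the API of any package listed here: the function name `kmeans`, the 2-D point format, and the choice of the first k points as starting centroids are assumptions made for this example.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points into k groups with plain Lloyd iterations.

    Hypothetical helper for illustration; not taken from any listed library.
    """
    # Assumption: the first k points serve as the initial centroids,
    # which keeps the example deterministic.
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        assignment = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignment


points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
```

Real implementations usually add random or k-means++ initialisation and a convergence check instead of a fixed iteration count.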

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction, and much more in one library.

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression
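PredictionBuilder, listed above, builds predictions using a linear regression. The idea can be sketched in a few lines of Python as an ordinary least-squares fit of a straight line. The function names `fit_line` and `predict` are hypothetical and chosen for this example; they are not PredictionBuilder's API.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
estimate = predict(5, slope, intercept)
```

The sketch assumes at least two distinct x values; with identical x values the denominator `sxx` is zero and the slope is undefined.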

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.

bull Quepy - A python framework to transform natural language questions to queries in a database query language

bull TextBlob - Provides a consistent API for common natural language processing tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, and GIZA++ alignments.

bull python-ucto

bull python-frog

bull python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to a C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - higher-level NLP built on Spacy

bull stanford-corenlp-python - Python wrapper for Stanford CoreNLP
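Several of the string-matching entries above (Distance, editdistance, jellyfish) compute Levenshtein or Hamming distances. For readers unfamiliar with the two measures, the following is a minimal pure-Python sketch. The function names are chosen for this example and do not reproduce the API of any listed package, which are generally much faster.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(hamming("karolin", "kathrin"))     # 3
```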

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning.

bull machine learning - Automated build consisting of a web-interface and a set of programmatic interfaces, with data stored in a NoSQL datastore

bull XGBoost - Python bindings for the eXtreme Gradient Boosting library

bull Bayesian Methods for Hackers - Book/iPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch, or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented, and tested library.

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A library with various machine learning models implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 - A Machine Learning library based on Theano

bull keras - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible, and fun neural networks. This is the successor of PyBrain.

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks - Implementation of image to image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Library for approximate unsupervised inference in Bayesian hidden Markov models and explicit-duration hidden semi-Markov models, focusing on the Bayesian nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing GPU-meta-programming code generating array oriented optimizing math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit.

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for my book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning.

bull pgmpy - A python library for working with Probabilistic Graphical Models

bull DIGITS - A web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript, and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides a better user experience.

bull rgf_python - Python bindings for the Regularized Greedy Forest library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree, and more; it's easy to use and easy to learn for beginners

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Python JIT compiler to LLVM aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion, based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser.

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual, intuitive, and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has a Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in python, without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspel - raspell is an interface binding for ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists, and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community based data collection, packed in a gem. Get a list of pretty much anything (stop words, countries, non words) in txt, json, or hash. Demo/Search for a list.

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU, under the MIT license

bull rustlearn - A machine learning framework featuring logistic regression, support vector machines, decision trees, and random forests

bull rusty-machine - A pure-Rust machine learning library

bull leaf - Open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz: Regularization for semiparametric additive hazards regression

bull arules - arules: Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso: Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm: Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta: A wrapper algorithm for all-relevant feature selection

bull bst - bst: Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training: Unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn: Classification, regression, feature evaluation, and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist: Rule- and Instance-Based Regression Modeling

bull e1071 - e1071: Misc Functions of the Department of Statistics, TU Wien

bull earth - earth: Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions, and examples from the book "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

bull evtree - evtree: Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc: Flexible procedures for clustering

bull frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS

bull gbm - gbm: Generalized Boosted Regression Models

bull glmnet - glmnet: Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso: Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg: Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deeplearning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda: Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred: Improved Predictors

bull kernlab - kernlab: Kernel-based Machine Learning Lab

bull klaR - klaR: Classification and visualization

bull lars - lars: Least Angle Regression, Lasso, and Forward Stagewise

bull lasso2 - lasso2: L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg: Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree: Mapping, pruning, and graphing tree models

bull mboost - mboost: Model-Based Boosting

bull medley - medley: Blending regression models, using a greedy stepwise approach

bull mlr - mlr: Machine Learning in R

bull mvpart - mvpart: Multivariate partitioning

bull ncvreg - ncvreg: Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree: Oblique Trees for Classification Data

bull pamr - pamr: Pam: prediction analysis for microarrays

bull party - party: A Laboratory for Recursive Partytioning

bull partykit - partykit: A Toolkit for Recursive Partytioning

bull penalized - penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA: Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM: Feature Selection SVM using penalty functions

bull quantregForest - quantregForest: Quantile Regression Forests

bull randomForest - randomForest: Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle: Graphical user interface for data mining in R

bull rda - rda: Shrunken Centroids Regularized Discriminant Analysis

bull rdetools - rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces

bull REEMtree - REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data

bull relaxo - relaxo: Relaxed Lasso

bull rgenoud - rgenoud: R version of GENetic Optimization Using Derivatives

bull rgp - rgp: R genetic programming framework

bull Rmalschains - Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R

bull rminer - rminer: Simpler use of data mining methods (e.g. NN and SVM) in classification and regression

bull ROCR - ROCR: Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets: Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart: Recursive Partitioning and Regression Trees

bull RPMM - RPMM: Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka: R/Weka interface

bull RXshrink - RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda: Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA: Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath: the SVM Path algorithm

bull tgp - tgp: Bayesian treed Gaussian process models

bull tree - tree: Classification and regression trees

bull varSelRF - varSelRF: Variable selection using random forests

bull XGBoostR Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

2123 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE. Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

2124 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference

Data Analysis Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data A unix filter for algebird-powered aggregation

bull PredictionIO - PredictionIO a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

2125 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression and Ridge Regression

bull Swift Brain - The first neural network machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices etc

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization Training Techniques

ndash Unsupervised Generative Models

ndash Image Segmentation Object Detection

ndash Image Video

ndash Natural Language Processing

ndash Speech Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

221 Machine Learning

Be the first to contribute

222 Deep Learning

Forked from terryum's awesome deep learning papers

2221 Understanding

bull Distilling the knowledge in a neural network (2015) G Hinton et al [pdf]

bull Deep neural networks are easily fooled High confidence predictions for unrecognizable images (2015) A Nguyen et al [pdf]

bull How transferable are features in deep neural networks (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf An astounding baseline for recognition (2014) A Razavian et al [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014) M Oquab et al [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

2222 Optimization Training Techniques

bull Batch normalization Accelerating deep network training by reducing internal covariate shift (2015) S Ioffe and C Szegedy [pdf]

bull Delving deep into rectifiers Surpassing human-level performance on imagenet classification (2015) K He et al [pdf]

bull Dropout A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]

2223 Unsupervised Generative Models

bull Pixel recurrent neural networks (2016) A Oord et al [pdf]

bull Improved techniques for training GANs (2016) T Salimans et al [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015) A Radford et al [pdf]

bull DRAW A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

2224 Image Segmentation Object Detection

bull You only look once Unified real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

2225 Image Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

2226 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

2227 Speech Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

2228 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

2229 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22210 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009) H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses and more, forked from josephmisiti's awesome machine learning

bull Blogs

ndash Data Science

ndash Machine learning

ndash Math

bull Books

ndash Machine learning

ndash Deep learning

ndash Probability & Statistics

ndash Linear Algebra

bull Courses

bull Podcasts

bull Tutorials

231 Blogs

2311 Data Science

bull https://jeremykun.com

bull http://iamtrask.github.io

bull http://blog.explainmydata.com

bull http://andrewgelman.com

bull http://simplystatistics.org

bull http://www.evanmiller.org

bull http://jakevdp.github.io

bull http://blog.yhat.com

bull http://wesmckinney.com

bull http://www.overkillanalytics.net

bull http://newton.cx/~peter

bull http://mbakker7.github.io/exploratory_computing_with_python

bull https://sebastianraschka.com/blog/index.html

bull http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

bull http://colah.github.io

bull http://www.thomasdimson.com

bull http://blog.smellthedata.com

bull https://sebastianraschka.com

bull http://dogdogfish.com

bull http://www.johnmyleswhite.com

bull http://drewconway.com/zia

bull http://bugra.github.io

bull http://opendata.cern.ch

bull https://alexanderetz.com

bull http://www.sumsar.net

bull https://www.countbayesie.com

bull http://blog.kaggle.com

bull http://www.danvk.org

bull http://hunch.net

bull http://www.randalolson.com/blog

bull https://www.johndcook.com/blog/r_language_for_programmers

bull http://www.dataschool.io

2312 Machine learning

bull OpenAI

bull Distill

bull Andrej Karpathy Blog

bull Colah's Blog

bull WildML

bull FastML

bull TheMorningPaper

2313 Math

bull http://www.sumsar.net

bull http://allendowney.blogspot.ca

bull https://healthyalgorithms.com

bull https://petewarden.com

bull http://mrtz.org/blog

232 Books

2321 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory Inference and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and SVN Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch

bull Bayesian Modeling Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning Book+MatlabToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

2322 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

bull NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

2323 Probability & Statistics

bull Think Stats - Book + Python Code

bull From Algorithms to Z-Scores - Book

bull The Art of R Programming

bull Introduction to statistical thought

bull Basic Probability Theory

bull Introduction to probability - By Dartmouth College

bull Principle of Uncertainty

bull Probability & Statistics Cookbook

bull Advanced Data Analysis From An Elementary Point of View

bull Introduction to Probability - Book and course by MIT

bull The Elements of Statistical Learning Data Mining Inference and Prediction - Book

bull An Introduction to Statistical Learning with Applications in R - Book

bull Learning Statistics Using R

bull Introduction to Probability and Statistics Using R - Book

bull Advanced R Programming - Book

bull Practical Regression and Anova using R - Book

bull R practicals - Book

bull The R Inferno - Book

2324 Linear Algebra

bull Linear Algebra Done Wrong

bull Linear Algebra Theory and Applications

bull Convex Optimization

bull Applied Numerical Computing

bull Applied Numerical Linear Algebra

233 Courses

bull CS231n Convolutional Neural Networks for Visual Recognition Stanford University

bull CS224d Deep Learning for Natural Language Processing Stanford University

bull Oxford Deep NLP 2017 Deep Learning for Natural Language Processing University of Oxford

bull Artificial Intelligence (Columbia University) - free

bull Machine Learning (Columbia University) - free

bull Machine Learning (Stanford University) - free

bull Neural Networks for Machine Learning (University of Toronto) - free

bull Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach, Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering & Retrieval, Machine Learning: Recommender Systems & Dimensionality Reduction, Machine Learning Capstone: An Intelligent Application with Deep Learning; free

bull Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

bull Learning from Data (by Yaser S Abu-Mostafa Caltech) - Lecture videos available

234 Podcasts

bull The O'Reilly Data Show

bull Partially Derivative

bull The Talking Machines

bull The Data Skeptic

bull Linear Digressions

bull Data Stories

bull Learning Machines 101

bull Not So Standard Deviations

bull TWIMLAI

235 Tutorials

Be the first to contribute

CHAPTER 24

Contribute

Become a contributor. Check out our GitHub for more information.


CHAPTER 1

Bash

bull Introduction

bull Subj_1

11 Introduction

xx

12 Subj_1

Code

Some code

print("hello")

References

CHAPTER 2

Big Data

bull Introduction

bull Subj_1

https://www.quora.com/What-is-a-Hadoop-ecosystem

21 Introduction

The tooling and methodologies for dealing with the phenomenon of big data were originally created in response to the exponential growth in data generation (TODO: INSERT quote about data generation). When talking about big data, we mean (insert what we mean).

We'll be focusing exclusively on the Apache Hadoop Ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims.

Why It's Important

While a data scientist may not be directly involved in managing big data and performing Extract, Transform and Load (ETL) tasks, they'll likely be responsible, at a minimum, for querying the sources where this data is stored. Ideally, a data scientist should be familiar with how this data is stored, how to query it and how to transform it (e.g. Spark, Hadoop and Hive are three key technologies to be familiar with).

22 Subj_1

Code

Some code

print("hello")

References

CHAPTER 3

Databases

bull Introduction

bull Relational

ndash Types

ndash Advantages

ndash Disadvantages

bull Non-relational/NoSQL

ndash Types

ndash Advantages

ndash Disadvantages

bull Terminology

31 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds there are a variety of sub-types, which the following sections explore in detail.

Why It's Important

A data scientist will often have to work with a database to retrieve the data they need to build models or perform analysis. It is therefore important that a data scientist be familiar with the different types of databases and how to query them.

32 Relational

A relational database is a collection of tables and the relationships that exist between them. Each table contains columns (table attributes) and rows that hold the data. You can think of a column as the header naming the data being stored, and a row as the actual data points stored in the database. Rows often have a unique identifier associated with them, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.

To get data out of a relational database you use SQL, with different relational databases using slightly different dialects of SQL. When you interact with the database using SQL, whether creating a table or querying it, you create what is called a transaction. A transaction in a relational database represents a unit of work performed against the database.
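
The tables, keys and SQL queries described above can be sketched with Python's built-in sqlite3 module and an in-memory database. This is only an illustration: the authors and books tables, their columns and the sample rows are hypothetical names chosen for this example and do not come from the primer.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

# Each table has a primary key; books.author_id is a foreign key into authors.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE books ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES authors(id))"
)

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO books (title, author_id) VALUES (?, ?)",
    [("Notes on Engines", 1), ("Collected Letters", 1)],
)

# The foreign key lets us join every book row to its author row.
rows = conn.execute(
    "SELECT books.title, authors.name"
    " FROM books JOIN authors ON books.author_id = authors.id"
    " ORDER BY books.id"
).fetchall()
print(rows)
```

Other relational databases would use much the same statements, with small differences in data types and dialect.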

An important concept with transactions in relational databases is maintaining ACID, which stands for:

bull Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails then the whole transaction fails, leaving the database unchanged.

bull Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.

bull Isolation: Since transactions are often executed concurrently, this property ensures that the database is left in the same state as if the transactions had been executed sequentially.

bull Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
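
Atomicity in particular is easy to see in a small experiment. The sketch below again assumes Python's built-in sqlite3 module; the accounts table, the transfer helper and the balances are hypothetical and exist only for this illustration. It relies on the sqlite3 connection acting as a context manager that commits on success and rolls back when an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint makes a negative balance an invalid state.
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (hypothetical helper)."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # debit violates CHECK, credit is rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the second call the credit to bob is executed first, yet it does not survive: when the debit fails, the whole transaction is undone and the balances stay as they were after the first transfer.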

321 Types

While there are many different kinds of relational databases the most popular are

bull PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.

bull Oracle: A multi-model database management system that is produced and marketed by Oracle.

bull MySQL: An open-source relational database management system with a proprietary paid version available for additional functionality.

bull Microsoft SQL Server: A relational database management system developed by Microsoft.

bull MariaDB: A MySQL-compatible database engine forked from MySQL.

322 Advantages

bull SQL standards are well defined and commonly accepted

bull Easy to categorize and store data that can later be queried

bull Simple to understand since the table structure and relationships are intuitive to most users

bull Data integrity through strong data typing and validity checks that ensure data falls within an acceptable range

323 Disadvantages

bull Poor performance with unstructured data types due to schema and type constraints

bull Can be slower and harder to scale than NoSQL databases

bull Unable to map certain kinds of data such as graphs

33 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Scalability, multi-structured data and geo-distribution were further reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

331 Types

bull Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to store user session data; examples include Redis and Amazon Dynamo.

bull Document Store: Similar to a key-value store, except that the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples include MongoDB and Apache CouchDB.

bull Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. These can be used in content management systems; examples include Cassandra and Apache HBase.

bull Graph Store: These are based on the Entity-Attribute-Value model. Entities have associated attributes and subsequent values when data is inserted. Nodes store data about each entity along with the relationships between nodes. Graph stores can be used in applications such as social networks; examples include Neo4j and ArangoDB.
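
To make the difference between the first two store types concrete, here is a toy sketch in plain Python. It is not how Redis or MongoDB are implemented; the KeyValueStore class, the find_by_field helper and the sample keys are all hypothetical. It only shows the idea that a key-value store treats the value as opaque, while a document store can look inside a structured value.

```python
class KeyValueStore:
    """Toy in-memory store backed by a Python dict (itself a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

    def values(self):
        return list(self._data.values())


# Key-value usage: the store never inspects the value, it only maps key -> value.
sessions = KeyValueStore()
sessions.put("session:42", "user=alice;cart=3")

# Document usage: values are structured documents that can be queried by field.
posts = KeyValueStore()
posts.put("post:1", {"title": "Hello", "tags": ["intro"], "author": "alice"})
posts.put("post:2", {"title": "NoSQL notes", "tags": ["databases"], "author": "bob"})


def find_by_field(store, field, value):
    """Return every document whose `field` equals `value` (full scan, no index)."""
    return [doc for doc in store.values() if doc.get(field) == value]


print(sessions.get("session:42"))
print(find_by_field(posts, "author", "bob"))
```

Real document stores typically add indexes so that a field lookup does not have to scan every document.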

332 Advantages

bull High availability

bull Schema free or schema-on-read options

bull Ability to rapidly prototype applications

bull Elastic scalability

bull Can store massive amounts of data
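
Elastic scalability usually means spreading keys across several machines. The sketch below illustrates one simple way this could be done, hash partitioning, with each "machine" simulated by a Python dict. The shard_for, put and get functions and the shard count are assumptions made for this example, not the mechanism of any particular database.

```python
import hashlib

NUM_SHARDS = 3
# Each dict stands in for a separate machine in this illustration.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def put(key, value):
    shards[shard_for(key, NUM_SHARDS)][key] = value


def get(key):
    return shards[shard_for(key, NUM_SHARDS)].get(key)


for user_id in range(10):
    put(f"user:{user_id}", {"id": user_id})

print(get("user:7"))
print([len(s) for s in shards])  # how the ten keys were spread over the shards
```

A plain modulo like this remaps most keys whenever the number of shards changes, which is why production systems often use schemes such as consistent hashing instead.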

333 Disadvantages

bull Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync

bull Less support and maturity in the NoSQL ecosystem

34 Terminology

Query A query can be thought of as a single action that is taken on a database

33 Non-relationalNoSQL 9

Data Science Primer Documentation

Transaction A transaction is a sequence of queries that make up a single unit of work performed against a database

ACID Atomicity Consistency Isolation Durability

Schema: A schema is the structure of a database.

Scalability: Where databases are concerned, scalability has to do with how databases handle an increase in transactions as well as data stored. The two main types are vertical scalability, which is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc., and horizontal scalability, which has to do with adding more machines and splitting the work amongst them.
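One common way to "split the work" in horizontal scaling is to route each key to a machine (shard) by hashing it. The sketch below is an assumption-laden toy: `shard_for` and `distribute` are hypothetical helpers, and real systems typically use more elaborate schemes (for example consistent hashing) so that adding a machine does not reshuffle most keys.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash (hypothetical routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(keys, num_shards):
    """Group keys into `num_shards` lists, one list per machine."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


user_keys = [f"user{i}" for i in range(100)]
shards = distribute(user_keys, 4)
```

Every key lands on exactly one shard, and the same key always maps to the same shard, so reads know which machine to ask.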

Normalization: This is a technique of organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique of organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
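The trade-off between the two techniques can be sketched with `sqlite3`. The `customers`, `orders`, and `orders_wide` tables and their rows are invented for this example: the normalized layout stores each customer once and needs a JOIN to read, while the denormalized layout repeats customer columns on every order so the same question can be answered from one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: customer details are stored once and referenced by id.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Alice', 'Paris'), (2, 'Bob', 'Oslo');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

    -- Denormalized: customer columns are copied onto every order row.
    CREATE TABLE orders_wide AS
        SELECT o.id AS order_id, c.name, c.city, o.total
        FROM orders AS o JOIN customers AS c ON c.id = o.customer_id;
""")

# Reading from the normalized tables requires a JOIN.
joined = conn.execute(
    "SELECT c.name, SUM(o.total) FROM orders AS o "
    "JOIN customers AS c ON c.id = o.customer_id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()

# Reading from the denormalized table does not, at the cost of repeated data.
wide = conn.execute(
    "SELECT name, SUM(total) FROM orders_wide GROUP BY name ORDER BY name"
).fetchall()
```

Both queries return the same totals; the difference is that updating Alice's city touches one row in `customers` but two rows in `orders_wide`.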

References

• https://dzone.com/articles/the-types-of-modern-databases

• https://www.mongodb.com/nosql-explained

• https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect

• https://opensourceforu.com/2017/05/different-types-nosql-databases

CHAPTER 4

Data Engineering

• Introduction

• Subj_1

4.1 Introduction

xxx

4.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 5

Data Wrangling

• Introduction

• Subj_1

5.1 Introduction

xxx

5.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 6

Data Visualization

• Introduction

• Subj_1

6.1 Introduction

xxx

6.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 7

Deep Learning

• Introduction

• Subj_1

7.1 Introduction

xxx

7.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 8

Machine Learning

• Introduction

• Subj_1

8.1 Introduction

xxx

8.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 9

Python

• Introduction

• Subj_1

9.1 Introduction

xxx

9.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References

CHAPTER 11

SQL

• Introduction

• Subj_1

11.1 Introduction

xxx

11.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 12

Glossary

Definitions of common machine learning terms

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

• Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in a low average difference.

• High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold: The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True; otherwise return False.
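The thresholding rule in the diabetes example can be sketched in a few lines. The `classify` helper and the probabilities are made up for illustration; the only idea taken from the definition is "return True when the predicted probability is above the threshold".

```python
def classify(probabilities, threshold=0.5):
    """Label each predicted probability True when it is above the threshold."""
    return [p > threshold for p in probabilities]


# Made-up predicted probabilities of being diabetic.
predicted = [0.10, 0.50, 0.51, 0.90]

default_labels = classify(predicted)                # threshold of 50%
strict_labels = classify(predicted, threshold=0.8)  # only very confident positives
```

Raising the threshold turns borderline positives into negatives, which usually trades recall for precision.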

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories.

• True Positives: we correctly predicted they do have diabetes

• True Negatives: we correctly predicted they don't have diabetes

• False Positives: we incorrectly predicted they do have diabetes (Type I error)

• False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
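The four categories can be counted directly from paired lists of actual and predicted labels. This is a minimal sketch with invented labels; `confusion_counts` is a hypothetical helper, not a function from any particular library.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1  # has diabetes, predicted diabetes
        elif not a and not p:
            counts["TN"] += 1  # no diabetes, predicted no diabetes
        elif not a and p:
            counts["FP"] += 1  # no diabetes, predicted diabetes (Type I error)
        else:
            counts["FN"] += 1  # has diabetes, predicted no diabetes (Type II error)
    return counts


# Made-up ground truth and predictions for eight patients.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, True, False, False, True]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
```

Accuracy falls out of the same table as the share of predictions on the diagonal (true positives plus true negatives).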

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep learning is derived from a machine learning algorithm called the perceptron, or multi-layer perceptron, which is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning has been around for decades, but today we have more data and cheap computing power, which make this algorithm powerful enough to achieve state-of-the-art accuracy. This algorithm is also known as an artificial neural network. Deep learning is much more than a traditional artificial neural network, but it was highly influenced by machine learning's neural networks and perceptron networks.

Dimension: In machine learning and data science, dimension differs from its meaning in physics. Here, the dimension of data means how many features you have in your data set. For example, in an object detection application, the flattened image size and color channels (e.g. 28 × 28 × 3) make up the features of the input set. In a house price prediction task where house size is (perhaps) the only feature, we call the data 1-dimensional.

Epoch: An epoch describes the number of times the algorithm sees the entire data set.

Extrapolation: Making predictions outside the range of a dataset. E.g. My dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. "Color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection: Feature selection is the process of selecting relevant features from a data-set for creating a machine learning model.

Feature Vector: A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters: Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of a model. The depth of trees in a decision tree or the number of hidden layers in a neural network are examples of hyperparameters.

Induction: A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. We keep observing X, so we infer that Y must be True.

Instance: A data point, row, or sample in a dataset. Another term for observation.

Learning Rate: The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
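The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x², whose minimum is at x = 0; the function, starting point, step count, and the three learning rates are all assumptions picked to make the behavior visible.

```python
def gradient_descent(learning_rate, steps=50, start=10.0):
    """Minimize f(x) = x**2 from `start` and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x               # derivative of x**2
        x -= learning_rate * gradient  # step against the gradient
    return x


well_tuned = gradient_descent(learning_rate=0.1)   # shrinks toward 0 quickly
too_low = gradient_descent(learning_rate=0.001)    # barely moves in 50 steps
too_high = gradient_descent(learning_rate=1.1)     # overshoots further every step
```

With these settings the well-tuned run ends very close to 0, the low learning rate is still near the starting point after 50 steps, and the high learning rate overshoots the minimum on each update and moves away from it.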

Loss: Loss = true_value (from the data-set) - predicted_value (from the ML model). The lower the loss, the better a model (unless the model has over-fitted to the training data). The loss is calculated on training and validation, and its interpretation is how well the model is doing for these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in training or validation sets.

Machine Learning: Contribute a definition

Model: A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks: Contribute a definition

Normalization: Contribute a definition

Null Accuracy: Baseline accuracy that can be achieved by always predicting the most frequent class ("B has the highest frequency, so let's guess B every time").
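Null accuracy is simple to compute from the labels alone. In this sketch the label list is invented and `null_accuracy` is a hypothetical helper; it finds the most frequent class and reports the accuracy you would get by always guessing it.

```python
from collections import Counter


def null_accuracy(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


labels = ["B", "A", "B", "B", "C", "B", "A", "B"]  # made-up class labels
baseline_label, baseline = null_accuracy(labels)
```

A trained model is only useful if its accuracy clearly beats this baseline.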

Observation: A data point, row, or sample in a dataset. Another term for instance.

Overfitting: Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set but poorly on your test set (or new real-world data).

Parameters: Be the first to contribute

Precision: In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall: Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

$$R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

Recall vs Precision: Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the % of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren't helpful, however, because we missed the other brain tumors in the set. We were super precise when we tried, but we didn't try hard enough.

• Recall, or sensitivity, provides another lens with which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
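The brain scan example can be checked with a few lines of arithmetic. The counts below follow the scenario in the text (100 scans, 10 tumors, 1 correct positive guess); the second scenario, where every scan is flagged, is an added assumption to show how recall can be gamed.

```python
def precision(true_positives, false_positives):
    """Share of positive predictions that were correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of actual positives that were caught."""
    return true_positives / (true_positives + false_negatives)


# Scenario from the text: 100 scans, 10 with tumors, the model flags 1 scan correctly.
cautious_precision = precision(true_positives=1, false_positives=0)
cautious_recall = recall(true_positives=1, false_negatives=9)

# Assumed opposite extreme: flag all 100 scans as tumors.
eager_precision = precision(true_positives=10, false_positives=90)
eager_recall = recall(true_positives=10, false_negatives=0)
```

The cautious model has perfect precision and 10% recall, while the eager model has perfect recall and 10% precision, which is why the two metrics are usually read together.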

Regression: Predicting a continuous output (e.g. price, sales).

Regularization: Contribute a definition

Reinforcement Learning: Training a model to maximize a reward via iterative trial and error.

Segmentation: Contribute a definition

Specificity: In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$$

Supervised Learning: Training a model using a labeled dataset.

Test Set: A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set: A set of observations used to generate machine learning models.

Transfer Learning: Contribute a definition

Type 1 Error: False positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and the company hires them, but they are actually bad.

Type 2 Error: False negatives. The candidate was great, but the company passed on them.

Underfitting: Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem: A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning: Training a model to find patterns in an unlabeled dataset (e.g. clustering).

Validation Set: A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.

Variance: How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.

References

CHAPTER 13

Business

• Introduction

• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 14

Ethics

• Introduction

• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

References

CHAPTER 15

Mastering The Data Science Interview

• Introduction

• The Coding Challenge

• The HR Screen

• The Technical Call

• The Take Home Project

• The Onsite

• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that data science would be the sexiest job of the 21st century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is that most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from the social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can vary hugely. This is partly because, as an industry, there still isn't an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it's important to determine what flavor of data science appeals to you. Based on your response to that, what you study and what questions you'll be asked will vary. Despite the differences in the types, generally speaking they'll follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple Fizzbuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.

More often than not, you'll be provided with written test cases that will tell you if you've passed or failed a question. These will typically consider both correctness and complexity (i.e. how long your code took to run). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.
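As an example of writing your own tests when none are provided, here is a sketch built around Fizzbuzz. It assumes the conventional rules (multiples of 3 give "Fizz", multiples of 5 give "Buzz", multiples of both give "FizzBuzz", anything else gives the number itself); an actual challenge may word them differently, and `run_own_tests` is just a hypothetical helper name.

```python
def fizzbuzz(n):
    """Return the Fizzbuzz word for n under the conventional rules."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


def run_own_tests():
    """Check a handful of hand-picked cases, including the edge cases."""
    cases = {1: "1", 3: "Fizz", 5: "Buzz", 15: "FizzBuzz", 30: "FizzBuzz", 98: "98"}
    for n, expected in cases.items():
        assert fizzbuzz(n) == expected, (n, fizzbuzz(n), expected)
    return len(cases)


tests_passed = run_own_tests()
```

Picking cases at the boundaries between rules (3, 5, 15) tends to catch ordering mistakes, such as checking for 3 before checking for 15.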

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the 'correct' solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally when, even after passing all the test cases, you don't get to the next stage in the interview process.

Preparation

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first. When you hit a snag, move to the simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5-10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving dynamic programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
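The "brute force first, then optimize" approach can be sketched on a made-up interview-style problem: given a list of integers, decide whether any two distinct elements sum to a target. Both function names and the problem itself are illustrative assumptions, not questions taken from any particular company.

```python
def has_pair_brute_force(nums, target):
    """First pass: check every pair, O(n^2) time."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return True
    return False


def has_pair_optimized(nums, target):
    """Second pass: remember values seen so far, O(n) time with O(n) extra space."""
    seen = set()
    for n in nums:
        if target - n in seen:
            return True
        seen.add(n)
    return False
```

Keeping the brute-force version around is useful during the call: it gives you a reference to compare the optimized version against on a few small inputs.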

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking The Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied to what you'll be doing once you start working. They can occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g. "Here's a dataset, find some insights and pitch to business stakeholders") or focus on a more concrete deliverable (e.g. "Here's some data, build a classifier").

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis. Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution with customer revenue).

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or find by looking at the answers without the questions on this GitHub repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite. The rationale is that despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more Leetcode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which is typically evaluating your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioral interview questions such as "Tell me about yourself", "Why this company?", and "Tell me about a time you had to deal with a difficult colleague".

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desk working on their assigned projects. So try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500 company, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially for those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point, it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the option to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, signing/relocation is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401k matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully this article has given you some insight on what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.

CHAPTER 16

Learning How To Learn

• Introduction

• Upgrading Your Learning Toolbox

• Common Learning Traps

• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall: This strategy consists of quizzing yourself on topics you've been learning, with the most common way of doing this being flashcards. One study showed a retention difference of 46% between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spacing is to review after 1 day, 3 days, 7 days, and 21 days to offset the forgetting curve.
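The 1, 3, 7, and 21 day spacing mentioned above can be turned into concrete review dates. The sketch below is one possible reading, assuming each interval is counted from the day the material was first studied (the text does not say whether the intervals are cumulative). The names `REVIEW_OFFSETS` and `review_dates` are made up for this example.

```python
from datetime import date, timedelta

# Intervals in days from the text; assumed to be measured from the first study day.
REVIEW_OFFSETS = (1, 3, 7, 21)


def review_dates(first_studied, offsets=REVIEW_OFFSETS):
    """Return the dates on which a flashcard should be reviewed."""
    return [first_studied + timedelta(days=d) for d in offsets]


schedule = review_dates(date(2019, 3, 19))
for review_day in schedule:
    print(review_day.isoformat())
```

For a card first studied on 2019-03-19, this yields reviews on 2019-03-20, 2019-03-22, 2019-03-26, and 2019-04-09.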

Chunking: This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving: To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type left students unable to tell the difference between problem types later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example Chemistry and Biology, where you'll be able to note both similarities and differences.

Deliberate Practice: This type of practice refers to focused concentration with the goal of furthering one's abilities. Some subjects lend themselves to deliberate practice more easily than others, but generally the gold standard of deliberate practice consists of the following:

• A feedback loop in the form of a competition or a test

• A teacher who can guide you

• A uniquely tailored practice regimen

• Practice that takes you outside of your comfort zone

• A concrete goal or performance target

• Full concentration while practicing

• Practice that builds or modifies skills

• A known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher: An experienced teacher can mean the difference between breaking through and breaking down. As such, when looking for a teacher, it's important to find someone who:

• Keeps you up to date on your specific progress

• Provides practice exercises

• Directs your attention to the aspects of learning you should be focusing on

• Helps develop correct mental representations

• Provides feedback on what errors you're making

• Gives you an aspirational model for what good performance looks like

Dynamic Testing: As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro technique: This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes and focus on one single thing until it runs out. Once the timer is up, you take a 5 or 10-minute break before starting again.
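As a small illustration of the cycle described above, the sketch below lays out alternating focus and break blocks for a study session. It only computes a timetable and does not run a real timer. The function name `pomodoro_plan` and its parameters are hypothetical, with defaults taken from the 25-minute focus and 5-minute break mentioned in the text.

```python
from datetime import datetime, timedelta


def pomodoro_plan(start, rounds, focus_minutes=25, break_minutes=5):
    """Return (label, begin, end) tuples alternating focus blocks and breaks."""
    plan = []
    current = start
    for _ in range(rounds):
        focus_end = current + timedelta(minutes=focus_minutes)
        plan.append(("focus", current, focus_end))
        break_end = focus_end + timedelta(minutes=break_minutes)
        plan.append(("break", focus_end, break_end))
        current = break_end
    return plan


plan = pomodoro_plan(datetime(2019, 3, 19, 9, 0), rounds=2)
for label, begin, end in plan:
    print(f"{label}: {begin:%H:%M} to {end:%H:%M}")
```

Passing `focus_minutes=50, break_minutes=10` gives the longer variant of the same cycle.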

16.3 Common Learning Traps

Authority Trap: The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.

Confirmation Bias: Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect: Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung: Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion: Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them; then when tested, you fail to get any of the answers correct. Thus, having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking: Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy, agitated, or upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

• https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7

• https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn

• https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

• https://www.coursera.org/learn/learning-how-to-learn

CHAPTER 17

Communication

• Introduction

• Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 18

Product

• Introduction

• Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References

CHAPTER 19

Stakeholder Management

• Introduction

• Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

• Agriculture

• Art

• Biology

• Chemistry/Materials Science

• Climate/Weather

• Complex Networks

• Computer Networks

• Data Challenges

• Earth Science

• Economics

• Education

• Energy

• Finance

• GIS

• Government

• Healthcare

• Image Processing

• Machine Learning

• Museums

• Music

• Natural Language

• Neuroscience

• Physics

• Psychology/Cognition

• Public Domains

• Search Engines

• Social Networks

• Social Sciences

• Software

• Sports

• Time Series

• Transportation

20.1 Agriculture

• US Department of Agriculture's PLANTS Database

• US Department of Agriculture's Nutrient Database

20.2 Art

• Google's Quick Draw Sketch Dataset

20.3 Biology

• 1000 Genomes

• American Gut (Microbiome Project)

• Broad Bioimage Benchmark Collection (BBBC)

• Broad Cancer Cell Line Encyclopedia (CCLE)

• Cell Image Library

• Complete Genomics Public Data

• EBI ArrayExpress

• EBI Protein Data Bank in Europe

• Electron Microscopy Pilot Image Archive (EMPIAR)

• ENCODE project

• Ensembl Genomes

• Gene Expression Omnibus (GEO)

• Gene Ontology (GO)

• Global Biotic Interactions (GloBI)

• Harvard Medical School (HMS) LINCS Project

• Human Genome Diversity Project

• Human Microbiome Project (HMP)

• ICOS PSP Benchmark

• International HapMap Project

• Journal of Cell Biology DataViewer

• MIT Cancer Genomics Data

• NCBI Proteins

• NCBI Taxonomy

• NCI Genomic Data Commons

• NIH Microarray data or FTP (see FTP link on RAW)

• OpenSNP genotypes data

• Pathguide - Protein-Protein Interactions Catalog

• Protein Data Bank

• Psychiatric Genomics Consortium

• PubChem Project

• PubGene (now Coremine Medical)

• Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

• Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

• Sequence Read Archive (SRA)

• Stanford Microarray Data

• Stowers Institute Original Data Repository

• Systems Science of Biological Dynamics (SSBD) Database

• The Cancer Genome Atlas (TCGA), available via Broad GDAC

• The Catalogue of Life

• The Personal Genome Project or PGP

• UCSC Public Data

• UniGene

• Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

• NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

• Open Quantum Materials Database

• Citrination Public Datasets

• Khazana Project

20.5 Climate/Weather

• Actuaries Climate Index

• Australian Weather

• Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system

• Brazilian Weather - Historical data (in Portuguese)

• Canadian Meteorological Centre

• Climate Data from UEA (updated monthly)

• European Climate Assessment & Dataset

• Global Climate Data Since 1929

• NASA Global Imagery Browse Services

• NOAA Bering Sea Climate

• NOAA Climate Datasets

• NOAA Realtime Weather Models

• NOAA SURFRAD Meteorology and Radiation Datasets

• The World Bank Open Data Resources for Climate Change

• UEA Climatic Research Unit

• WorldClim - Global Climate Data

• WU Historical Weather Worldwide

20.6 Complex Networks

• AMiner Citation Network Dataset

• CrossRef DOI URLs

• DBLP Citation dataset

• DIMACS Road Networks Collection

• NBER Patent Citations

• Network Repository with Interactive Exploratory Analysis Tools

• NIST complex networks data collection

• Protein-protein interaction network

• PyPI and Maven Dependency Network

• Scopus Citation Database

• Small Network Data

• Stanford GraphBase (Steven Skiena)

• Stanford Large Network Dataset Collection

• Stanford Longitudinal Network Data Sources

• The Koblenz Network Collection

• The Laboratory for Web Algorithmics (UNIMI)

• The Nexus Network Repository

• UCI Network Data Repository

• UFL sparse matrix collection

• WSU Graph Database

20.7 Computer Networks

• 3.5B Web Pages from CommonCrawl 2012

• 53.5B Web clicks of 100K users in Indiana Univ.

• CAIDA Internet Datasets

• ClueWeb09 - 1B web pages

• ClueWeb12 - 733M web pages

• CommonCrawl Web Data over 7 years

• CRAWDAD Wireless datasets from Dartmouth Univ.

• Criteo click-through data

• OONI Open Observatory of Network Interference - Internet censorship data

• Open Mobile Data by MobiPerf

• Rapid7 Sonar Internet Scans

• UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

• Bruteforce Database

• Challenges in Machine Learning

• CrowdANALYTIX dataX

• D4D Challenge of Orange

• DrivenData Competitions for Social Good

• ICWSM Data Challenge (since 2009)

• Kaggle Competition Data

• KDD Cup by Tencent 2012

• Localytics Data Visualization Challenge

• Netflix Prize

• Space Apps Challenge

• Telecom Italia Big Data Challenge

• TravisTorrent Dataset - MSR'2017 Mining Challenge

• Yelp Dataset Challenge

20.9 Earth Science

• AQUASTAT - Global water resources and uses

• BODC - marine data of ~22K vars

• Earth Models

• EOSDIS - NASA's earth observing system data

• Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3

• Marinexplore - Open Oceanographic Data

• Smithsonian Institution Global Volcano and Eruption Database

• USGS Earthquake Archives

20.10 Economics

• American Economic Association (AEA)

• EconData from UMD

• Economic Freedom of the World Data

• Historical MacroEconomic Statistics

• International Economics Database and various data tools

• International Trade Statistics

• Internet Product Code Database

• Joint External Debt Data Hub

• Jon Haveman International Trade Data Links

• OpenCorporates Database of Companies in the World

• Our World in Data

• SciencesPo World Trade Gravity Datasets

• The Atlas of Economic Complexity

• The Center for International Data

• The Observatory of Economic Complexity

• UN Commodity Trade Statistics

• UN Human Development Reports

20.11 Education

• College Scorecard Data

• Student Data from Free Code Camp

20.12 Energy

• AMPds

• BLUEd

• COMBED

• Dataport

• DRED

• ECO

• EIA

• HES - Household Electricity Study, UK

• HFED

• iAWE

• PLAID - the Plug Load Appliance Identification Dataset

• REDD

• Tracebase

• UK-DALE - UK Domestic Appliance-Level Electricity

• WHITED

20.13 Finance

• CBOE Futures Exchange

• Google Finance

• Google Trends

• NASDAQ

• NYSE Market Data (see FTP link on RAW)

• OANDA

• OSU Financial data

• Quandl

• St. Louis Federal

• Yahoo Finance

20.14 GIS

• ArcGIS Open Data portal

• Cambridge, MA, US, GIS data on GitHub

• Factual Global Location Data

• Geo Spatial Data from ASU

• Geo Wiki Project - Citizen-driven Environmental Monitoring

• GeoFabrik - OSM data extracted to a variety of formats and areas

• GeoNames Worldwide

• Global Administrative Areas Database (GADM)

• Homeland Infrastructure Foundation-Level Data

• Landsat 8 on AWS

• List of all countries in all languages

• National Weather Service GIS Data Portal

• Natural Earth - vectors and rasters of the world

• OpenAddresses

• OpenStreetMap (OSM)

• Pleiades - Gazetteer and graph of ancient places

• Reverse Geocoder using OSM data & additional high-resolution data files

• TIGER/Line - US boundaries and roads

• TwoFishes - Foursquare's coarse geocoder

• TZ Timezones shapefiles

• UN Environmental Data

• World boundaries from the US Department of State

• World countries in multiple formats

20.15 Government

• A list of cities and countries contributed by community

• Open Data for Africa

• OpenDataSoft's list of 1,600 open data

20.16 Healthcare

• EHDP Large Health Data Sets

• Gapminder World demographic databases

• Medicare Coverage Database (MCD), US

• Medicare Data Engine of medicare.gov Data

• Medicare Data File

• MeSH, the vocabulary thesaurus used for indexing articles for PubMed

• Number of Ebola Cases and Deaths in Affected Countries (2014)

• Open-ODS (structure of the UK NHS)

• OpenPaymentsData, Healthcare financial relationship data

• The Cancer Genome Atlas project (TCGA) and BigQuery table

• World Health Organization Global Health Observatory

20.17 Image Processing

• 10k US Adult Faces Database

• 2GB of Photos of Cats or Archive version

• Adience Unfiltered faces for gender and age classification

• Affective Image Classification

• Animals with attributes

• Caltech Pedestrian Detection Benchmark

• Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)

• Face Recognition Benchmark

• GDXray, X-ray images for X-ray testing and Computer Vision

• ImageNet (in WordNet hierarchy)

• Indoor Scene Recognition

• International Affective Picture System, UFL

• Massive Visual Memory Stimuli, MIT

• MNIST database of handwritten digits, 70,000 examples (60,000 training, 10,000 test)

• Several Shape-from-Silhouette Datasets

• Stanford Dogs Dataset

• SUN database, MIT

• The Action Similarity Labeling (ASLAN) Challenge

• The Oxford-IIIT Pet Dataset

• Violent-Flows - Crowd Violence / Non-violence Database and benchmark

• Visual genome

• YouTube Faces Database

20.18 Machine Learning

• Context-aware data sets from five domains

• Delve Datasets for classification and regression (Univ. of Toronto)

• Discogs Monthly Data

• eBay Online Auctions (2012)

• IMDb Database

• Keel Repository for classification, regression, and time series

• Labeled Faces in the Wild (LFW)

• Lending Club Loan Data

• Machine Learning Data Set Repository

• Million Song Dataset

• More Song Datasets

• MovieLens Data Sets

• New Yorker caption contest ratings

• RDataMining - "R and Data Mining" ebook data

• Registered Meteorites on Earth

• Restaurants Health Score Data in San Francisco

• UCI Machine Learning Repository

• Yahoo Ratings and Classification Data

• YouTube-8M

20.19 Museums

• Canada Science and Technology Museums Corporation's Open Data

• Cooper-Hewitt's Collection Database

• Minneapolis Institute of Arts metadata

• Natural History Museum (London) Data Portal

• Rijksmuseum Historical Art Collection

• Tate Collection metadata

• The Getty vocabularies

20.20 Music

• Nottingham Folk Songs

• Bach 10

20.21 Natural Language

• Automatic Keyphrase Extraction

• Blogger Corpus

• CLiPS Stylometry Investigation Corpus

• ClueWeb09 FACC

• ClueWeb12 FACC

• DBpedia - 4.58M things with 583M facts

• Flickr Personal Taxonomies

• Freebase.com of people, places, and things

• Google Books Ngrams (2.2TB)

• Google MC-AFP, generated based on the publicly available Gigaword dataset using Paragraph Vectors

• Google Web 5gram (1TB, 2006)

• Gutenberg eBooks List

• Hansards text chunks of Canadian Parliament

• Machine Comprehension Test (MCTest) of text from Microsoft Research

• Machine Translation of European languages

• Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

• Multi-Domain Sentiment Dataset (version 2.0)

• Open Multilingual Wordnet

• Personae Corpus

• SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

• SMS Spam Collection in English

• Universal Dependencies

• USENET postings corpus of 2005~2011

• Webhose - News/Blogs in multiple languages

• Wikidata - Wikipedia databases

• Wikipedia Links data - 40 Million Entities in Context

• WordNet databases and tools

20.22 Neuroscience

• Allen Institute Datasets

• Brain Catalogue

• Brainomics

• CodeNeuro Datasets

• Collaborative Research in Computational Neuroscience (CRCNS)

• FCP-INDI

• Human Connectome Project

• NDAR

• NeuroData

• Neuroelectro

• NIMH Data Archive

• OASIS

• OpenfMRI

• Study Forrest

20.23 Physics

• CERN Open Data Portal

• Crystallography Open Database

• NASA Exoplanet Archive

• NSSDC (NASA) data of 550 spacecraft

• Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

• OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

• Amazon

• Archive-it from Internet Archive

• Archive.org Datasets

• CMU JASA data archive

• CMU StatLab collections

• Data.World

• Data360

• Datamob.org

• Google

• Infochimps

• KDNuggets Data Collections

• Microsoft Azure Data Market Free DataSets

• Microsoft Data Science for Research

• Numbray

• Open Library Data Dumps

• Reddit Datasets

• RevolutionAnalytics Collection

• Sample R data sets

• Stats4Stem R data sets

• StatSci.org

• The Washington Post List

• UCLA SOCR data collection

• UFO Reports

• Wikileaks 9/11 pager intercepts

• Yahoo Webscope

20.26 Search Engines

• Academic Torrents of data sharing from UMB

• Datahub.io

• DataMarket (Qlik)

• Harvard Dataverse Network of scientific data

• ICPSR (UMICH)

• Institute of Education Sciences

• National Technical Reports Library

• Open Data Certificates (beta)

• OpenDataNetwork - A search engine of all Socrata powered data portals

• Statista.com - statistics and studies

• Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

• 72 hours gamergate Twitter Scrape

• Ancestry.com Forum Dataset over 10 years

• Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

• CMU Enron Email of 150 users

• EDRM Enron EMail of 151 users, hosted on S3

• Facebook Data Scrape (2005)

• Facebook Social Networks from LAW (since 2007)

• Foursquare from UMN/Sarwat (2013)

• GitHub Collaboration Archive

• Google Scholar citation relations

• High-Resolution Contact Networks from Wearable Sensors

• Mobile Social Networks from UMASS

• Network Twitter Data

• Reddit Comments

• Skytrax' Air Travel Reviews Dataset

• Social Twitter Data

• SourceForge.net Research Data

• Twitter Data for Online Reputation Management

• Twitter Data for Sentiment Analysis

• Twitter Graph of entire Twitter site

• Twitter Scrape Calufa May 2011

• UNIMI/LAW Social Network Datasets

• Yahoo Graph and Social Data

• YouTube Video Social Graph in 2007, 2008

20.28 Social Sciences

• ACLED (Armed Conflict Location & Event Data Project)

• Canadian Legal Information Institute

• Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

• Correlates of War Project

• Cryptome Conspiracy Theory Items

• Datacards

• European Social Survey

• FBI Hate Crime 2013 - aggregated data

• Fragile States Index

• GDELT Global Events Database

• General Social Survey (GSS) since 1972

• German Social Survey

• Global Religious Futures Project

• Humanitarian Data Exchange

• INFORM Index for Risk Management

• Institute for Demographic Studies

• International Networks Archive

• International Social Survey Program (ISSP)

• International Studies Compendium Project

• James McGuire Cross National Data

• MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

• Minnesota Population Center

• MIT Reality Mining Dataset

• Notre Dame Global Adaptation Index (ND-GAIN)

• Open Crime and Policing Data in England, Wales, and Northern Ireland

• Paul Hensel General International Data Page

• PewResearch Internet Survey Project

• PewResearch Society Data Collection

• Political Polarity Data

• StackExchange Data Explorer

• Terrorism Research and Analysis Consortium

• Texas Inmates Executed Since 1984

• Titanic Survival Data Set or on Kaggle

• UCB's Archive of Social Science Data (D-Lab)

• UCLA Social Sciences Data Archive

• UN Civil Society Database

• Universities Worldwide

• UPJOHN for Labor Employment Research

• Uppsala Conflict Data Program

• World Bank Open Data

• WorldPop project - Worldwide human population distributions

20.29 Software

• FLOSSmole data about free, libre, and open source software development

20.30 Sports

• Basketball (NBA/NCAA/Euro) Player Database and Statistics

• Betfair Historical Exchange Data

• Cricsheet Matches (cricket)

• Ergast Formula 1, from 1950 up to date (API)

• Football/Soccer resources (data and APIs)

• Lahman's Baseball Database

• Pinhooker Thoroughbred Bloodstock Sale Data

• Retrosheet Baseball Statistics

• Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams, and Match Charting Project

20.31 Time Series

• Databanks International Cross National Time Series Data Archive

• Hard Drive Failure Rates

• Heart Rate Time Series from MIT

• Time Series Data Library (TSDL) from MU

• UC Riverside Time Series Dataset

20.32 Transportation

• Airlines OD Data 1987-2008

• Bay Area Bike Share Data

• Bike Share Systems (BSS) collection

• GeoLife GPS Trajectory from Microsoft Research

• German train system by Deutsche Bahn

• Hubway Million Rides in MA

• Marine Traffic - ship tracks, port calls, and more

• Montreal BIXI Bike Share

• NYC Taxi Trip Data 2009-

• NYC Taxi Trip Data 2013 (FOIA/FOILed)

• NYC Uber trip data April 2014 to September 2014

• Open Traffic collection

• OpenFlights - airport, airline, and route data

• Philadelphia Bike Share Stations (JSON)

• Plane Crash Database, since 1920

• RITA Airline On-Time Performance data

• RITA/BTS transport data collection (TranStat)

• Toronto Bike Share Stations (XML file)

• Transport for London (TFL)

• Travel Tracker Survey (TTS) for Chicago

• US Bureau of Transportation Statistics (BTS)

• US Domestic Flights 1990 to 2009

• US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome machine learning.

• APL

• C

• C++

• Common Lisp

• Clojure

• Elixir

• Erlang

• Go

• Haskell

• Java

• Javascript

• Julia

• Lua

• Matlab

• .NET

• Objective C

• OCaml

• PHP

• Python

• Ruby

• Rust

• R

• SAS

• Scala

• Swift

21.1 APL

General-Purpose Machine Learning

• naive-apl - Naive Bayesian Classifier implementation in APL.

21.2 C

General-Purpose Machine Learning

• Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

• Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF).

• Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms.

Computer Vision

• CCV - C-based/Cached/Core Computer Vision Library, a modern computer vision library.

• VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox.

Speech Recognition

• HTK - The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models.

21.3 C++

Computer Vision

• DLib - DLib has C++ and Python interfaces for face detection and training general object detectors.

• EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models.

• OpenCV - OpenCV has C++, C, Python, Java, and MATLAB interfaces and supports Windows, Linux, Android, and Mac OS.

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes ofarbitrary dimensionality with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness readability and speed in mind [DEEP LEARN-ING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series ofcomputational steps via a directed graph

bull CUDA - This is a fast C++CUDA implementation of convolutional [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework with less than 1000 lines core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11 It makes state of the art machine learningeasy to work with and integrate into existing applications

bull Disrtibuted Machine learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to imbed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that change for every training instance. Written in C++ with bindings in Python.

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and can process data in batch, online and distributed modes.

bull LightGBM - A gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs.

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization and storage.

bull shark - A fast modular feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP.

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC - A fast parallel implementation of Connectionist Temporal Classification (CTC) on both CPU and GPU

bull XGBoost - A parallelized optimized general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data and other Natural Language Processing tasks

bull CRFsuite - An implementation of Conditional Random Fields (CRFs) for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia (https://github.com/LanguageMachines/libfolia) - C++ library for the FoLiA format

bull MeTA (https://github.com/meta-toolkit/meta) - MeTA (ModErn Text Analysis) is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source, C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl - Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression and feature learning in Clojure

bull lambda-ml - Simple concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer - An English (Porter2) stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression
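
To make the idea of an in-memory n-gram index more concrete, here is a minimal Python sketch. It is not the go-ngram API: the names `char_ngrams` and `NGramIndex`, the `$` padding character, and the ranking by number of shared n-grams are all assumptions made for illustration, and the sketch does no compression.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with '$' (assumed padding symbol)."""
    pad = "$" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NGramIndex:
    """Hypothetical in-memory index mapping each n-gram to the terms that contain it."""

    def __init__(self, n=3):
        self.n = n
        self._postings = defaultdict(set)

    def add(self, term):
        # Register the term under every n-gram it contains.
        for gram in char_ngrams(term, self.n):
            self._postings[gram].add(term)

    def search(self, query):
        # Count how many distinct query n-grams each indexed term shares.
        counts = defaultdict(int)
        for gram in set(char_ngrams(query, self.n)):
            for term in self._postings.get(gram, ()):
                counts[term] += 1
        # Most shared n-grams first; ties broken alphabetically.
        return sorted(counts, key=lambda t: (-counts[t], t))


index = NGramIndex(n=3)
for word in ("stemmer", "stemming", "snowball"):
    index.add(word)
matches = index.search("stemer")  # misspelled query still ranks "stemmer" first
```

Because lookup works on shared character n-grams rather than exact matches, a misspelled query can still retrieve the intended term.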

General-Purpose Machine Learning

bull gago - Multi-population flexible parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

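bull Cortical.io - Retina: an API performing complex NLP operations (disambiguation, classification, streaming text filtering, etc.) as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words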

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System

bull Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text starting with seed sets, using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK - ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA

bull Apache cTAKES - Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license.

bull CogcompNLP - A collection of core NLP libraries developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - Statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML - Flexible, scalable machine learning (ML) language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application.

Speech Recognition

bull CMU Sphinx - Open source toolkit for speech recognition; a speech recognition library written purely in Java

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in javascript and coffeescript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3d plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - Customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - Library written in vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnet.js - ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means and agglomerative clustering

bull Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js

bull Kmeans.js - Simple Javascript implementation of the k-means algorithm, for Node.js and the browser

bull LDA.js - LDA topic modeling for Node.js

bull Learning.js - Javascript implementation of logistic regression/c4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration; just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for Node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for Node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has an advantage on large datasets and multi-threaded training.

bull kalman - Kalman filter for Javascript

bull shaman - Node.js library with support for both simple and multiple linear regression

bull mljs - Machine learning and numerical analysis tools for Node.js and the Browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

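bull simple-statistics - A JavaScript implementation of descriptive, regression, and inference statistics, designed to work in all modern browsers as well as in Node.js

bull regression-js - A javascript library containing a collection of least squares fitting methods for finding a trend in a set of data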

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance
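
The great-circle distance that a library such as GreatCircle computes can be illustrated with the haversine formula. The Python below is a minimal sketch, not the GreatCircle API: the function name `haversine_km` and the mean Earth radius constant are assumptions chosen for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance between two points given as (lat, lon) in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 so rounding error near antipodal points cannot break asin.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the way around the equator.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
```

The formula treats the Earth as a sphere, so results differ slightly from ellipsoidal methods over long distances.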

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models - A Julia package for fitting (statistical) mixed-effects models

bull Simple MCMC - Basic MCMC sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba - Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata SAS and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy.

bull autograd - Autograd automatically differentiates native Torch code Inspired by the original Python version

bull graph - Graph package for Torch

bull randomkit - Numpy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - Framework for Torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns.

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua and then optimizing them with OpenGM.

bull spaghetti - Spaghetti (sparse linear) module for torch7 by @MichaelMathieu

bull LuaSHKit - A Lua wrapper around the Locality sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images.

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos.

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms.

bull stitch - Allows us to use hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch Provides SIFT and dSIFT modules

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning with PDF, lectures, code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB.

Data Analysis / Data Visualization

bull matlab_bgl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.

bull DiffSharp - An automatic differentiation (AD) library providing exact and efficient derivatives for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and everyday use. Supports .NET 4.0, .NET 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.

bull Sho - Sho is an interactive environment for data analysis and scientific computing that lets you seamlessly connect scripts (in IronPython) with compiled code (in .NET) to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples by trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.

bull MAChineLearning - An Objective-C multilayer perceptron library with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift.

bull BPN-NeuralNetwork - A back-propagation neural network (BPN). This network can be used in product recommendation, user behavior analysis, data mining and data analysis.

bull Multi-Perceptron-NeuralNetwork - A multi-perceptron neural network based on back propagation, designed with unlimited hidden layers

bull KRHebbian-Algorithm - A Hebbian learning algorithm for neural networks in Machine Learning

bull KRKmeans-Algorithm - Implements K-Means, a clustering and classification algorithm. It could be used in data mining and image compression.

bull KRFuzzyCMeans-Algorithm - Implements Fuzzy C-Means, a fuzzy clustering and classification algorithm. It could be used in data mining and image compression.

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library.

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.

bull Quepy - A python framework to transform natural language questions to queries in a database query language

bull TextBlob - A library for common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables and GIZA++ alignments.

bull python-ucto

bull python-frog

bull python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser and dependency parser for English

bull colibri-core - Python binding to a C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - higher-level NLP built on Spacy

bull stanford-corenlp-python - Python wrapper for Stanford CoreNLP
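Several entries in this list (Distance, editdistance, jellyfish, Fuzzy Wuzzy) deal with string distances. As a rough illustration of what such libraries compute, here is a minimal pure-Python sketch of the Hamming and Levenshtein distances. The function names `hamming` and `levenshtein` are chosen for this example only and are not the API of any library listed here.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (dynamic programming, keeping
    only one row of the table at a time)."""
    previous = list(range(len(b) + 1))  # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))
    print(levenshtein("kitten", "sitting"))
```

The dedicated libraries are usually the better choice in practice, since several of them describe themselves as fast implementations; the sketch is only meant to show the underlying computation.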

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning.

bull machine learning - Automated build consisting of a web-interface and a set of programmatic interfaces, backed by a NoSQL datastore

bull XGBoost - Python bindings for the eXtreme Gradient Boosting (Tree) Library

bull Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library.

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A machine learning library implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 - A Machine Learning library based on Theano

bull keras - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain.

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks - Implementation of image to image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Library focusing on the Bayesian nonparametric extensions of hidden Markov and hidden semi-Markov models, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing GPU-meta-programming code generating array oriented optimizing math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. TiMBL is an elaborate k-Nearest Neighbours machine learning toolkit.

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for the book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning.

bull pgmpy - A python library for working with Probabilistic Graphical Models

bull DIGITS - A web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides better user experience.

bull rgf_python - Python bindings for the Regularized Greedy Forest library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more; it's easy to use and easy to learn for beginners

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Python JIT compiler to LLVM aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser.

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing submitting debugging and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual, intuitive, and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has a Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in python, without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspel - raspell is an interface binding for ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM, which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for a Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community based data collection, packed in a gem. Get a list of pretty much anything (stop words, countries, non words) in txt, json or hash. Demo/Search for a list.

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license

bull rustlearn - A machine learning framework featuring logistic regression, support vector machines, decision trees and random forests

bull rusty-machine - A pure-Rust machine learning library

bull leaf - Open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz: Regularization for semiparametric additive hazards regression

bull arules - arules: Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso: Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm: Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta: A wrapper algorithm for all-relevant feature selection

bull bst - bst: Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training: unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn: Classification, regression, feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist: Rule- and Instance-Based Regression Modeling

bull e1071 - e1071: Misc functions of the Department of Statistics, TU Wien

bull earth - earth: Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree: Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc: Flexible procedures for clustering

bull frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS

bull gbm - gbm: Generalized Boosted Regression Models

bull glmnet - glmnet: Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso: Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg: Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deeplearning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda: Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred: Improved Predictors

bull kernlab - kernlab: Kernel-based Machine Learning Lab

bull klaR - klaR: Classification and visualization

bull lars - lars: Least Angle Regression, Lasso and Forward Stagewise

bull lasso2 - lasso2: L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg: Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree: Mapping, pruning, and graphing tree models

bull mboost - mboost: Model-Based Boosting

bull medley - medley: Blending regression models, using a greedy stepwise approach

bull mlr - mlr: Machine Learning in R

bull mvpart - mvpart: Multivariate partitioning

bull ncvreg - ncvreg: Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree: Oblique Trees for Classification Data

bull pamr - pamr: Pam: prediction analysis for microarrays

bull party - party: A Laboratory for Recursive Partytioning

bull partykit - partykit: A Toolkit for Recursive Partytioning

bull penalized - penalized: Penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA: Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM: Feature Selection SVM using penalty functions

bull quantregForest - quantregForest: Quantile Regression Forests

bull randomForest - randomForest: Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle: Graphical user interface for data mining in R

bull rda - rda: Shrunken Centroids Regularized Discriminant Analysis

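bull rdetools - rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces

bull REEMtree - REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data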

bull relaxo - relaxo: Relaxed Lasso

bull rgenoud - rgenoud: R version of GENetic Optimization Using Derivatives

bull rgp - rgp: R genetic programming framework

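bull Rmalschains - Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R

bull rminer - rminer: Simpler use of data mining methods in classification and regression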

bull ROCR - ROCR: Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets: Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart: Recursive Partitioning and Regression Trees

bull RPMM - RPMM: Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka: R/Weka interface

bull RXshrink - RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda: Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA: Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath: the SVM Path algorithm

bull tgp - tgp: Bayesian treed Gaussian process models

bull tree - tree: Classification and regression trees

bull varSelRF - varSelRF: Variable selection using random forests

bull XGBoost.R - R binding for the eXtreme Gradient Boosting (Tree) Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R.

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis / Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis / Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE! Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment, including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.

Data Analysis / Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data. A unix filter for algebird-powered aggregation.

bull PredictionIO - PredictionIO, a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe - Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

21.25 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems.

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression, and Ridge Regression.

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc.

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization / Training Techniques

ndash Unsupervised / Generative Models

ndash Image Segmentation / Object Detection

ndash Image / Video

ndash Natural Language Processing

ndash Speech / Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

22.1 Machine Learning

Be the first to contribute

22.2 Deep Learning

Forked from terryum's awesome deep learning papers

22.2.1 Understanding

bull Distilling the knowledge in a neural network (2015), G. Hinton et al.

bull Deep neural networks are easily fooled: High confidence predictions for unrecognizable images (2015), A. Nguyen et al.

bull How transferable are features in deep neural networks? (2014), J. Yosinski et al.

bull CNN features off-the-Shelf: An astounding baseline for recognition (2014), A. Razavian et al.

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014), M. Oquab et al.

bull Visualizing and understanding convolutional networks (2014), M. Zeiler and R. Fergus

bull Decaf: A deep convolutional activation feature for generic visual recognition (2014), J. Donahue et al.

22.2.2 Optimization / Training Techniques

bull Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Ioffe and C. Szegedy

bull Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), K. He et al.

bull Dropout: A simple way to prevent neural networks from overfitting (2014), N. Srivastava et al.

bull Adam: A method for stochastic optimization (2014), D. Kingma and J. Ba

bull Improving neural networks by preventing co-adaptation of feature detectors (2012), G. Hinton et al.

bull Random search for hyper-parameter optimization (2012), J. Bergstra and Y. Bengio
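The last entry in this list concerns random search over hyper-parameters. As a rough sketch of the general idea only (not the procedure or experiments from the paper), the following samples each hyper-parameter independently from a range and keeps the best-scoring configuration. The objective `toy_validation_loss`, the parameter names and the ranges are hypothetical placeholders standing in for a real training run.

```python
import math
import random


def toy_validation_loss(learning_rate: float, num_units: int) -> float:
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (math.log10(learning_rate) + 2.0) ** 2 + ((num_units - 64) / 64.0) ** 2


def random_search(n_trials: int, seed: int = 0):
    """Draw n_trials random configurations and return (best_loss, best_config)."""
    rng = random.Random(seed)
    best_loss, best_config = math.inf, None
    for _ in range(n_trials):
        config = {
            # log-uniform over roughly [1e-5, 1]
            "learning_rate": 10 ** rng.uniform(-5, 0),
            # uniform over the integers 8..256
            "num_units": rng.randint(8, 256),
        }
        loss = toy_validation_loss(**config)
        if loss < best_loss:
            best_loss, best_config = loss, config
    return best_loss, best_config


if __name__ == "__main__":
    loss, config = random_search(n_trials=50)
    print(loss, config)
```

Unlike a grid, every trial here tries a new value of every hyper-parameter, so the trial budget can be chosen freely rather than being fixed by the grid size.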

22.2.3 Unsupervised / Generative Models

bull Pixel recurrent neural networks (2016), A. Oord et al.

bull Improved techniques for training GANs (2016), T. Salimans et al.

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015), A. Radford et al.

bull DRAW: A recurrent neural network for image generation (2015), K. Gregor et al.

bull Generative adversarial nets (2014), I. Goodfellow et al.

bull Auto-encoding variational Bayes (2013), D. Kingma and M. Welling

bull Building high-level features using large scale unsupervised learning (2013), Q. Le et al.

22.2.4 Image Segmentation / Object Detection

bull You only look once: Unified, real-time object detection (2016), J. Redmon et al.

bull Fully convolutional networks for semantic segmentation (2015), J. Long et al.

bull Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015), S. Ren et al.

bull Fast R-CNN (2015), R. Girshick

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014), R. Girshick et al.

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

2225 Image Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

2226 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

2227 Speech Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

2228 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

2229 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22210 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009) H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses, and more, forked from josephmisiti's awesome machine learning list.

bull Blogs

ndash Data Science

ndash Machine learning

ndash Math

bull Books

ndash Machine learning

ndash Deep learning

ndash Probability & Statistics

ndash Linear Algebra

bull Courses

bull Podcasts

bull Tutorials

231 Blogs

2311 Data Science

bull https://jeremykun.com

bull http://iamtrask.github.io

bull http://blog.explainmydata.com

bull http://andrewgelman.com

bull http://simplystatistics.org

bull http://www.evanmiller.org

bull http://jakevdp.github.io

bull http://blog.yhat.com

bull http://wesmckinney.com

bull http://www.overkillanalytics.net

bull http://newton.cx/~peter

bull http://mbakker7.github.io/exploratory_computing_with_python

bull https://sebastianraschka.com/blog/index.html

bull http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

bull http://colah.github.io

bull http://www.thomasdimson.com

bull http://blog.smellthedata.com

bull https://sebastianraschka.com

bull http://dogdogfish.com

bull http://www.johnmyleswhite.com

bull http://drewconway.com/zia

bull http://bugra.github.io

bull http://opendata.cern.ch

bull https://alexanderetz.com

bull http://www.sumsar.net

bull https://www.countbayesie.com

bull http://blog.kaggle.com

bull http://www.danvk.org

bull http://hunch.net

bull http://www.randalolson.com/blog

bull https://www.johndcook.com/blog/r_language_for_programmers

bull http://www.dataschool.io

2312 Machine learning

bull OpenAI

bull Distill

bull Andrej Karpathy Blog

bull Colah's Blog

bull WildML

bull FastML

bull TheMorningPaper

2313 Math

bull http://www.sumsar.net

bull http://allendowney.blogspot.ca

bull https://healthyalgorithms.com

bull https://petewarden.com

bull http://mrtz.org/blog

232 Books

2321 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory, Inference, and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and S.V.N. Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting: principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch

bull Bayesian Modeling, Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning, Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning - Book + Matlab ToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

2322 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

bull NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

2323 Probability & Statistics

bull Think Stats - Book + Python Code

bull From Algorithms to Z-Scores - Book

bull The Art of R Programming

bull Introduction to statistical thought

bull Basic Probability Theory

bull Introduction to probability - By Dartmouth College

bull Principle of Uncertainty

bull Probability & Statistics Cookbook

bull Advanced Data Analysis From An Elementary Point of View

bull Introduction to Probability - Book and course by MIT

bull The Elements of Statistical Learning: Data Mining, Inference, and Prediction - Book

bull An Introduction to Statistical Learning with Applications in R - Book

bull Learning Statistics Using R

bull Introduction to Probability and Statistics Using R - Book

bull Advanced R Programming - Book

bull Practical Regression and Anova using R - Book

bull R practicals - Book

bull The R Inferno - Book

2324 Linear Algebra

bull Linear Algebra Done Wrong

bull Linear Algebra Theory and Applications

bull Convex Optimization

bull Applied Numerical Computing

bull Applied Numerical Linear Algebra

233 Courses

bull CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University

bull CS224d: Deep Learning for Natural Language Processing, Stanford University

bull Oxford Deep NLP 2017: Deep Learning for Natural Language Processing, University of Oxford

bull Artificial Intelligence (Columbia University) - free

bull Machine Learning (Columbia University) - free

bull Machine Learning (Stanford University) - free

bull Neural Networks for Machine Learning (University of Toronto) - free

bull Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach; Machine Learning: Regression; Machine Learning: Classification; Machine Learning: Clustering & Retrieval; Machine Learning: Recommender Systems & Dimensionality Reduction; Machine Learning Capstone: An Intelligent Application with Deep Learning - free

bull Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

bull Learning from Data (by Yaser S Abu-Mostafa Caltech) - Lecture videos available

234 Podcasts

bull The O'Reilly Data Show

bull Partially Derivative

bull The Talking Machines

bull The Data Skeptic

bull Linear Digressions

bull Data Stories

bull Learning Machines 101

bull Not So Standard Deviations

bull TWIMLAI

235 Tutorials

Be the first to contribute

CHAPTER 24

Contribute

Become a contributor. Check out our GitHub for more information.

CHAPTER 1

Bash

bull Introduction

bull Subj_1

11 Introduction

xx

12 Subj_1

Code

Some code

print("hello")

References

CHAPTER 2

Big Data

bull Introduction

bull Subj_1

https://www.quora.com/What-is-a-Hadoop-ecosystem

21 Introduction

The tooling and methodologies for dealing with the phenomenon of big data were originally created in response to the exponential growth in data generation (TODO: INSERT quote about data generation). When talking about big data, we mean (insert what we mean).

We'll be focusing exclusively on the Apache Hadoop Ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims.

Why It's Important

While a data scientist may not be directly involved in managing big data and performing Extract, Transform, and Load (ETL) tasks, they'll likely be responsible, at a minimum, for querying the sources where this data is stored. Ideally, a data scientist should be familiar with how this data is stored, how to query it, and how to transform it (e.g. Spark, Hadoop, and Hive would be three key technologies to be familiar with).

22 Subj_1

Code

Some code

print("hello")

References

CHAPTER 3

Databases

bull Introduction

bull Relational

ndash Types

ndash Advantages

ndash Disadvantages

bull Non-relational/NoSQL

ndash Types

ndash Advantages

ndash Disadvantages

bull Terminology

31 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing, and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds are a variety of database sub-types, which the following sections explore in detail.

Why It's Important

Oftentimes, a data scientist will have to work with a database in order to retrieve the data they need to build their models or to do data analysis. Thus, it's important that a data scientist familiarize themselves with the different types of databases and how to query them.

32 Relational

A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes), with rows that represent the data within the table. You can think of the column as the header indicating the name of the data being stored, and the row as the actual data points that are stored in the database. Oftentimes each row has a unique identifier associated with it, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.
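As a rough illustration of primary and foreign keys, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and rows are invented for this example and are not taken from the primer.

```python
import sqlite3

# In-memory database; every table, column, and value here is a made-up example.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Each table has a primary key; posts.author_id is a foreign key into authors.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE posts ("
    " id INTEGER PRIMARY KEY,"
    " author_id INTEGER NOT NULL REFERENCES authors(id),"
    " title TEXT NOT NULL)"
)

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO posts (id, author_id, title) VALUES (10, 1, 'Hello SQL')")

# The foreign key is what lets us join each post back to its author.
rows = conn.execute(
    "SELECT authors.name, posts.title "
    "FROM posts JOIN authors ON posts.author_id = authors.id"
).fetchall()
print(rows)
```

The join returns one row per post, with the author's name pulled in from the other table through the key relationship.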

To get data out of a relational database you would use SQL, with different relational databases using slightly different versions of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a transaction. Transactions in relational databases represent a unit of work performed against the database.

An important concept with transactions in relational databases is maintaining ACID, which stands for:

bull Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails, then the whole transaction fails, leaving the database unchanged.

bull Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.

bull Isolation: Since transactions are often executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.

bull Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
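Atomicity in particular is easy to see in a small experiment. The following is a minimal sketch using Python's sqlite3 module; the accounts table, the CHECK constraint, and the transfer helper are hypothetical and exist only to force one query in a transaction to fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: balances may never go negative.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single transaction (illustrative helper)."""
    try:
        with conn:  # commits if the block succeeds, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = transfer(conn, "alice", "bob", 30)    # both updates succeed and are committed
second = transfer(conn, "alice", "bob", 500)  # second update violates the CHECK, so the first is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(first, second, balances)
```

After the failed transfer, bob's balance does not include the 500 that was credited before the failure, which is the "whole transaction fails" behavior described in the Atomicity bullet.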

321 Types

While there are many different kinds of relational databases, the most popular are:

bull PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional

bull Oracle: A multi-model database management system that is produced and marketed by Oracle

bull MySQL: An open-source relational database management system, with a proprietary paid version available for additional functionality

bull Microsoft SQL Server: A relational database management system developed by Microsoft

bull MariaDB: A MySQL-compatible database engine forked from MySQL

322 Advantages

bull SQL standards are well defined and commonly accepted

bull Easy to categorize and store data that can later be queried

bull Simple to understand, since the table structure and relationships are intuitive to most users

bull Data integrity through strong data typing and validity checks that make sure data falls within an acceptable range

323 Disadvantages

bull Poor performance with unstructured data types due to schema and type constraints

bull Can be slower and harder to scale horizontally than NoSQL databases

bull Poorly suited to certain kinds of data, such as graphs

33 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Additionally, scalability, multi-structured data, and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

331 Types

bull Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to store user session data; examples include Redis and Amazon Dynamo.

bull Document Store: Similar to a key-value store; however, in this case the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.

bull Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. Column stores can be used in content management systems; examples include Cassandra and Apache HBase.

bull Graph Store: These are based on the Entity - Attribute - Value model. Entities have associated attributes and subsequent values when data is inserted. Nodes store data about each entity, along with the relationships between nodes. Graph stores can be used in applications such as social networks; examples include Neo4j and ArangoDB.
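To make the four data models above more concrete, here is a small sketch that mimics each shape with plain Python data structures. No real database is involved, and all keys, field names, and values are invented; real systems such as Redis or MongoDB have their own APIs, which this does not attempt to reproduce.

```python
# Plain-Python stand-ins for the four NoSQL data models (illustrative only).

# Key-value store: an opaque value looked up by a unique key.
sessions = {"session:42": "user=ada;expires=3600"}

# Document store: the value is a structured document that can be searched by field.
posts = {
    "post:1": {
        "title": "Hello NoSQL",
        "tags": ["intro", "databases"],
        "author": {"name": "Ada"},
    },
}

# Column store: values are grouped by column (inside a column family), not by row.
profile_family = {
    "name": {"row1": "Ada", "row2": "Grace"},
    "country": {"row1": "UK", "row2": "US"},
}

# Graph store: nodes hold entity data, edges hold the relationships between nodes.
nodes = {"ada": {"kind": "person"}, "grace": {"kind": "person"}}
edges = [("ada", "FOLLOWS", "grace")]

# One typical access pattern per model.
session_value = sessions["session:42"]
tagged_intro = [key for key, doc in posts.items() if "intro" in doc["tags"]]
all_names = sorted(profile_family["name"].values())
followed_by_ada = [dst for src, rel, dst in edges if src == "ada" and rel == "FOLLOWS"]
print(session_value, tagged_intro, all_names, followed_by_ada)
```

Each access pattern hints at what the model is good at: direct lookup by key, filtering on fields inside a document, scanning a single column, and following relationships.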

332 Advantages

bull High availability

bull Schema free or schema-on-read options

bull Ability to rapidly prototype applications

bull Elastic scalability

bull Can store massive amounts of data

333 Disadvantages

bull Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync

bull Less support and maturity in the NoSQL ecosystem

34 Terminology

Query: A query can be thought of as a single action that is taken on a database.

Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as in data stored. There are two main types. Vertical scalability is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc. Horizontal scalability has to do with adding more machines and splitting the work amongst them.

Normalization: This is a technique for organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique for organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
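The difference between the last two terms can be sketched with sqlite3. This is only an illustration under assumed table names (customers, orders, orders_wide) and invented rows; it is not a recommendation for either layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: customer details live in one place and orders refer to them by key.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Oslo')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 1, 9.5), (2, 1, 20.0)])

# Denormalized: the join is done once and the city is copied onto every order row.
conn.execute(
    "CREATE TABLE orders_wide AS "
    "SELECT orders.id AS order_id, orders.total, customers.city "
    "FROM orders JOIN customers ON orders.customer_id = customers.id"
)

# Reading from the wide table needs no JOIN, at the cost of repeating 'Oslo'.
wide_rows = conn.execute(
    "SELECT order_id, total, city FROM orders_wide ORDER BY order_id"
).fetchall()
print(wide_rows)
```

In the normalized layout a change of city touches a single row in customers, while in the wide table every copied row would need updating; that is the redundancy trade-off the two definitions describe.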

References

bull https://dzone.com/articles/the-types-of-modern-databases

bull https://www.mongodb.com/nosql-explained

bull https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect

bull https://opensourceforu.com/2017/05/different-types-nosql-databases

CHAPTER 4

Data Engineering

bull Introduction

bull Subj_1

41 Introduction

xxx

42 Subj_1

Code

Some code

print("hello")

References

CHAPTER 5

Data Wrangling

bull Introduction

bull Subj_1

51 Introduction

xxx

52 Subj_1

Code

Some code

print("hello")

References

CHAPTER 6

Data Visualization

bull Introduction

bull Subj_1

61 Introduction

xxx

62 Subj_1

Code

Some code

print("hello")

References

CHAPTER 7

Deep Learning

bull Introduction

bull Subj_1

71 Introduction

xxx

72 Subj_1

Code

Some code

print("hello")

References

CHAPTER 8

Machine Learning

bull Introduction

bull Subj_1

81 Introduction

xxx

82 Subj_1

Code

Some code

print("hello")

References

CHAPTER 9

Python

bull Introduction

bull Subj_1

91 Introduction

xxx

92 Subj_1

Code

Some code

print("hello")

References

CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References

CHAPTER 11

SQL

bull Introduction

bull Subj_1

111 Introduction

xxx

112 Subj_1

Code

Some code

print("hello")

References

CHAPTER 12

Glossary

Definitions of common machine learning terms.

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

bull Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in a low average difference.

bull High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.
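The two bullets under Bias metric can be checked with a few numbers. The sketch below assumes the bias metric is the mean signed error; the helper name and the sample values are made up for illustration.

```python
def mean_error(predictions, actuals):
    """Average signed difference between predictions and true values."""
    return sum(p - a for p, a in zip(predictions, actuals)) / len(actuals)

actuals = [10.0, 20.0, 30.0, 40.0]
offsetting = [12.0, 18.0, 33.0, 37.0]  # errors of +2, -2, +3, -3 cancel out
too_low = [7.0, 17.0, 27.0, 37.0]      # every prediction is 3 below the truth

bias_offsetting = mean_error(offsetting, actuals)
bias_too_low = mean_error(too_low, actuals)
print(bias_offsetting, bias_too_low)
```

The first model has zero bias even though none of its predictions is correct, which is why low bias alone says little about accuracy.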

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold: The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise return False.
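A classification threshold amounts to a single comparison. In this minimal sketch the probabilities and the stricter 0.8 threshold are invented values, and classify is a hypothetical helper rather than a library function.

```python
def classify(probabilities, threshold=0.5):
    """Return True where the predicted probability exceeds the threshold."""
    return [p > threshold for p in probabilities]

probs = [0.10, 0.45, 0.51, 0.90]
default_labels = classify(probs)               # threshold of 50%
strict_labels = classify(probs, threshold=0.8)  # only very confident cases are positive
print(default_labels, strict_labels)
```

Raising the threshold turns the borderline 0.51 case from a positive into a negative prediction.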

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories:

bull True Positives: we correctly predicted they do have diabetes

bull True Negatives: we correctly predicted they don't have diabetes

bull False Positives: we incorrectly predicted they do have diabetes (Type I error)

bull False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
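The four categories above can be counted directly from paired labels. The following sketch uses boolean labels (True meaning "has diabetes"); the helper and the eight sample patients are invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Count the four outcome types for boolean labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p:
            counts["TP" if a else "FP"] += 1  # predicted positive
        else:
            counts["FN" if a else "TN"] += 1  # predicted negative
    return counts

actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
counts = confusion_counts(actual, predicted)
print(counts)
```

Here one sick patient was missed (a false negative) and one healthy patient was flagged (a false positive).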

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. we suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep learning grew out of a machine learning algorithm called the perceptron, or multi-layer perceptron, which is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning is decades old, but today we have far more data and cheap computing power, which make this algorithm powerful enough to achieve state-of-the-art accuracy. This family of algorithms is known as artificial neural networks. Deep learning is much more than a traditional artificial neural network, but it was highly influenced by machine learning's neural networks and perceptron networks.

Dimension: In machine learning and data science, dimension means something different than it does in physics. The dimension of data is how many features you have in your dataset. E.g. in an object detection application, the flattened image size and color channels (e.g. 28*28*3 values) are the features of the input set. In house price prediction, house size may be the only feature in the dataset, so we call it 1-dimensional data.

Epoch: One epoch is one complete pass of the algorithm over the entire data set; the number of epochs is the number of times the algorithm sees the entire data set.

Extrapolation: Making predictions outside the range of a dataset. E.g. my dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. "Color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection: Feature selection is the process of selecting relevant features from a dataset for creating a machine learning model.

Feature Vector: A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters: Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of the model. The depth of trees in a decision tree or the number of hidden layers in a neural network are examples of hyperparameters.

Induction: A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. we keep observing X, so we infer that Y must be True.

Instance: A data point, row, or sample in a dataset. Another term for observation.

Learning Rate: The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient, since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
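The trade-off described under Learning Rate shows up even on a toy problem. This sketch runs plain gradient descent on f(x) = x**2, a function chosen only for illustration; the three learning rates, the step count, and the starting point are arbitrary assumptions.

```python
def gradient_descent(learning_rate, steps=20, start=10.0):
    """Minimize f(x) = x**2, whose gradient is 2*x, and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # step against the gradient
    return x

x_low = gradient_descent(0.01)   # small steps: still far from the minimum at 0
x_good = gradient_descent(0.3)   # moderate steps: very close to 0
x_high = gradient_descent(1.1)   # steps too large: each update overshoots and diverges
print(x_low, x_good, x_high)
```

With the same 20 steps, the low rate has barely moved, the moderate rate has essentially reached the bottom, and the high rate ends up further from the minimum than where it started.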

Loss: Loss = true value (from the dataset) - predicted value (from the ML model). The lower the loss, the better the model (unless the model has overfitted to the training data). The loss is calculated on the training and validation sets, and its interpretation is how well the model is doing on these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation set.
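As a sketch of the Loss entry, the example below sums squared errors over a handful of examples. Squaring each difference is an assumption added here so that positive and negative errors do not cancel; it is one common choice, not the only one, and the numbers are invented.

```python
def sum_squared_error(true_values, predictions):
    """Sum of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(true_values, predictions))

true_values = [3.0, 5.0, 2.0]
close_predictions = [2.5, 5.0, 2.5]
poor_predictions = [0.0, 9.0, 4.0]

loss_close = sum_squared_error(true_values, close_predictions)
loss_poor = sum_squared_error(true_values, poor_predictions)
print(loss_close, loss_poor)
```

The model whose predictions sit closer to the true values gets the lower loss, and neither number is a percentage.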

Machine Learning: Contribute a definition

Model: A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks: Contribute a definition

Normalization: Contribute a definition

Null Accuracy: Baseline accuracy that can be achieved by always predicting the most frequent class ("B has the highest frequency, so let's guess B every time").
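Null accuracy is quick to compute from the label counts alone. A minimal sketch, with a hypothetical helper and an invented list of ten labels:

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

labels = ["B", "A", "B", "C", "B", "B", "A", "B", "B", "B"]
baseline = null_accuracy(labels)
print(baseline)
```

Any model trained on these labels should beat this baseline before its accuracy is considered meaningful.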

Observation: A data point, row, or sample in a dataset. Another term for instance.

Overfitting: Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set, but poorly on your test set (or new real-world data).

Parameters: Be the first to contribute

Precision: In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

119875 =119879119903119906119890119875119900119904119894119905119894119907119890119904

119879119903119906119890119875119900119904119894119905119894119907119890119904+ 119865119886119897119904119890119875119900119904119894119905119894119907119890119904

Recall: Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

$$R = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}}$$

Recall vs Precision: Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the % of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren't helpful, however, because we missed the other 9 brain tumors. We were super precise when we tried, but we didn't try hard enough.

• Recall, or sensitivity, provides another lens with which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
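The brain scan example can be checked with a few lines of Python. This is a sketch using the counts from the scenario above (100 images, 10 tumors, one correct positive guess); the function and variable names are illustrative.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were caught."""
    return tp / (tp + fn)


# 100 scans, 10 with tumors; the model flags a single scan and is right.
tp, fp, fn, tn = 1, 0, 9, 90

p = precision(tp, fp)
r = recall(tp, fn)
print(p, r)  # 1.0 0.1
```

Looking at both numbers together exposes the weakness that precision alone hides.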

Regression: Predicting a continuous output (e.g. price, sales).

Regularization: Contribute a definition

Reinforcement Learning: Training a model to maximize a reward via iterative trial and error.

Segmentation: Contribute a definition

Specificity: In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{\text{TrueNegatives}}{\text{TrueNegatives} + \text{FalsePositives}}$$
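A short sketch of how predicting everything as negative games specificity. The counts are hypothetical (100 observations, 10 of them truly "Yes"), and recall is computed alongside to show what the trick costs.

```python
def specificity(tn, fp):
    """Share of actual negatives that were predicted as negative."""
    return tn / (tn + fp)


def recall(tp, fn):
    """Share of actual positives that were predicted as positive."""
    return tp / (tp + fn)


# A model that answers "No" for every one of the 100 observations.
tn, fp, fn, tp = 90, 0, 10, 0

spec = specificity(tn, fp)
rec = recall(tp, fn)
print(spec, rec)  # 1.0 0.0
```

Perfect specificity here comes at the price of catching none of the positives.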

Supervised Learning: Training a model using a labeled dataset.

Test Set: A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set: A set of observations used to generate machine learning models.

Transfer Learning: Contribute a definition

Type 1 Error: False positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and they hire him, but he is actually bad.

Type 2 Error: False negatives. The candidate was great, but the company passed on him.

Underfitting: Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem: A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning: Training a model to find patterns in an unlabeled dataset (e.g. clustering).

Validation Set: A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.
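The "pause training" rule is often automated as early stopping. The sketch below is one simple variant under assumed settings: the validation losses are made-up numbers, and the `patience` parameter (how many non-improving epochs to tolerate) is a hypothetical name, not something defined in this primer.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the best epoch seen before training is halted.

    Training halts once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


validation_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
stop_at = early_stopping_epoch(validation_losses)
print(stop_at)  # 3: validation loss rises after this epoch
```

In practice you would keep the model parameters saved at the returned epoch rather than the ones from the final epoch.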

Variance: How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.

References

CHAPTER 13

Business

• Introduction

• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 14

Ethics

• Introduction

• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

References

CHAPTER 15

Mastering The Data Science Interview

• Introduction

• The Coding Challenge

• The HR Screen

• The Technical Call

• The Take Home Project

• The Onsite

• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that data science would be the sexiest job of the 21st century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is that most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can vary hugely. This is partly because, as an industry, there still isn't an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it's important to determine what flavor of data science appeals to you. Based on your response to that, what you study and what questions you'll be asked will vary. Despite the differences in the types, generally speaking they'll follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple FizzBuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.

More often than not, you'll be provided with written test cases that will tell you if you've passed or failed a question. This will typically consider both correctness and complexity (i.e. how long did it take to run your code). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.
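As an illustration of writing your own tests, here is a sketch of the classic FizzBuzz question with a handful of self-written test cases. The exact function signature is an assumption; real challenges define their own input and output formats.

```python
def fizzbuzz(n):
    """Return "Fizz", "Buzz", "FizzBuzz", or the number itself as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


def run_tests():
    # Cover each branch, including the combined multiple-of-15 case.
    assert fizzbuzz(1) == "1"
    assert fizzbuzz(3) == "Fizz"
    assert fizzbuzz(5) == "Buzz"
    assert fizzbuzz(15) == "FizzBuzz"
    assert fizzbuzz(30) == "FizzBuzz"


run_tests()
print("all tests passed")
```

Writing the edge cases down first makes it much harder to submit a solution that only handles the example in the prompt.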

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the 'correct' solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally when, even after passing all the test cases, you didn't get to the next stage in the interview process.

Preparation

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first. When you hit a snag, move to a simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5 to 10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally, you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind that the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally, you may be asked to write code in a Google Doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving dynamic programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
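To make the "brute force first, then optimize" workflow concrete, here is a sketch on a commonly asked question: find two numbers in a list that add up to a target. The problem choice and function names are assumptions for illustration; the primer does not name a specific question.

```python
def two_sum_brute_force(numbers, target):
    """Check every pair: O(n^2) time, but easy to get right and to explain."""
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if numbers[i] + numbers[j] == target:
                return i, j
    return None


def two_sum_optimized(numbers, target):
    """Single pass with a dictionary of values seen so far: O(n) time."""
    seen = {}  # value -> index where it was seen
    for j, value in enumerate(numbers):
        complement = target - value
        if complement in seen:
            return seen[complement], j
        seen[value] = j
    return None


numbers = [2, 7, 11, 15]
print(two_sum_brute_force(numbers, 9))  # (0, 1)
print(two_sum_optimized(numbers, 9))    # (0, 1)
```

Having the brute-force version on the screen also gives you a reference to test the optimized version against.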

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking the Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied to what you'll be doing once you start working. They can either occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g. "Here's a dataset, find some insights, and pitch to business stakeholders") or focus on a more concrete deliverable (e.g. "Here's some data, build a classifier").

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be exploratory data analysis (EDA). Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution in customer revenue).
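A minimal sketch of a first EDA pass, using only the Python standard library. The revenue values and the `summarize` helper are hypothetical and invented for illustration; a real take-home would more likely be explored with a dataframe library.

```python
import statistics


def summarize(values):
    """Basic checks for one numeric column: size, missing values, center, extremes."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "max": max(present),
    }


# Hypothetical customer revenue column with one missing value and one huge customer.
revenue = [12, 15, 9, None, 11, 14, 10, 13, 950, 8]

summary = summarize(revenue)
print(summary)
```

A mean far above the median, as in this made-up column, is one quick hint of a heavy right tail worth plotting and investigating further.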

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or find by looking at the answers without the questions on this GitHub repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite. The rationale is that, despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more Leetcode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which is typically evaluating your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioural interview questions such as "Tell me about yourself", "Why this company?", and "Tell me about a time you had to deal with a difficult colleague".

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desks working on their assigned projects. So try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500 company, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially for those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point, it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed back by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the option to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, the signing/relocation bonus is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401k matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.
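The "higher multiplier" argument can be checked with a little arithmetic. The sketch below assumes purely hypothetical figures (a 100,000 base with a 10,000 signing bonus versus a 105,000 base, a 3% annual raise, four years) and ignores taxes, stock, and benefits.

```python
def total_base_pay(base_salary, annual_raise, years):
    """Sum the base salary paid over `years`, applying the raise after each year."""
    total = 0.0
    salary = base_salary
    for _ in range(years):
        total += salary
        salary *= 1 + annual_raise
    return total


YEARS = 4      # hypothetical time spent at the company
RAISE = 0.03   # hypothetical annual raise

offer_with_bonus = total_base_pay(100_000, RAISE, YEARS) + 10_000
offer_with_higher_base = total_base_pay(105_000, RAISE, YEARS)

print(round(offer_with_bonus), round(offer_with_higher_base))
```

Under these assumptions the one-time bonus is ahead after the first year, but the higher base overtakes it as the raises compound, which is why the better lever depends on how long you expect to stay.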

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight on what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.

CHAPTER 16

Learning How To Learn

• Introduction

• Upgrading Your Learning Toolbox

• Common Learning Traps

• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall: This strategy consists of quizzing yourself on topics you've been learning, with the most common way of doing this being flashcards. One study showed a retention difference of 46% between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spaced repetition is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
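The review schedule can be turned into concrete calendar dates. This sketch assumes the 1, 3, 7, and 21 day intervals are all counted from the day the material was first studied (the text does not say whether they are cumulative), and the function name is hypothetical.

```python
from datetime import date, timedelta

# Intervals from the text, interpreted as days after the first study session.
REVIEW_OFFSETS_DAYS = (1, 3, 7, 21)


def review_dates(first_studied):
    """Return the dates on which the material should be reviewed."""
    return [first_studied + timedelta(days=d) for d in REVIEW_OFFSETS_DAYS]


schedule = review_dates(date(2019, 3, 1))
for review in schedule:
    print(review.isoformat())
```

Most flashcard apps automate a schedule like this, but writing the dates into a calendar works just as well.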

Chunking: This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving: To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type led to students not being able to tell the difference between problems later on, and showed that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about machine learning, I combined technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example chemistry and biology, where you'll be able to note both similarities and differences.

Deliberate Practice: This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that some subjects lend themselves to deliberate practice more easily than others, the gold standard of deliberate practice generally consists of the following:

• A feedback loop in the form of a competition or a test

• A teacher who can guide you

• A uniquely tailored practice regimen

• Outside of your comfort zone

• Has a concrete goal or performance

• When practicing, you need to concentrate fully

• Builds or modifies skills

• Has a known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher: An experienced teacher can mean the difference between breaking through or breaking down. As such, it's important to consider the following when looking for a teacher:

• Keeps you up to date on specific progress

• Provides practice exercises

• Directs attention to what aspects of learning you should be paying attention to

• Helps develop correct mental representations

• Provides feedback on what errors you're making

• Gives you an aspirational model for what good performance looks like

Dynamic Testing: As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro Technique: This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes, during which you focus on one single thing. Once the timer is up, you have a 5 or 10-minute break before starting again.

16.3 Common Learning Traps

Authority Trap: The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.

Confirmation Bias: Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect: Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung: Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion: Learning something complex and thinking you understand it. For example, you may read about computing integrals thinking you understand them, then when tested you fail to get any of the answers correct. Thus, having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking: Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy/agitated/upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

• https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7

• https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn

• https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

• https://www.coursera.org/learn/learning-how-to-learn

CHAPTER 17

Communication

• Introduction

• Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 18

Product

• Introduction

• Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References

CHAPTER 19

Stakeholder Management

• Introduction

• Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

bull Agriculture

bull Art

bull Biology

bull Chemistry/Materials Science

bull Climate/Weather

bull Complex Networks

bull Computer Networks

bull Data Challenges

bull Earth Science

bull Economics

bull Education

bull Energy

bull Finance

bull GIS

bull Government

bull Healthcare

bull Image Processing

bull Machine Learning

bull Museums

bull Music

bull Natural Language

bull Neuroscience

bull Physics

bull Psychology/Cognition

bull Public Domains

bull Search Engines

bull Social Networks

bull Social Sciences

bull Software

bull Sports

bull Time Series

bull Transportation

20.1 Agriculture

bull US Department of Agriculture's PLANTS Database

bull US Department of Agriculture's Nutrient Database

20.2 Art

bull Google's Quick Draw Sketch Dataset

20.3 Biology

bull 1000 Genomes

bull American Gut (Microbiome Project)

bull Broad Bioimage Benchmark Collection (BBBC)

bull Broad Cancer Cell Line Encyclopedia (CCLE)

bull Cell Image Library

bull Complete Genomics Public Data

bull EBI ArrayExpress

bull EBI Protein Data Bank in Europe

bull Electron Microscopy Pilot Image Archive (EMPIAR)

bull ENCODE project

bull Ensembl Genomes

bull Gene Expression Omnibus (GEO)

bull Gene Ontology (GO)

bull Global Biotic Interactions (GloBI)

bull Harvard Medical School (HMS) LINCS Project

bull Human Genome Diversity Project

bull Human Microbiome Project (HMP)

bull ICOS PSP Benchmark

bull International HapMap Project

bull Journal of Cell Biology DataViewer

bull MIT Cancer Genomics Data

bull NCBI Proteins

bull NCBI Taxonomy

bull NCI Genomic Data Commons

bull NIH Microarray data or FTP (see FTP link on RAW)

bull OpenSNP genotypes data

bull Pathguide - Protein-Protein Interactions Catalog

bull Protein Data Bank

bull Psychiatric Genomics Consortium

bull PubChem Project

bull PubGene (now Coremine Medical)

bull Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

bull Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

bull Sequence Read Archive (SRA)

bull Stanford Microarray Data

bull Stowers Institute Original Data Repository

bull Systems Science of Biological Dynamics (SSBD) Database

bull The Cancer Genome Atlas (TCGA) available via Broad GDAC

bull The Catalogue of Life

bull The Personal Genome Project or PGP

bull UCSC Public Data

bull UniGene

bull Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

bull NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

bull Open Quantum Materials Database

bull Citrination Public Datasets

bull Khazana Project

20.5 Climate/Weather

bull Actuaries Climate Index

bull Australian Weather

bull Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system

bull Brazilian Weather - Historical data (In Portuguese)

bull Canadian Meteorological Centre

bull Climate Data from UEA (updated monthly)

bull European Climate Assessment & Dataset

bull Global Climate Data Since 1929

bull NASA Global Imagery Browse Services

bull NOAA Bering Sea Climate

bull NOAA Climate Datasets

bull NOAA Realtime Weather Models

bull NOAA SURFRAD Meteorology and Radiation Datasets

bull The World Bank Open Data Resources for Climate Change

bull UEA Climatic Research Unit

bull WorldClim - Global Climate Data

bull WU Historical Weather Worldwide

20.6 Complex Networks

bull AMiner Citation Network Dataset

bull CrossRef DOI URLs

bull DBLP Citation dataset

bull DIMACS Road Networks Collection

bull NBER Patent Citations

bull Network Repository with Interactive Exploratory Analysis Tools

bull NIST complex networks data collection

bull Protein-protein interaction network

bull PyPI and Maven Dependency Network

bull Scopus Citation Database

bull Small Network Data

bull Stanford GraphBase (Steven Skiena)

bull Stanford Large Network Dataset Collection

bull Stanford Longitudinal Network Data Sources

bull The Koblenz Network Collection

bull The Laboratory for Web Algorithmics (UNIMI)

bull The Nexus Network Repository

bull UCI Network Data Repository

bull UFL sparse matrix collection

bull WSU Graph Database

20.7 Computer Networks

bull 3.5B Web Pages from CommonCrawl 2012

bull 53.5B Web clicks of 100K users in Indiana Univ.

bull CAIDA Internet Datasets

bull ClueWeb09 - 1B web pages

bull ClueWeb12 - 733M web pages

bull CommonCrawl Web Data over 7 years

bull CRAWDAD Wireless datasets from Dartmouth Univ.

bull Criteo click-through data

bull OONI Open Observatory of Network Interference - Internet censorship data

bull Open Mobile Data by MobiPerf

bull Rapid7 Sonar Internet Scans

bull UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

bull Bruteforce Database

bull Challenges in Machine Learning

bull CrowdANALYTIX dataX

bull D4D Challenge of Orange

bull DrivenData Competitions for Social Good

bull ICWSM Data Challenge (since 2009)

bull Kaggle Competition Data

bull KDD Cup by Tencent 2012

bull Localytics Data Visualization Challenge

bull Netflix Prize

bull Space Apps Challenge

bull Telecom Italia Big Data Challenge

bull TravisTorrent Dataset - MSR'2017 Mining Challenge

bull Yelp Dataset Challenge

20.9 Earth Science

bull AQUASTAT - Global water resources and uses

bull BODC - marine data of ~22K vars

bull Earth Models

bull EOSDIS - NASA's earth observing system data

bull Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3

bull Marinexplore - Open Oceanographic Data

bull Smithsonian Institution Global Volcano and Eruption Database

bull USGS Earthquake Archives

20.10 Economics

bull American Economic Association (AEA)

bull EconData from UMD

bull Economic Freedom of the World Data

bull Historical MacroEconomic Statistics

bull International Economics Database and various data tools

bull International Trade Statistics

bull Internet Product Code Database

bull Joint External Debt Data Hub

bull Jon Haveman International Trade Data Links

bull OpenCorporates Database of Companies in the World

bull Our World in Data

bull SciencesPo World Trade Gravity Datasets

bull The Atlas of Economic Complexity

bull The Center for International Data

bull The Observatory of Economic Complexity

bull UN Commodity Trade Statistics

bull UN Human Development Reports

20.11 Education

bull College Scorecard Data

bull Student Data from Free Code Camp

20.12 Energy

bull AMPds

bull BLUEd

bull COMBED

bull Dataport

bull DRED

bull ECO

bull EIA

bull HES - Household Electricity Study, UK

bull HFED

bull iAWE

bull PLAID - the Plug Load Appliance Identification Dataset

bull REDD

bull Tracebase

bull UK-DALE - UK Domestic Appliance-Level Electricity

bull WHITED

20.13 Finance

bull CBOE Futures Exchange

bull Google Finance

bull Google Trends

bull NASDAQ

bull NYSE Market Data (see FTP link on RAW)

bull OANDA

bull OSU Financial data

bull Quandl

bull St. Louis Federal

bull Yahoo Finance

20.14 GIS

bull ArcGIS Open Data portal

bull Cambridge, MA, US, GIS data on GitHub

bull Factual Global Location Data

bull Geo Spatial Data from ASU

bull Geo Wiki Project - Citizen-driven Environmental Monitoring

bull GeoFabrik - OSM data extracted to a variety of formats and areas

bull GeoNames Worldwide

bull Global Administrative Areas Database (GADM)

bull Homeland Infrastructure Foundation-Level Data

bull Landsat 8 on AWS

bull List of all countries in all languages

bull National Weather Service GIS Data Portal

bull Natural Earth - vectors and rasters of the world

bull OpenAddresses

bull OpenStreetMap (OSM)

bull Pleiades - Gazetteer and graph of ancient places

bull Reverse Geocoder using OSM data & additional high-resolution data files

bull TIGER/Line - US boundaries and roads

bull TwoFishes - Foursquare's coarse geocoder

bull TZ Timezones shapefiles

bull UN Environmental Data

bull World boundaries from the US Department of State

bull World countries in multiple formats

20.15 Government

bull A list of cities and countries contributed by community

bull Open Data for Africa

bull OpenDataSoft's list of 1,600 open data

20.16 Healthcare

bull EHDP Large Health Data Sets

bull Gapminder World demographic databases

bull Medicare Coverage Database (MCD), US

bull Medicare Data Engine of medicare.gov Data

bull Medicare Data File

bull MeSH, the vocabulary thesaurus used for indexing articles for PubMed

bull Number of Ebola Cases and Deaths in Affected Countries (2014)

bull Open-ODS (structure of the UK NHS)

bull OpenPaymentsData, Healthcare financial relationship data

bull The Cancer Genome Atlas project (TCGA) and BigQuery table

bull World Health Organization Global Health Observatory

20.17 Image Processing

bull 10k US Adult Faces Database

bull 2GB of Photos of Cats or Archive version

bull Adience Unfiltered faces for gender and age classification

bull Affective Image Classification

bull Animals with attributes

bull Caltech Pedestrian Detection Benchmark

bull Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)

bull Face Recognition Benchmark

bull GDXray: X-ray images for X-ray testing and Computer Vision

bull ImageNet (in WordNet hierarchy)

bull Indoor Scene Recognition

bull International Affective Picture System, UFL

bull Massive Visual Memory Stimuli, MIT

bull MNIST database of handwritten digits, 70,000 examples

bull Several Shape-from-Silhouette Datasets

bull Stanford Dogs Dataset

bull SUN database, MIT

bull The Action Similarity Labeling (ASLAN) Challenge

bull The Oxford-IIIT Pet Dataset

bull Violent-Flows - Crowd Violence / Non-violence Database and benchmark

bull Visual genome

bull YouTube Faces Database

20.18 Machine Learning

bull Context-aware data sets from five domains

bull Delve Datasets for classification and regression (Univ. of Toronto)

bull Discogs Monthly Data

bull eBay Online Auctions (2012)

bull IMDb Database

bull Keel Repository for classification, regression, and time series

bull Labeled Faces in the Wild (LFW)

bull Lending Club Loan Data

bull Machine Learning Data Set Repository

bull Million Song Dataset

bull More Song Datasets

bull MovieLens Data Sets

bull New Yorker caption contest ratings

bull RDataMining - "R and Data Mining" ebook data

bull Registered Meteorites on Earth

bull Restaurants Health Score Data in San Francisco

bull UCI Machine Learning Repository

bull Yahoo Ratings and Classification Data

bull YouTube-8M

20.19 Museums

bull Canada Science and Technology Museums Corporation's Open Data

bull Cooper-Hewitt's Collection Database

bull Minneapolis Institute of Arts metadata

bull Natural History Museum (London) Data Portal

bull Rijksmuseum Historical Art Collection

bull Tate Collection metadata

bull The Getty vocabularies

20.20 Music

bull Nottingham Folk Songs

bull Bach 10

20.21 Natural Language

bull Automatic Keyphrase Extraction

bull Blogger Corpus

bull CLiPS Stylometry Investigation Corpus

bull ClueWeb09 FACC

bull ClueWeb12 FACC

bull DBpedia - 4.58M things with 583M facts

bull Flickr Personal Taxonomies

bull Freebase.com of people, places, and things

bull Google Books Ngrams (2.2TB)

bull Google MC-AFP, generated based on the publicly available Gigaword dataset using Paragraph Vectors

bull Google Web 5gram (1TB 2006)

bull Gutenberg eBooks List

bull Hansards text chunks of Canadian Parliament

bull Machine Comprehension Test (MCTest) of text from Microsoft Research

bull Machine Translation of European languages

bull Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

bull Multi-Domain Sentiment Dataset (version 2.0)

bull Open Multilingual Wordnet

bull Personae Corpus

bull SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

bull SMS Spam Collection in English

bull Universal Dependencies

bull USENET postings corpus of 2005~2011

bull Webhose - News/Blogs in multiple languages

bull Wikidata - Wikipedia databases

bull Wikipedia Links data - 40 Million Entities in Context

bull WordNet databases and tools

20.22 Neuroscience

bull Allen Institute Datasets

bull Brain Catalogue

bull Brainomics

bull CodeNeuro Datasets

bull Collaborative Research in Computational Neuroscience (CRCNS)

bull FCP-INDI

bull Human Connectome Project

bull NDAR

bull NeuroData

bull Neuroelectro

bull NIMH Data Archive

bull OASIS

bull OpenfMRI

bull Study Forrest

20.23 Physics

bull CERN Open Data Portal

bull Crystallography Open Database

bull NASA Exoplanet Archive

bull NSSDC (NASA) data of 550 spacecraft

bull Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

bull OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

bull Amazon

bull Archive-it from Internet Archive

bull Archive.org Datasets

bull CMU JASA data archive

bull CMU StatLab collections

bull Data.world

bull Data360

bull Datamob.org

bull Google

bull Infochimps

bull KDNuggets Data Collections

bull Microsoft Azure Data Market Free DataSets

bull Microsoft Data Science for Research

bull Numbray

bull Open Library Data Dumps

bull Reddit Datasets

bull RevolutionAnalytics Collection

bull Sample R data sets

bull Stats4Stem R data sets

bull StatSci.org

bull The Washington Post List

bull UCLA SOCR data collection

bull UFO Reports

bull Wikileaks 9/11 pager intercepts

bull Yahoo Webscope

20.26 Search Engines

bull Academic Torrents of data sharing from UMB

bull Datahub.io

bull DataMarket (Qlik)

bull Harvard Dataverse Network of scientific data

bull ICPSR (UMICH)

bull Institute of Education Sciences

bull National Technical Reports Library

bull Open Data Certificates (beta)

bull OpenDataNetwork - A search engine of all Socrata powered data portals

bull Statista.com - Statistics and Studies

bull Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

bull 72 hours gamergate Twitter Scrape

bull Ancestry.com Forum Dataset over 10 years

bull Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

bull CMU Enron Email of 150 users

bull EDRM Enron EMail of 151 users, hosted on S3

bull Facebook Data Scrape (2005)

bull Facebook Social Networks from LAW (since 2007)

bull Foursquare from UMN/Sarwat (2013)

bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull YouTube Video Social Graph in 2007, 2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive

bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (ND-GAIN)

bull Open Crime and Policing Data in England, Wales, and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1 from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database

bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams, and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks, port calls, and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport, airline, and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome machine learning.

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python

bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF)

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit. HTK is a portable toolkit for building and manipulating hidden Markov models.

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java, and MATLAB interfaces and supports Windows, Linux, Android, and Mac OS

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality, with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph

bull CUDA - This is a fast C++/CUDA implementation of convolutional [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework with less than 1000 lines of core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state-of-the-art machine learning easy to work with and integrate into existing applications.

bull Distributed Machine learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that change for every training instance. Written in C++ with bindings in Python.

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and allows data to be processed in batch, online, and distributed modes.

bull LightGBM - A framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs.

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization, and storage.

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP.

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ for segmenting/labeling sequential data & other Natural Language Processing tasks

bull CRFsuite for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user-defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit. GRT is a cross-platform, open-source C++ machine learning library designed for real-time gesture recognition.

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression, and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based, R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression

General-Purpose Machine Learning

bull gago - Multi-population, flexible, parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering, and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Cortical.io as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System

bull Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools to social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK components in Java and is built on top of Apache UIMA

bull Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license.

bull CogcompNLP developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark, or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka, with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams, with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - Statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML - A machine learning (ML) language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application

Speech Recognition

bull CMU Sphinx - Open source toolkit for speech recognition; a speech recognition library written purely in Java

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in JavaScript and CoffeeScript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3d plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - Customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - Library written in vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnetjs - ConvNetJS is a JavaScript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in JavaScript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in JavaScript for Node.js and the browser

bull Decision Trees - Node.js implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means and agglomerative clustering

bull Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js

bull Kmeans.js - Simple JavaScript implementation of the k-means algorithm, for Node.js and the browser

bull LDA.js - LDA topic modeling for Node.js

bull Learning.js - JavaScript implementation of logistic regression/c4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration - just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for Node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for Node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has advantages on large datasets and multi-threaded training

bull kalman - Kalman filter for JavaScript

bull shaman - Node.js library with support for both simple and multiple linear regression

bull mljs - Machine learning and numerical analysis tools for Node.js and the browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

bull simple-statistics - Statistics library for JavaScript that works in browsers as well as in Node.js

bull regression-js - A JavaScript library containing a collection of least squares fitting methods for finding a trend in a set of data

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance
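
The great circle distance mentioned in the last entry can be illustrated with the haversine formula. The following is a minimal Python sketch and does not show the API of the GreatCircle library; the function name great_circle_km and the Earth radius constant are assumptions made for this illustration.

```python
import math

# Mean Earth radius in kilometres (an assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Haversine distance between two (latitude, longitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error for near-antipodal points.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))
```

The formula treats the Earth as a perfect sphere, so results are approximate.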

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression, so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models - A Julia package for fitting mixed-effects models

bull Simple MCMC - Basic MCMC sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba - Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata SAS and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy.

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version.

bull graph - Graph package for Torch

bull randomkit - Numpy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, STFT

bull nn - Neural Network package for Torch

bull torchnet - Framework for Torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for the nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns.

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua, and then optimizing them with OpenGM.

bull sphagetti - Spaghetti (sparse linear) module for torch7 by MichaelMathieu

bull LuaSHKit - A Lua wrapper around the locality sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform, designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - A technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning w/ PDF, lectures, code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB.

Data Analysis / Data Visualization

bull matlab_bgl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.

bull DiffSharp - An automatic differentiation (AD) library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and every day use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.

bull Sho - An interactive environment for data analysis and scientific computing, designed to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples by trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift.

bull BPN-NeuralNetwork - A back propagation neural network (BPN) that can be used in product recommendation, user behavior analysis, data mining and data analysis

bull Multi-Perceptron-NeuralNetwork - A multi-perceptron neural network designed to support an unlimited number of hidden layers

bull KRHebbian-Algorithm - A Hebbian learning algorithm for neural networks

bull KRKmeans-Algorithm - An implementation of K-Means, the clustering and classification algorithm. It can be used in data mining and image compression.

bull KRFuzzyCMeans-Algorithm - An implementation of Fuzzy C-Means, the fuzzy clustering and classification algorithm. It can be used in data mining and image compression.

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression
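
Building predictions with a linear regression, as the last entry describes, can be sketched with ordinary least squares on a single input variable. This is a minimal Python illustration and does not show PredictionBuilder's API; the names fit_line and predict are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate a fitted (slope, intercept) pair at x."""
    slope, intercept = model
    return slope * x + intercept
```

A typical use is model = fit_line(xs, ys) followed by predict(model, new_x) for an unseen input.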

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python; runs on Mac, Windows, and Ubuntu Linux

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing, machine learning, among others

bull Quepy - A Python framework to transform natural language questions to queries in a database query language

bull TextBlob - A consistent API for common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email spam filtering built on top of NLTK

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments

bull python-ucto

bull python-frog

bull python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - Higher-level NLP built on spaCy

bull stanford-corenlp-python - Python wrapper for Stanford CoreNLP
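
Several entries in this list (Distance, editdistance, jellyfish) compute string distances. As a rough illustration of the two named measures, Levenshtein and Hamming distance, here is a minimal pure-Python sketch. It does not reproduce the API of any listed library, and the function names hamming and levenshtein are assumptions made for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(ca != cb for ca, cb in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

Keeping only two rows of the dynamic programming table bounds memory by the length of the shorter string; the listed libraries are likely faster because they are implemented in compiled code.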

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code, and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning.

bull machine learning - Automated build consisting of a web-interface and a programmatic-interface; generated models are stored into a NoSQL datastore

bull XGBoost - Python bindings for the eXtreme Gradient Boosting library

bull Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library.

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A library with various machine learning models implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 - A Machine Learning library based on Theano

bull keras - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks - Implementation of image to image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under-sampling and over-sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing GPU-meta-programming code generating array oriented optimizing math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python, implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit.

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for my book ldquoNeural Networks and Deep Learningrdquo [DEEPLEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programmingConsider it your personal data science assistant automating a tedious part of machine learning

bull pgmpy A python library for working with Probabilistic Graphical Models

bull DIGITS is a web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight Portable Flexible DistributedMobile Deep Learning with Dynamic Mutation-awareDataflow Dep Scheduler for Python R Julia Go Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - an IPython-based environment for conducting data-driven research in a consistent and reproducible wayREP is not trying to substitute scikit-learn but extends it and provides better user experience

bull rgf_python Library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more; it's easy to use and easy to learn for beginners
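
Optunity and TPOT above both automate the search over model settings. As a minimal sketch of that idea, and not of either library's API, the following standard-library Python compares an exhaustive grid search with a random search over the same ranges. The objective `validation_loss`, its two parameters and the search ranges are hypothetical stand-ins for training and scoring a real model.

```python
import itertools
import random


def validation_loss(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def grid_search(rates, depths):
    """Evaluate every combination of the listed values and keep the best one."""
    candidates = itertools.product(rates, depths)
    return min(candidates, key=lambda c: validation_loss(*c))


def random_search(n_trials, seed=0):
    """Evaluate n_trials points drawn at random from the same ranges."""
    rng = random.Random(seed)
    candidates = [
        (rng.uniform(0.001, 1.0), rng.randint(1, 12)) for _ in range(n_trials)
    ]
    return min(candidates, key=lambda c: validation_loss(*c))


# Both searches spend the same budget of 12 evaluations.
grid_best = grid_search([0.001, 0.01, 0.5, 1.0], [2, 4, 7])
random_best = random_search(n_trials=12)
print("grid search best:", grid_best, validation_loss(*grid_best))
print("random search best:", random_best, validation_loss(*random_best))
```

Random search is not tied to the grid's fixed values, but whether it finds a better point than the grid depends on the seed and on the objective.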

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Python JIT (just-in-time) compiler to LLVM, aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion, based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual, intuitive, and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has a Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in Python, without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspell - raspell is an interface binding for Ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists, and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for a Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community-based data collection, packed in a gem. Get a list of pretty much anything (stop words, countries, non words) in txt, json or hash

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license

bull rustlearn - a machine learning framework featuring logistic regression, support vector machines, decision trees and random forests

bull rusty-machine - a pure-rust machine learning library

bull leaf - open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz: Regularization for semiparametric additive hazards regression

bull arules - arules: Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso: Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm: Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta: A wrapper algorithm for all-relevant feature selection

bull bst - bst: Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training: Unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn: Classification, regression, feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist: Rule- and Instance-Based Regression Modeling

bull e1071 - e1071: Misc Functions of the Department of Statistics, TU Wien

bull earth - earth: Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree: Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc: Flexible procedures for clustering

bull frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS

bull gbm - gbm: Generalized Boosted Regression Models

bull glmnet - glmnet: Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso: Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg: Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deep learning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda: Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred: Improved Predictors

bull kernlab - kernlab: Kernel-based Machine Learning Lab

bull klaR - klaR: Classification and visualization

bull lars - lars: Least Angle Regression, Lasso and Forward Stagewise

bull lasso2 - lasso2: L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg: Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree: Mapping, pruning, and graphing tree models

bull mboost - mboost: Model-Based Boosting

bull medley - medley: Blending regression models, using a greedy stepwise approach

bull mlr - mlr: Machine Learning in R

bull mvpart - mvpart: Multivariate partitioning

bull ncvreg - ncvreg: Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree: Oblique Trees for Classification Data

bull pamr - pamr: Pam: prediction analysis for microarrays

bull party - party: A Laboratory for Recursive Partytioning

bull partykit - partykit: A Toolkit for Recursive Partytioning

bull penalized - penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA: Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM: Feature Selection SVM using penalty functions

bull quantregForest - quantregForest: Quantile Regression Forests

bull randomForest - randomForest: Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle: Graphical user interface for data mining in R

bull rda - rda: Shrunken Centroids Regularized Discriminant Analysis

bull rdetools - rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces

bull REEMtree - REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data

bull relaxo - relaxo: Relaxed Lasso

bull rgenoud - rgenoud: R version of GENetic Optimization Using Derivatives

bull rgp - rgp: R genetic programming framework

bull Rmalschains - Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R

bull rminer - rminer: Simpler use of data mining methods (e.g. NN and SVM) in classification and regression

bull ROCR - ROCR: Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets: Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart: Recursive Partitioning and Regression Trees

bull RPMM - RPMM: Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka: R/Weka interface

bull RXshrink - RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda: Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA: Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath: the SVM Path algorithm

bull tgp - tgp: Bayesian treed Gaussian process models

bull tree - tree: Classification and regression trees

bull varSelRF - varSelRF: Variable selection using random forests

bull XGBoost.R - R binding for the eXtreme Gradient Boosting (Tree) Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R.

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis / Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customersegments using a GUI

Data Analysis / Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE! Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment, including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.

Data Analysis / Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data. A unix filter for algebird-powered aggregation

bull PredictionIO - PredictionIO, a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe - Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

21.25 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems.

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression, and Ridge Regression.

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc.

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization / Training Techniques

ndash Unsupervised / Generative Models

ndash Image Segmentation / Object Detection

ndash Image / Video

ndash Natural Language Processing

ndash Speech / Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

22.1 Machine Learning

Be the first to contribute

22.2 Deep Learning

Forked from terryum's awesome deep learning papers

22.2.1 Understanding

bull Distilling the knowledge in a neural network (2015) G Hinton et al [pdf]

bull Deep neural networks are easily fooled: High confidence predictions for unrecognizable images (2015) A Nguyen et al [pdf]

bull How transferable are features in deep neural networks? (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf: An astounding baseline for recognition (2014) A Razavian et al [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014) M Oquab et al [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf: A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

22.2.2 Optimization / Training Techniques

bull Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015) S Ioffe and C Szegedy [pdf]

bull Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015) K He et al [pdf]

bull Dropout: A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam: A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]
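
The update rule from the Adam paper listed above can be sketched in a few lines of plain Python. This is an illustrative, scalar-only version under stated assumptions: the helper name `adam_minimize`, the toy objective f(x) = (x - 3)^2 and the step size of 0.1 are chosen for this example (the paper's suggested default step size is 0.001), while beta1 = 0.9, beta2 = 0.999 and eps = 1e-8 follow the paper's suggested defaults.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a scalar function given its gradient, using the Adam update rule."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g       # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g   # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective (assumption for illustration): f(x) = (x - 3)^2, gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print("estimated minimiser:", x_min)
```

Because each step is scaled by the running gradient statistics, early steps have roughly constant size `lr` regardless of the gradient's magnitude.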

22.2.3 Unsupervised / Generative Models

bull Pixel recurrent neural networks (2016) A Oord et al [pdf]

bull Improved techniques for training GANs (2016) T Salimans et al [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015) A Radford et al [pdf]

bull DRAW: A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

22.2.4 Image Segmentation / Object Detection

bull You only look once: Unified, real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

22.2.5 Image / Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show, attend and tell: Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell: A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA: Visual question answering (2015) S Antol et al [pdf]

bull DeepFace: Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose: Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

22.2.6 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull GloVe: Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

22.2.7 Speech / Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2: End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

22.2.8 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

22.2.9 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice: Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet: Representation of the pixels, by the pixels, and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization: Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22.2.10 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning? (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009) H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses and more, forked from josephmisiti's awesome machine learning

bull Blogs

ndash Data Science

ndash Machine learning

ndash Math

bull Books

ndash Machine learning

ndash Deep learning

ndash Probability & Statistics

ndash Linear Algebra

bull Courses

bull Podcasts

bull Tutorials

23.1 Blogs

23.1.1 Data Science

bull https://jeremykun.com

bull http://iamtrask.github.io

bull http://blog.explainmydata.com

bull http://andrewgelman.com

bull http://simplystatistics.org

bull http://www.evanmiller.org

bull http://jakevdp.github.io

bull http://blog.yhat.com

bull http://wesmckinney.com

bull http://www.overkillanalytics.net

bull http://newton.cx/~peter

bull http://mbakker7.github.io/exploratory_computing_with_python

bull https://sebastianraschka.com/blog/index.html

bull http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

bull http://colah.github.io

bull http://www.thomasdimson.com

bull http://blog.smellthedata.com

bull https://sebastianraschka.com

bull http://dogdogfish.com

bull http://www.johnmyleswhite.com

bull http://drewconway.com/zia

bull http://bugra.github.io

bull http://opendata.cern.ch

bull https://alexanderetz.com

bull http://www.sumsar.net

bull https://www.countbayesie.com

bull http://blog.kaggle.com

bull http://www.danvk.org

bull http://hunch.net

bull http://www.randalolson.com/blog

bull https://www.johndcook.com/blog/r_language_for_programmers

bull http://www.dataschool.io

23.1.2 Machine learning

bull OpenAI

bull Distill

bull Andrej Karpathy Blog

bull Colahrsquos Blog

bull WildML

bull FastML

bull TheMorningPaper

23.1.3 Math

bull http://www.sumsar.net

bull http://allendowney.blogspot.ca

bull https://healthyalgorithms.com

bull https://petewarden.com

bull http://mrtz.org/blog

23.2 Books

23.2.1 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory, Inference, and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and SVN Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting: principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch

bull Bayesian Modeling, Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning, Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning - Book + MatlabToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

23.2.2 Deep learning

• Deep Learning - An MIT Press book

• Coursera Course Book on NLP

• NLTK

• NLP w/ Python

• Foundations of Statistical Natural Language Processing

• An Introduction to Information Retrieval

• A Brief Introduction to Neural Networks

• Neural Networks and Deep Learning

23.2.3 Probability & Statistics

• Think Stats - Book + Python Code

• From Algorithms to Z-Scores - Book

• The Art of R Programming

• Introduction to statistical thought

• Basic Probability Theory

• Introduction to probability - By Dartmouth College

• Principle of Uncertainty

• Probability & Statistics Cookbook

• Advanced Data Analysis From An Elementary Point of View

• Introduction to Probability - Book and course by MIT

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction - Book

• An Introduction to Statistical Learning with Applications in R - Book

• Learning Statistics Using R

• Introduction to Probability and Statistics Using R - Book

• Advanced R Programming - Book

• Practical Regression and Anova using R - Book

• R practicals - Book

• The R Inferno - Book

23.2.4 Linear Algebra

• Linear Algebra Done Wrong

• Linear Algebra, Theory, and Applications

• Convex Optimization

• Applied Numerical Computing

• Applied Numerical Linear Algebra

23.3 Courses

• CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University

• CS224d: Deep Learning for Natural Language Processing, Stanford University

• Oxford Deep NLP 2017: Deep Learning for Natural Language Processing, University of Oxford

• Artificial Intelligence (Columbia University) - free

• Machine Learning (Columbia University) - free

• Machine Learning (Stanford University) - free

• Neural Networks for Machine Learning (University of Toronto) - free

• Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach; Machine Learning: Regression; Machine Learning: Classification; Machine Learning: Clustering & Retrieval; Machine Learning: Recommender Systems & Dimensionality Reduction; Machine Learning Capstone: An Intelligent Application with Deep Learning - free

• Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

• Learning from Data (by Yaser S. Abu-Mostafa, Caltech) - Lecture videos available

23.4 Podcasts

• The O’Reilly Data Show

• Partially Derivative

• The Talking Machines

• The Data Skeptic

• Linear Digressions

• Data Stories

• Learning Machines 101

• Not So Standard Deviations

• TWIMLAI

23.5 Tutorials

Be the first to contribute.

CHAPTER 24

Contribute

Become a contributor. Check out our GitHub for more information.


Data Science Primer Documentation

The primary purpose of this primer is to give a cursory overview of both technical and non-technical topics associated with Data Science. Typically, educational resources focus on technical topics. However, in reality, mastering the non-technical topics can lead to even greater dividends in your professional capacity as a data scientist.

Since the task of a Data Scientist at any company can be vastly different (Inference, Analytics, and Algorithms according to Airbnb), there’s no real roadmap to using this primer. Feel free to jump around to topics that interest you and which may align closer with your career trajectory.

Ultimately, this resource is for anyone who would like to get a quick overview of a topic that a data scientist may be expected to have proficiency in. It’s not meant to serve as a replacement for fundamental courses that contribute to a Data Science education, nor as a way to gain mastery in a topic.

Warning: This document is under early stage development. If you find errors, please raise an issue or contribute a better definition.

CHAPTER 1

Bash

• Introduction

• Subj_1

1.1 Introduction

xx

1.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 2

Big Data

• Introduction

• Subj_1

https://www.quora.com/What-is-a-Hadoop-ecosystem

2.1 Introduction

The tooling and methodologies to deal with the phenomenon of big data were originally created in response to the exponential growth in data generation (TODO: INSERT quote about data generation). When talking about big data, we mean (insert what we mean).

We’ll be focusing exclusively on the Apache Hadoop Ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims.

Why It’s Important

While a data scientist may not be directly involved in managing big data and performing Extract, Transform, and Load tasks, they’ll likely be responsible, at a minimum, for querying sources where this data is stored. Ideally, a data scientist should be familiar with how this data is stored, how to query it, and how to transform it (e.g. Spark, Hadoop, and Hive would be three key technologies to be familiar with).

2.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 3

Databases

• Introduction

• Relational

  – Types

  – Advantages

  – Disadvantages

• Non-relational/NoSQL

  – Types

  – Advantages

  – Disadvantages

• Terminology

3.1 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing, and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds of databases are a variety of other database sub-types, which the following sections will explore in detail.

Why It’s Important

Oftentimes a data scientist will have to work with a database in order to retrieve the data they need to build their models or to do data analysis. Thus, it’s important that a data scientist familiarize themselves with the different types of databases and how to query them.

3.2 Relational

A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes) with rows that represent the data within the table. You can think of the column as the header indicating the name of the data being stored, and the row as the actual data points that are stored in the database. Oftentimes these rows will have a unique identifier associated with them, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.
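As a minimal sketch of primary and foreign keys, the example below uses Python's built-in sqlite3 module with an in-memory database. The customers and orders tables, their columns, and the sample rows are made up for illustration and are not part of the primer.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "  # foreign key to customers
    "amount REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)")

# Join the two tables through the foreign key and total the orders per customer.
rows = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(rows)
```

Each row in orders points back to exactly one customer through customer_id, which is what lets the JOIN line the two tables up.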

To get data out of a relational database you would use SQL, with different relational databases using slightly different versions of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a Transaction. Transactions in relational databases represent a unit of work performed against the database.

An important concept with transactions in relational databases is maintaining ACID, which stands for:

• Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails, then the whole transaction fails, leaving the database unchanged.

• Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.

• Isolation: Since transactions are often executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.

• Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
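Atomicity can be illustrated with a small sketch, again assuming Python's built-in sqlite3 module. The accounts table, the balances, and the failing transfer are hypothetical; the point is only that when the second statement fails, the first one is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
        # This statement violates the CHECK constraint and raises IntegrityError.
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # the whole transfer is undone, including the credit to bob

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

After the failed transfer both balances are unchanged, which is the "all or nothing" behaviour the Atomicity bullet describes.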

3.2.1 Types

While there are many different kinds of relational databases, the most popular are:

• PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.

• Oracle: A multi-model database management system that is produced and marketed by Oracle.

• MySQL: An open-source relational database management system with a proprietary paid version available for additional functionality.

• Microsoft SQL Server: A relational database management system developed by Microsoft.

• MariaDB: A MySQL-compatible database engine forked from MySQL.

3.2.2 Advantages

• SQL standards are well defined and commonly accepted.

• Easy to categorize and store data that can later be queried.

• Simple to understand, since the table structure and relationships are intuitive to most users.

• Data integrity through strong data typing and validity checks to make sure that data falls within an acceptable range.

3.2.3 Disadvantages

• Poor performance with unstructured data types due to schema and type constraints.

• Can be slow and not scalable compared to NoSQL.

• Unable to map certain kinds of data, such as graphs.

3.3 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Additionally, dealing with scalability, multi-structured data, and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

3.3.1 Types

• Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to store user session data; examples include Redis and Amazon Dynamo.

• Document Store: Similar to a key-value store; however, in this case the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.

• Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. Column stores can be used in content management systems; examples include Cassandra and Apache HBase.

• Graph Store: These are based on the Entity - Attribute - Value model. Entities will have associated attributes and subsequent values when data is inserted. Nodes will store data about each entity along with the relationships between nodes. Graph stores can be used in applications such as social networks; examples include Neo4j and ArangoDB.
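The difference between the first two store types can be sketched with plain Python dictionaries. These are toy in-memory stand-ins, not the APIs of Redis, MongoDB, or any other real product, and the keys and documents are invented for illustration.

```python
# Toy stand-ins only; real stores have their own client libraries and APIs.
key_value_store = {}
key_value_store["session:42"] = "user=ada;expires=3600"  # opaque value, looked up by key

document_store = {}
document_store["post:1"] = {  # the value is a structured document
    "title": "Hello NoSQL",
    "tags": ["databases", "nosql"],
    "author": {"name": "Ada"},
}

# A key-value lookup returns the whole value as-is.
session = key_value_store["session:42"]

# A document store can additionally filter on fields inside the document.
tagged = [key for key, doc in document_store.items() if "nosql" in doc["tags"]]
print(session, tagged)
```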

3.3.2 Advantages

• High availability.

• Schema-free or schema-on-read options.

• Ability to rapidly prototype applications.

• Elastic scalability.

• Can store massive amounts of data.

3.3.3 Disadvantages

• Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync.

• Less support and maturity in the NoSQL ecosystem.

3.4 Terminology

Query: A query can be thought of as a single action that is taken on a database.

Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as data stored. The two main types are vertical and horizontal scalability. Vertical scalability is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc. Horizontal scalability has to do with adding more machines and splitting the work amongst them.

Normalization: This is a technique of organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique of organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
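One common way to split work amongst machines, as described under horizontal scalability, is to route each key to a machine by hashing it. The sketch below assumes four machines and made-up keys; the function name shard_for is hypothetical, and real databases use more elaborate schemes.

```python
import hashlib

def shard_for(key: str, num_machines: int) -> int:
    """Map a key to one of num_machines using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines

# Spread 1000 hypothetical user keys across 4 machines and count the load on each.
keys = [f"user:{i}" for i in range(1000)]
counts = [0] * 4
for key in keys:
    counts[shard_for(key, 4)] += 1
print(counts)
```

The same key always maps to the same machine, so reads can find what writes stored, while the load is spread roughly evenly.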

References

• https://dzone.com/articles/the-types-of-modern-databases

• https://www.mongodb.com/nosql-explained

• https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect

• https://opensourceforu.com/2017/05/different-types-nosql-databases

CHAPTER 4

Data Engineering

• Introduction

• Subj_1

4.1 Introduction

xxx

4.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 5

Data Wrangling

• Introduction

• Subj_1

5.1 Introduction

xxx

5.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 6

Data Visualization

• Introduction

• Subj_1

6.1 Introduction

xxx

6.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 7

Deep Learning

• Introduction

• Subj_1

7.1 Introduction

xxx

7.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 8

Machine Learning

• Introduction

• Subj_1

8.1 Introduction

xxx

8.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 9

Python

• Introduction

• Subj_1

9.1 Introduction

xxx

9.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References


CHAPTER 11

SQL

• Introduction

• Subj_1

11.1 Introduction

xxx

11.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 12

Glossary

Definitions of common machine learning terms.

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

• Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in low average difference.

• High bias (with low variance) suggests your model may be underfitting and you’re using the wrong architecture for the job.

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn’t matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold: The lowest probability value at which we’re comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise return False.
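A minimal sketch of applying the classification threshold from the definition above. The function name classify and the probabilities are made up for illustration; the 0.5 default mirrors the "> 50%" example.

```python
def classify(probability: float, threshold: float = 0.5) -> bool:
    """Return True when the predicted probability exceeds the threshold."""
    return probability > threshold

predicted_probabilities = [0.10, 0.50, 0.51, 0.90]  # hypothetical model outputs
labels = [classify(p) for p in predicted_probabilities]
print(labels)  # [False, False, True, True]
```

Raising the threshold makes the model more conservative about returning True; lowering it does the opposite.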

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories.

• True Positives: we correctly predicted they do have diabetes.

• True Negatives: we correctly predicted they don’t have diabetes.

• False Positives: we incorrectly predicted they do have diabetes (Type I error).

• False Negatives: we incorrectly predicted they don’t have diabetes (Type II error).
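The four categories can be counted directly from paired labels. This is a small sketch with invented data; the helper name confusion_counts is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1  # predicted positive, actually positive
        elif not p and not a:
            counts["TN"] += 1  # predicted negative, actually negative
        elif p and not a:
            counts["FP"] += 1  # predicted positive, actually negative (Type I)
        else:
            counts["FN"] += 1  # predicted negative, actually positive (Type II)
    return counts

actual = [True, True, False, False, True, False]      # made-up ground truth
predicted = [True, False, False, True, True, False]   # made-up predictions
matrix = confusion_counts(actual, predicted)
print(matrix)
```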

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep learning grew out of a machine learning algorithm called the perceptron, or multi-layer perceptron, and is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning has been around for decades, but today we have far more data and cheap computing power, which make it powerful enough to achieve state-of-the-art accuracy. The algorithm is also known as an artificial neural network. Deep learning is much more than a traditional artificial neural network, but it was highly influenced by machine learning’s neural networks and perceptron networks.

Dimension: Dimension in machine learning and data science differs from its meaning in physics. Here, the dimension of data means how many features you have in your data set. For example, in an object detection application, the flattened image size and color channels (e.g. 28 × 28 × 3) are the features of the input set. In house price prediction, if house size is (maybe) the only feature in the data set, we call it 1-dimensional data.

Epoch: One epoch is one complete pass of the algorithm over the entire data set. The number of epochs is the number of times the algorithm sees the entire data set.

Extrapolation: Making predictions outside the range of a dataset. E.g. My dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. “Color is blue” is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection: Feature selection is the process of selecting relevant features from a data set for creating a Machine Learning model.

Feature Vector: A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters: Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of a model. The depth of trees in a Decision Tree or the number of hidden layers in a Neural Network are examples of hyperparameters.

Induction: A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. We keep observing X, so we infer that Y must be True.

Instance: A data point, row, or sample in a dataset. Another term for observation.

Learning Rate: The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
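The trade-off described under Learning Rate can be seen in a toy sketch that minimizes f(x) = x² with plain gradient descent. The function, starting point, step count, and the three learning rates are all assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=50, start=10.0):
    """Minimize f(x) = x**2, whose gradient is 2*x, from a fixed starting point."""
    x = start
    for _ in range(steps):
        gradient = 2 * x
        x -= learning_rate * gradient  # step against the gradient
    return x

for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With these settings, the very low rate has barely moved from the start after 50 steps, the moderate rate ends very close to the minimum at 0, and the high rate overshoots further on every step and moves away from the minimum.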

Loss: Loss = true_value (from data set) - predicted value (from ML model). The lower the loss, the better a model (unless the model has over-fitted to the training data). The loss is calculated on training and validation, and its interpretation is how well the model is doing for these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in training or validation sets.

Machine Learning: Contribute a definition.

Model: A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks: Contribute a definition.

Normalization: Contribute a definition.

Null Accuracy: Baseline accuracy that can be achieved by always predicting the most frequent class (“B has the highest frequency, so let’s guess B every time”).
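Null accuracy is easy to compute from the label counts alone. A minimal sketch with made-up class labels:

```python
from collections import Counter

labels = ["B", "A", "B", "C", "B", "A", "B", "B"]  # hypothetical class labels

# Always guessing the most frequent class is right exactly as often as it occurs.
most_common_class, count = Counter(labels).most_common(1)[0]
null_accuracy = count / len(labels)
print(most_common_class, null_accuracy)  # B 0.625
```

A model is only adding value if its accuracy beats this baseline.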

Observation: A data point, row, or sample in a dataset. Another term for instance.

Overfitting: Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set, but poorly on your test set (or new real-world data).

Parameters: Be the first to contribute.

Precision: In the context of binary classification (Yes/No), precision measures the model’s performance at classifying positive observations (i.e. “Yes”). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}}$$

Recall: Also called sensitivity. In the context of binary classification (Yes/No), recall measures how “sensitive” the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we “catch”? We could game this metric by always classifying observations as positive.

$$R = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}}$$

Recall vs Precision: Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the % of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren’t helpful, however, because we missed 10 brain tumors. We were super precise when we tried, but we didn’t try hard enough.

• Recall, or Sensitivity, provides another lens with which to view how good our model is. Again, let’s say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
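The brain-scan numbers can be checked with a short sketch. It uses the counts from the Recall bullet (100 images, 10 with tumors, a single True guess that was correct); the function names are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were caught."""
    return tp / (tp + fn)

# 100 images, 10 tumors, one True guess and it was right:
tp, fp, fn, tn = 1, 0, 9, 90

print(precision(tp, fp))  # 1.0
print(recall(tp, fn))     # 0.1
```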

Regression: Predicting a continuous output (e.g. price, sales).

Regularization: Contribute a definition.

Reinforcement Learning: Training a model to maximize a reward via iterative trial and error.

Segmentation: Contribute a definition.

Specificity: In the context of binary classification (Yes/No), specificity measures the model’s performance at classifying negative observations (i.e. “No”). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{\text{TrueNegatives}}{\text{TrueNegatives} + \text{FalsePositives}}$$
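A small sketch of how specificity can be gamed by predicting everything as negative. The data (10 positives and 90 negatives) is invented for illustration.

```python
actual = [True] * 10 + [False] * 90   # hypothetical labels: 10 positives, 90 negatives
predicted = [False] * 100             # "game" the metric: always predict negative

tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
tp = sum(1 for a, p in zip(actual, predicted) if a and p)
fn = sum(1 for a, p in zip(actual, predicted) if a and not p)

specificity = tn / (tn + fp)
recall = tp / (tp + fn)
print(specificity, recall)  # 1.0 0.0
```

Specificity is perfect, yet the model never finds a single positive, which is why specificity is normally read alongside recall.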

Supervised Learning: Training a model using a labeled dataset.

Test Set: A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set: A set of observations used to generate machine learning models.

Transfer Learning: Contribute a definition.

Type 1 Error: False Positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and they hire him, but he is actually bad.

Type 2 Error: False Negatives. The candidate was great, but the company passed on him.

Underfitting: Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem: A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can’t expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning: Training a model to find patterns in an unlabeled dataset (e.g. clustering).

Validation Set: A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.
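One way to act on the "pause training" advice is a simple patience rule: stop once the validation loss has failed to improve for a few epochs in a row. The sketch below uses made-up per-epoch losses and an assumed patience of 2; it is an illustration, not a method prescribed by this primer.

```python
validation_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]  # made-up values
patience = 2  # assumed: how many non-improving epochs we tolerate

best_loss, best_epoch, waited = float("inf"), -1, 0
stopped_at = None
for epoch, loss in enumerate(validation_losses):
    if loss < best_loss:
        best_loss, best_epoch, waited = loss, epoch, 0  # new best, reset the counter
    else:
        waited += 1
        if waited >= patience:
            stopped_at = epoch  # validation error keeps rising: stop here
            break

print(best_epoch, stopped_at)  # 3 5
```

In practice you would keep the parameters saved at best_epoch rather than the ones from the epoch where training stopped.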

Variance: How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.

References

CHAPTER 13

Business

• Introduction

• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 14

Ethics

• Introduction

• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

References

CHAPTER 15

Mastering The Data Science Interview

• Introduction

• The Coding Challenge

• The HR Screen

• The Technical Call

• The Take Home Project

• The Onsite

• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that Data Science will be the sexiest job of the 21st Century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can vary hugely. This is partly because, as an industry, there still isn’t an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it’s important to determine what flavor of data science appeals to you. Based on your response to that, what you study and what questions you’ll be asked will vary. Despite the differences in the types, generally speaking they’ll follow a similar interview loop, although the particular questions asked may vary. In this article, we’ll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you’re looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple FizzBuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.
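For reference, the simple end of that range might look like the sketch below. It assumes the usual FizzBuzz rules (multiples of 3 become "Fizz", multiples of 5 become "Buzz", multiples of both become "FizzBuzz"); an actual challenge may word the task differently.

```python
def fizzbuzz(n: int) -> str:
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

print([fizzbuzz(i) for i in range(1, 16)])
```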

More often than not, you’ll be provided with written test cases that will tell you if you’ve passed or failed a question. This will typically consider both correctness and complexity (i.e. how long did it take to run your code). If you’re not provided with tests, it’s a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you’ll be tested on.
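If no tests are provided, a handful of assertions plus a rough timing check covers both correctness and speed. The sketch below uses a hypothetical running_total function as the solution under test; the cases and the input size are assumptions.

```python
import time

def running_total(values):
    """Return the cumulative sums of values (the hypothetical solution under test)."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

# Correctness: small hand-checked cases, including edge cases.
assert running_total([]) == []
assert running_total([5]) == [5]
assert running_total([1, 2, 3]) == [1, 3, 6]

# Complexity: time the solution on a larger input to spot accidental slowness.
start = time.perf_counter()
result = running_total(range(1_000_000))
elapsed = time.perf_counter() - start
print(f"1,000,000 items in {elapsed:.3f}s")
```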

When you’re doing a coding challenge, it’s important to keep in mind that companies aren’t always looking for the ‘correct’ solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don’t take it personally when, even after passing all the test cases, you didn’t get to the next stage in the interview process.

Preparation:

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips:

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first; when you hit a snag, move to the simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you’re done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It’s okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5–10 hours to complete. Unless you’re desperate, you can always walk away and spend your time preparing for the next interview.

153 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it is in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google Doc. Thus you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving dynamic programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
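As a rough illustration of the brute-force-then-optimize flow, the sketch below uses a made-up warm-up problem: find two positions in a list whose values add up to a target. The problem, the function names, and the sample input are assumptions chosen for illustration, not questions taken from any particular company.

```python
from typing import Optional, Sequence, Tuple


def pair_with_sum_brute(nums: Sequence[int], target: int) -> Optional[Tuple[int, int]]:
    """First pass: check every pair. O(n^2) time, O(1) extra space."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None


def pair_with_sum_fast(nums: Sequence[int], target: int) -> Optional[Tuple[int, int]]:
    """Second pass: remember values already seen. O(n) time, O(n) extra space."""
    seen = {}  # value -> earliest index at which it appeared
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:  # explicit None check, because index 0 is falsy
            return (i, j)
        seen.setdefault(value, j)
    return None


if __name__ == "__main__":
    sample = [2, 7, 11, 15]  # hypothetical input
    print(pair_with_sum_brute(sample, 9))
    print(pair_with_sum_fast(sample, 9))
```

Both functions return some valid pair of indices, or None when no pair exists; when several pairs qualify they may not pick the same one, which is exactly the kind of detail worth clarifying with the interviewer before coding.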

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking the Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied with what you'll be doing once you start working. They can occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g., "Here's a dataset, find some insights, and pitch to business stakeholders") or focus on a more concrete deliverable (e.g., "Here's some data, build a classifier").

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis. Luckily there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g., a power law distribution with customer revenue).
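One quick way to spot the kind of heavy-tailed revenue pattern mentioned above is to compare the mean with the median and to check how much of the total comes from the top few customers. The sketch below uses only the Python standard library; the function names and the toy revenue figures are hypothetical, and a real take-home would likely use pandas on the supplied dataset instead.

```python
import statistics
from typing import Dict, Sequence


def top_share(values: Sequence[float], fraction: float = 0.1) -> float:
    """Share of the total contributed by the largest `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))  # always include at least one value
    return sum(ordered[:k]) / sum(ordered)


def skew_summary(values: Sequence[float]) -> Dict[str, float]:
    """A few numbers that hint at a heavy right tail."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "top_10pct_share": top_share(values, 0.1),
    }


if __name__ == "__main__":
    # Hypothetical revenue per customer: nine small accounts and one very large one.
    revenue = [1, 1, 1, 1, 1, 1, 1, 1, 1, 91]
    print(skew_summary(revenue))
```

A mean far above the median, together with a small group of customers accounting for most of the revenue, is a hint (not proof) of a heavy tail that deserves a closer look, for example on a log-scaled histogram.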

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or find by looking at the answers without the questions on this GitHub repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite, the rationale being that despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more Leetcode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which is typically evaluating your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioral interview questions such as 'Tell me about yourself', 'Why this company?', and 'Tell me about a time you had to deal with a difficult colleague'.

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desk working on their assigned projects. So try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500 company, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially for those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the optionality to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, signing/relocation is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401k matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.
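To make the "higher multiplier" point concrete, here is a small back-of-the-envelope sketch that compares a one-time signing bonus with a higher base salary under compounding raises. Every figure (the salaries, the 3% raise, the four-year horizon) is a made-up assumption for illustration, and the function name is hypothetical; taxes, equity, and benefits are ignored.

```python
def cumulative_pay(base: float, annual_raise: float, years: int,
                   signing_bonus: float = 0.0) -> float:
    """Total cash paid over `years`, with the raise applied once per year."""
    total = signing_bonus
    salary = base
    for _ in range(years):
        total += salary
        salary *= 1 + annual_raise  # next year's salary builds on this year's
    return total


if __name__ == "__main__":
    # Hypothetical offers: A adds a one-time bonus, B raises the base instead.
    offer_a = cumulative_pay(100_000, 0.03, 4, signing_bonus=10_000)
    offer_b = cumulative_pay(105_000, 0.03, 4)
    print(f"Offer A (bonus):       {offer_a:,.0f}")
    print(f"Offer B (higher base): {offer_b:,.0f}")
```

Under these assumed numbers the bonus wins in the first year, but the higher base pulls ahead over the four-year horizon because every later raise is computed on the larger amount.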

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.


Remember, interviewing is a skill that can be learned, just like anything else. Hopefully this article has given you some insight into what to expect in a data science interview loop.

The process also isn't perfect, and there will be times when you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.


CHAPTER 16

Learning How To Learn

• Introduction

• Upgrading Your Learning Toolbox

• Common Learning Traps

• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall: This strategy consists of quizzing yourself on topics you've been learning, with the most common way of doing this being flashcards. One study showed a retention difference of 46 between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spaced repetition is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
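The 1, 3, 7, and 21 day schedule can be turned into concrete review dates with a few lines of Python. This is a minimal sketch that assumes each interval is counted from the day the material was first studied (the text does not say whether the intervals are cumulative); the function and constant names are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Sequence

# Review offsets in days, taken from the schedule described above.
REVIEW_OFFSETS_DAYS = (1, 3, 7, 21)


def review_dates(first_studied: date,
                 offsets: Sequence[int] = REVIEW_OFFSETS_DAYS) -> List[date]:
    """Dates on which to revisit a flashcard, counted from the first study day."""
    return [first_studied + timedelta(days=d) for d in offsets]


if __name__ == "__main__":
    for when in review_dates(date(2019, 3, 1)):
        print(when.isoformat())
```

Dedicated flashcard tools typically adapt the intervals to how well each card is remembered; the fixed offsets here simply mirror the schedule in the text.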

Chunking: This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving: To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type led to students not being able to tell the difference between problems later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined the technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example Chemistry and Biology, where you'll be able to note both similarities and differences.

Deliberate Practice: This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that deliberate practice is easier in some subjects than in others, the gold standard of deliberate practice generally consists of the following:

• A feedback loop in the form of a competition or a test

• A teacher that can guide you

• A uniquely tailored practice regimen

• Work outside of your comfort zone

• A concrete goal or performance

• Full concentration while practicing

• Building or modifying skills

• A known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher: An experienced teacher can mean the difference between breaking through or breaking down. As such, it's important to consider the following when looking for a teacher:

• Keeps you up to date on your specific progress

• Provides practice exercises

• Directs your attention to the aspects of learning you should be focusing on

• Helps develop correct mental representations

• Provides feedback on what errors you're making

• Gives you an aspirational model for what good performance looks like

Dynamic Testing: As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro Technique: This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes, during which you focus on one single thing. Once the timer is up, you take a 5 or 10-minute break before starting again.
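If it helps to see the rhythm laid out, the sketch below generates a plan of alternating focus and break blocks. The function name and the example start time are hypothetical; the 25 and 5 minute defaults come from the description above and can be swapped for 50 and 10.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Block = Tuple[str, datetime, datetime]  # (kind, start, end)


def pomodoro_schedule(start: datetime, cycles: int,
                      focus_minutes: int = 25,
                      break_minutes: int = 5) -> List[Block]:
    """Alternate focus and break blocks, beginning with a focus block."""
    blocks: List[Block] = []
    cursor = start
    for _ in range(cycles):
        focus_end = cursor + timedelta(minutes=focus_minutes)
        blocks.append(("focus", cursor, focus_end))
        break_end = focus_end + timedelta(minutes=break_minutes)
        blocks.append(("break", focus_end, break_end))
        cursor = break_end  # the next focus block starts when the break ends
    return blocks


if __name__ == "__main__":
    for kind, begin, end in pomodoro_schedule(datetime(2019, 3, 19, 9, 0), cycles=2):
        print(f"{kind:5s} {begin:%H:%M} to {end:%H:%M}")
```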

16.3 Common Learning Traps

Authority Trap: The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.


Confirmation Bias: Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect: Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung: Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion: Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them; then, when tested, you fail to get any of the answers correct. Having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking: Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy/agitated/upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

• https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7

• https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn

• https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

• https://www.coursera.org/learn/learning-how-to-learn


CHAPTER 17

Communication

• Introduction

• Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 18

Product

• Introduction

• Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References


CHAPTER 19

Stakeholder Management

• Introduction

• Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

• Agriculture

• Art

• Biology

• Chemistry/Materials Science

• Climate/Weather

• Complex Networks

• Computer Networks

• Data Challenges

• Earth Science

• Economics

• Education

• Energy

• Finance

• GIS

• Government

• Healthcare

• Image Processing

• Machine Learning

• Museums

• Music

• Natural Language

• Neuroscience

• Physics

• Psychology/Cognition

• Public Domains

• Search Engines

• Social Networks

• Social Sciences

• Software

• Sports

• Time Series

• Transportation

20.1 Agriculture

• US Department of Agriculture's PLANTS Database

• US Department of Agriculture's Nutrient Database

20.2 Art

• Google's Quick Draw Sketch Dataset

20.3 Biology

• 1000 Genomes

• American Gut (Microbiome Project)

• Broad Bioimage Benchmark Collection (BBBC)

• Broad Cancer Cell Line Encyclopedia (CCLE)

• Cell Image Library

• Complete Genomics Public Data

• EBI ArrayExpress

• EBI Protein Data Bank in Europe

• Electron Microscopy Pilot Image Archive (EMPIAR)

• ENCODE project

• Ensembl Genomes

• Gene Expression Omnibus (GEO)

• Gene Ontology (GO)

• Global Biotic Interactions (GloBI)

• Harvard Medical School (HMS) LINCS Project

• Human Genome Diversity Project

• Human Microbiome Project (HMP)

• ICOS PSP Benchmark

• International HapMap Project

• Journal of Cell Biology DataViewer

• MIT Cancer Genomics Data

• NCBI Proteins

• NCBI Taxonomy

• NCI Genomic Data Commons

• NIH Microarray data or FTP (see FTP link on RAW)

• OpenSNP genotypes data

• Pathguide - Protein-Protein Interactions Catalog

• Protein Data Bank

• Psychiatric Genomics Consortium

• PubChem Project

• PubGene (now Coremine Medical)

• Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

• Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

• Sequence Read Archive (SRA)

• Stanford Microarray Data

• Stowers Institute Original Data Repository

• Systems Science of Biological Dynamics (SSBD) Database

• The Cancer Genome Atlas (TCGA), available via Broad GDAC

• The Catalogue of Life

• The Personal Genome Project or PGP

• UCSC Public Data

• UniGene

• Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

• NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

• Open Quantum Materials Database

• Citrination Public Datasets

• Khazana Project

20.5 Climate/Weather

• Actuaries Climate Index

• Australian Weather

• Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system

• Brazilian Weather - Historical data (in Portuguese)

• Canadian Meteorological Centre

• Climate Data from UEA (updated monthly)

• European Climate Assessment & Dataset

• Global Climate Data Since 1929

• NASA Global Imagery Browse Services

• NOAA Bering Sea Climate

• NOAA Climate Datasets

• NOAA Realtime Weather Models

• NOAA SURFRAD Meteorology and Radiation Datasets

• The World Bank Open Data Resources for Climate Change

• UEA Climatic Research Unit

• WorldClim - Global Climate Data

• WU Historical Weather Worldwide

20.6 Complex Networks

• AMiner Citation Network Dataset

• CrossRef DOI URLs

• DBLP Citation dataset

• DIMACS Road Networks Collection

• NBER Patent Citations

• Network Repository with Interactive Exploratory Analysis Tools

• NIST complex networks data collection

• Protein-protein interaction network

• PyPI and Maven Dependency Network

• Scopus Citation Database

• Small Network Data

• Stanford GraphBase (Steven Skiena)

• Stanford Large Network Dataset Collection

• Stanford Longitudinal Network Data Sources

• The Koblenz Network Collection

• The Laboratory for Web Algorithmics (UNIMI)

• The Nexus Network Repository

• UCI Network Data Repository

• UFL sparse matrix collection

• WSU Graph Database

20.7 Computer Networks

• 3.5B Web Pages from CommonCrawl 2012

• 53.5B Web clicks of 100K users in Indiana Univ.

• CAIDA Internet Datasets

• ClueWeb09 - 1B web pages

• ClueWeb12 - 733M web pages

• CommonCrawl Web Data over 7 years

• CRAWDAD Wireless datasets from Dartmouth Univ.

• Criteo click-through data

• OONI: Open Observatory of Network Interference - Internet censorship data

• Open Mobile Data by MobiPerf

• Rapid7 Sonar Internet Scans

• UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

• Bruteforce Database

• Challenges in Machine Learning

• CrowdANALYTIX dataX

• D4D Challenge of Orange

• DrivenData Competitions for Social Good

• ICWSM Data Challenge (since 2009)

• Kaggle Competition Data

• KDD Cup by Tencent 2012

• Localytics Data Visualization Challenge

• Netflix Prize

• Space Apps Challenge

• Telecom Italia Big Data Challenge

• TravisTorrent Dataset - MSR'2017 Mining Challenge

• Yelp Dataset Challenge

20.9 Earth Science

• AQUASTAT - Global water resources and uses

• BODC - marine data of ~22K vars

• Earth Models

• EOSDIS - NASA's earth observing system data

• Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3

• Marinexplore - Open Oceanographic Data

• Smithsonian Institution Global Volcano and Eruption Database

• USGS Earthquake Archives

20.10 Economics

• American Economic Association (AEA)

• EconData from UMD

• Economic Freedom of the World Data

• Historical MacroEconomic Statistics

• International Economics Database and various data tools

• International Trade Statistics

• Internet Product Code Database

• Joint External Debt Data Hub

• Jon Haveman International Trade Data Links

• OpenCorporates Database of Companies in the World

• Our World in Data

• SciencesPo World Trade Gravity Datasets

• The Atlas of Economic Complexity

• The Center for International Data

• The Observatory of Economic Complexity

• UN Commodity Trade Statistics

• UN Human Development Reports

20.11 Education

• College Scorecard Data

• Student Data from Free Code Camp

20.12 Energy

• AMPds

• BLUEd

• COMBED

• Dataport

• DRED

• ECO

• EIA

• HES - Household Electricity Study, UK

• HFED

• iAWE

• PLAID - the Plug Load Appliance Identification Dataset

• REDD

• Tracebase

• UK-DALE - UK Domestic Appliance-Level Electricity

• WHITED

20.13 Finance

• CBOE Futures Exchange

• Google Finance

• Google Trends

• NASDAQ

• NYSE Market Data (see FTP link on RAW)

• OANDA

• OSU Financial data

• Quandl

• St. Louis Federal

• Yahoo Finance

20.14 GIS

• ArcGIS Open Data portal

• Cambridge, MA, US, GIS data on GitHub

• Factual Global Location Data

• Geo Spatial Data from ASU

• Geo Wiki Project - Citizen-driven Environmental Monitoring

• GeoFabrik - OSM data extracted to a variety of formats and areas

• GeoNames Worldwide

• Global Administrative Areas Database (GADM)

• Homeland Infrastructure Foundation-Level Data

• Landsat 8 on AWS

• List of all countries in all languages

• National Weather Service GIS Data Portal

• Natural Earth - vectors and rasters of the world

• OpenAddresses

• OpenStreetMap (OSM)

• Pleiades - Gazetteer and graph of ancient places

• Reverse Geocoder using OSM data & additional high-resolution data files

• TIGER/Line - US boundaries and roads

• TwoFishes - Foursquare's coarse geocoder

• TZ Timezones shapefiles

• UN Environmental Data

• World boundaries from the US Department of State

• World countries in multiple formats

20.15 Government

• A list of cities and countries contributed by community

• Open Data for Africa

• OpenDataSoft's list of 1,600 open data

20.16 Healthcare

• EHDP Large Health Data Sets

• Gapminder World demographic databases

• Medicare Coverage Database (MCD), US

• Medicare Data Engine of medicare.gov Data

• Medicare Data File

• MeSH, the vocabulary thesaurus used for indexing articles for PubMed

• Number of Ebola Cases and Deaths in Affected Countries (2014)

• Open-ODS (structure of the UK NHS)

• OpenPaymentsData, Healthcare financial relationship data

• The Cancer Genome Atlas project (TCGA) and BigQuery table

• World Health Organization Global Health Observatory

2017 Image Processing

bull 10k US Adult Faces Database

bull 2GB of Photos of Cats or Archive version

bull Adience Unfiltered faces for gender and age classification

bull Affective Image Classification

bull Animals with attributes

bull Caltech Pedestrian Detection Benchmark

bull Chars74K dataset Character Recognition in Natural Images (both English and Kannada are available)

bull Face Recognition Benchmark

bull GDXray X-ray images for X-ray testing and Computer Vision

bull ImageNet (in WordNet hierarchy)

bull Indoor Scene Recognition

bull International Affective Picture System UFL

bull Massive Visual Memory Stimuli MIT

bull MNIST database of handwritten digits near 1 million examples

bull Several Shape-from-Silhouette Datasets

bull Stanford Dogs Dataset

bull SUN database MIT

bull The Action Similarity Labeling (ASLAN) Challenge

bull The Oxford-IIIT Pet Dataset

bull Violent-Flows - Crowd Violence / Non-violence Database and benchmark

bull Visual genome

bull YouTube Faces Database

20.18 Machine Learning

bull Context-aware data sets from five domains

bull Delve Datasets for classification and regression (Univ. of Toronto)

bull Discogs Monthly Data

bull eBay Online Auctions (2012)

bull IMDb Database

bull Keel Repository for classification, regression and time series

bull Labeled Faces in the Wild (LFW)

bull Lending Club Loan Data

bull Machine Learning Data Set Repository

bull Million Song Dataset

bull More Song Datasets

bull MovieLens Data Sets

bull New Yorker caption contest ratings

bull RDataMining - "R and Data Mining" ebook data

bull Registered Meteorites on Earth

bull Restaurants Health Score Data in San Francisco

bull UCI Machine Learning Repository

bull Yahoo Ratings and Classification Data

bull YouTube-8M

20.19 Museums

bull Canada Science and Technology Museums Corporation's Open Data

bull Cooper-Hewitt's Collection Database

bull Minneapolis Institute of Arts metadata

bull Natural History Museum (London) Data Portal

bull Rijksmuseum Historical Art Collection

bull Tate Collection metadata

bull The Getty vocabularies

20.20 Music

bull Nottingham Folk Songs

bull Bach 10

20.21 Natural Language

bull Automatic Keyphrase Extraction

bull Blogger Corpus

bull CLiPS Stylometry Investigation Corpus

bull ClueWeb09 FACC

bull ClueWeb12 FACC

bull DBpedia - 4.58M things with 583M facts

bull Flickr Personal Taxonomies

bull Freebase.com of people, places, and things

bull Google Books Ngrams (2.2TB)

bull Google MC-AFP, generated based on the publicly available Gigaword dataset using Paragraph Vectors

bull Google Web 5gram (1TB, 2006)

bull Gutenberg eBooks List

bull Hansards text chunks of Canadian Parliament

bull Machine Comprehension Test (MCTest) of text from Microsoft Research

bull Machine Translation of European languages

bull Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

bull Multi-Domain Sentiment Dataset (version 2.0)

bull Open Multilingual Wordnet

bull Personae Corpus

bull SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

bull SMS Spam Collection in English

bull Universal Dependencies

bull USENET postings corpus of 2005~2011

bull Webhose - News/Blogs in multiple languages

bull Wikidata - Wikipedia databases

bull Wikipedia Links data - 40 Million Entities in Context

bull WordNet databases and tools

20.22 Neuroscience

bull Allen Institute Datasets

bull Brain Catalogue

bull Brainomics

bull CodeNeuro Datasets

bull Collaborative Research in Computational Neuroscience (CRCNS)

bull FCP-INDI

bull Human Connectome Project

bull NDAR

bull NeuroData

bull Neuroelectro

bull NIMH Data Archive

bull OASIS

bull OpenfMRI

bull Study Forrest

20.23 Physics

bull CERN Open Data Portal

bull Crystallography Open Database

bull NASA Exoplanet Archive

bull NSSDC (NASA) data of 550 spacecraft

bull Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

bull OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

bull Amazon

bull Archive-it from Internet Archive

bull Archive.org Datasets

bull CMU JASA data archive

bull CMU StatLab collections

bull Data.World

bull Data360

bull Datamob.org

bull Google

bull Infochimps

bull KDNuggets Data Collections

bull Microsoft Azure Data Market Free DataSets

bull Microsoft Data Science for Research

bull Numbray

bull Open Library Data Dumps

bull Reddit Datasets

bull RevolutionAnalytics Collection

bull Sample R data sets

bull Stats4Stem R data sets

bull StatSci.org

bull The Washington Post List

bull UCLA SOCR data collection

bull UFO Reports

bull Wikileaks 9/11 pager intercepts

bull Yahoo Webscope

20.26 Search Engines

bull Academic Torrents of data sharing from UMB

bull Datahub.io

bull DataMarket (Qlik)

bull Harvard Dataverse Network of scientific data

bull ICPSR (UMICH)

bull Institute of Education Sciences

bull National Technical Reports Library

bull Open Data Certificates (beta)

bull OpenDataNetwork - A search engine of all Socrata powered data portals

bull Statista.com - statistics and studies

bull Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

bull 72 hours gamergate Twitter Scrape

bull Ancestry.com Forum Dataset over 10 years

bull Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

bull CMU Enron Email of 150 users

bull EDRM Enron EMail of 151 users, hosted on S3

bull Facebook Data Scrape (2005)

bull Facebook Social Networks from LAW (since 2007)

bull Foursquare from UMN/Sarwat (2013)

bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull Youtube Video Social Graph in 2007, 2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive

bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (ND-GAIN)

bull Open Crime and Policing Data in England, Wales and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1 from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database

bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks port calls and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport airline and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome-machine-learning

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python

bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF)

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph

bull CUDA - This is a fast C++/CUDA implementation of convolutional neural networks [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework with less than 1000 lines of core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state of the art machine learning easy to work with and integrate into existing applications.

bull Distributed Machine Learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that change for every training instance. Written in C++ with bindings in Python.

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and allows data to be processed in batch, online and distributed modes.

bull LightGBM framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs.

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization and storage.

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP.

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ for segmenting/labeling sequential data & other Natural Language Processing tasks

bull CRFsuite for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user-defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source, C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression
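As a rough illustration of what an in-memory n-gram index does, here is a minimal Python sketch. It is not the API of go-ngram or of any library listed here: the names `ngrams` and `build_index` are hypothetical, the n-grams are word-level, and no compression is attempted.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return the contiguous n-token tuples of `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(docs, n=2):
    """Map each n-gram to the set of document ids that contain it.

    `docs` is assumed to be a dict of {doc_id: text}; text is lowercased
    and split on whitespace, which is a deliberately naive tokenizer.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text.lower().split(), n):
            index[gram].add(doc_id)
    return index


if __name__ == "__main__":
    idx = build_index({1: "the quick fox", 2: "The quick dog"})
    print(sorted(idx[("the", "quick")]))
```

Looking up an n-gram in the resulting dictionary returns the ids of the documents that contain it, which is the basic operation an n-gram index supports.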

General-Purpose Machine Learning

bull gago - Multi-population, flexible, parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Cortical.io as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System

bull Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools to social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK components in Java and is built on top of Apache UIMA

bull Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license.

bull CogcompNLP developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deployment of Apache Spark MLLib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application.

Speech Recognition

bull CMU Sphinx - Open Source Toolkit For Speech Recognition, purely based on Java speech recognition library

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in javascript and coffeescript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straight forward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3d plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - library written on Vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnet.js - ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means and agglomerative clustering

bull Node-fann bindings for Node.js

bull Kmeans.js - Simple Javascript implementation of the k-means algorithm, for node.js and the browser

bull LDA.js - LDA topic modeling for node.js

bull Learning.js - Javascript implementation of logistic regression/c4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration - just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has advantage on large dataset and multi-threaded training.

bull kalman - Kalman filter for Javascript

bull shaman - node.js library with support for both simple and multiple linear regression

bull ml.js - Machine learning and numerical analysis tools for Node.js and the Browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

bull simple-statistics as well as in node.js

bull regression-js - A javascript library containing a collection of least squares fitting methods for finding a trend in a set of data

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance
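For readers unfamiliar with great circle distance, the following Python sketch shows one common way to compute it, the haversine formula. It is not taken from the GreatCircle library: the function name `great_circle_km` and the mean Earth radius of 6371.0 km are assumptions made for illustration, and the Earth is treated as a perfect sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)


def great_circle_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


if __name__ == "__main__":
    # Equator to North Pole along a meridian: a quarter of the circumference
    print(great_circle_km(0.0, 0.0, 90.0, 0.0))
```

On a spherical model the result can differ from geodesic distances on an ellipsoid by a fraction of a percent, which is usually acceptable for exploratory analysis.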

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models mixed-effects models

bull Simple MCMC - basic mcmc sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVM's for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata, SAS, and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version

bull graph - Graph package for Torch

bull randomkit - NumPy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - Framework for Torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for the nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua, and then optimizing them with OpenGM

bull spaghetti - Spaghetti (sparse linear) module for torch7 by MichaelMathieu

bull LuaSHKit - A Lua wrapper around the locality-sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch Provides SIFT and dSIFT modules

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets, including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform, designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - A technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning with PDFs, lectures, and code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB

Data Analysis / Data Visualization

bull matlab_bgl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross-platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework

bull DiffSharp - An automatic differentiation (AD) library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others
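
The DiffSharp entry above notes that differentiation operations can be nested to obtain higher-order derivatives. The idea can be sketched in Python with forward-mode dual numbers. This is not DiffSharp's API: the `Dual` class and `derivative` helper are hypothetical names introduced for illustration, only addition and multiplication are supported, and this naive nesting is only safe for simple single-variable cases such as the one shown.

```python
class Dual:
    """A number value + deriv*eps with eps**2 == 0; deriv carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u*v)' = u*v' + u'*v
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def derivative(f):
    """Return a function that evaluates df/dx by forward-mode differentiation."""
    def df(x):
        result = f(Dual(x, 1.0))  # seed the input with derivative 1
        return result.deriv if isinstance(result, Dual) else 0.0
    return df


def cube(x):
    return x * x * x


print(derivative(cube)(3.0))              # 27.0 (3 * x**2 at x = 3)
print(derivative(derivative(cube))(3.0))  # 18.0 (6 * x at x = 3)
```

Nesting works here because `Dual` accepts another `Dual` as its value, so the outer call differentiates the inner derivative computation itself.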

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and every day use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin

bull Sho - An interactive environment for data analysis and scientific computing, designed to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples by trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift

bull BPN-NeuralNetwork - A back-propagation neural network (BPN). This network can be used in product recommendation, user behavior analysis, data mining and data analysis

bull Multi-Perceptron-NeuralNetwork - A multi-perceptron neural network designed with unlimited hidden layers

bull KRHebbian-Algorithm - A Hebbian learning algorithm for neural networks

bull KRKmeans-Algorithm - Implements K-Means, the clustering and classification algorithm. It could be used in data mining and image compression

bull KRFuzzyCMeans-Algorithm - Implements Fuzzy C-Means, the fuzzy clustering/classification algorithm. It could be used in data mining and image compression
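
The K-Means entry above names the algorithm without describing it. As a rough illustration (written in Python rather than Objective-C, and not the KRKmeans-Algorithm API), K-Means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The `kmeans` function, its parameters, and the naive "first k points" initialization are assumptions made for this sketch.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, assignments)."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # [0, 1, 0, 0, 1, 1]
```

Production implementations usually choose the initial centroids more carefully (for example with random restarts), since the result depends on the starting positions.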

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others

bull Quepy - A Python framework to transform natural language questions to queries in a database query language

bull TextBlob - A consistent API for common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments

bull python-ucto

bull python-frog

bull python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to a C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - A Python library for doing approximate and phonetic matching of strings

bull editdistance - Fast implementation of edit distance

bull textacy - Higher-level NLP built on spaCy

bull stanford-corenlp-python - Python wrapper for Stanford CoreNLP
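
Several entries in this list (Distance, editdistance, jellyfish, Fuzzy Wuzzy) are built around string distance measures. A minimal pure-Python sketch of the two measures named in the Distance entry, Hamming and Levenshtein distance, is given here for orientation. The function names are illustrative and are not the API of any of the libraries listed above.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

The Levenshtein version keeps only two rows of the dynamic-programming table, so memory use grows with the length of the second string only.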

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning

bull machine learning - Automated build consisting of a web-interface and a set of programmatic interfaces, with generated models stored into a NoSQL datastore

bull XGBoost - Python bindings for the eXtreme Gradient Boosting (Tree) Library

bull Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A library with various machine learning models implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 - A Machine Learning library based on Theano

bull keras - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks - Implementation of image to image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing GPU-meta-programming code generating array oriented optimizing math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for the book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning

bull pgmpy - A Python library for working with Probabilistic Graphical Models

bull DIGITS - The Deep Learning GPU Training System (DIGITS) is a web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides better user experience

bull rgf_python - Python bindings for the Regularized Greedy Forest (Tree) Library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more. It's easy to use and easy to learn for beginners

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Python JIT (just in time) compiler to LLVM aimed at scientific Python by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual, intuitive, and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has a Python API

bull HDBScan - Implementation of the hdbscan algorithm in Python, used for clustering

bull visualize_ML - A Python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg–Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in Python, without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspell - raspell is an interface binding for Ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer, a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community based data collection, packed in a gem. Get a list of pretty much anything (stop words, countries, non words) in txt, json or hash. Demo/Search for a list

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license

bull rustlearn - A machine learning framework featuring logistic regression, support vector machines, decision trees and random forests

bull rusty-machine - A pure-Rust machine learning library

bull leaf - Open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz Regularization for semiparametric additive hazards regression

bull arules - arules Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf Big Random Forests Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta A wrapper algorithm for all-relevant feature selection

bull bst - bst Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training: unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn Classification regression feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist Rule- and Instance-Based Regression Modeling

bull e1071 - e1071: Misc Functions of the Department of Statistics (e1071), TU Wien

bull earth - earth Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc Flexible procedures for clustering

bull frbs - frbs Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS Boosting Methods for GAMLSS

bull gbm - gbm Generalized Boosted Regression Models

bull glmnet - glmnet Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale – Deeplearning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred Improved Predictors

bull kernlab - kernlab Kernel-based Machine Learning Lab

bull klaR - klaR Classification and visualization

bull lars - lars Least Angle Regression Lasso and Forward Stagewise

• lasso2 - lasso2: L1 constrained estimation aka 'lasso'

• LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree Mapping pruning and graphing tree models

bull mboost - mboost Model-Based Boosting

bull medley - medley Blending regression models using a greedy stepwise approach

bull mlr - mlr Machine Learning in R

bull mvpart - mvpart Multivariate partitioning

bull ncvreg - ncvreg Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree Oblique Trees for Classification Data

bull pamr - pamr Pam prediction analysis for microarrays

bull party - party A Laboratory for Recursive Partytioning

bull partykit - partykit A Toolkit for Recursive Partytioning

• penalized - penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model

• penalizedLDA - penalizedLDA: Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM Feature Selection SVM using penalty functions

bull quantregForest - quantregForest Quantile Regression Forests

• randomForest - randomForest: Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle Graphical user interface for data mining in R

bull rda - rda Shrunken Centroids Regularized Discriminant Analysis

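• rdetools - rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces

• REEMtree - REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data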

bull relaxo - relaxo Relaxed Lasso

bull rgenoud - rgenoud R version of GENetic Optimization Using Derivatives

bull rgp - rgp R genetic programming framework

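• Rmalschains - Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R

• rminer - rminer: Simpler use of data mining methods (e.g. NN and SVM) in classification and regression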

bull ROCR - ROCR Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart Recursive Partitioning and Regression Trees

bull RPMM - RPMM Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka RWeka interface

bull RXshrink - RXshrink Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA Stepwise Diagonal Discriminant Analysis

• SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath svmpath the SVM Path algorithm

bull tgp - tgp Bayesian treed Gaussian process models

bull tree - tree Classification and regression trees

bull varSelRF - varSelRF Variable selection using random forests

• XGBoost.R - R binding for the eXtreme Gradient Boosting (Tree) Library

• Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R.

• igraph - binding to igraph library - General purpose graph library

• MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis / Data Visualization

• ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

• Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

• Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis / Data Visualization

• SAS/STAT - For conducting advanced statistical analysis

• University Edition - FREE! Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses

High Performance Machine Learning

• High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

• High Performance Text Mining - Text mining using a GUI or code in an MPP environment, including Hadoop

Natural Language Processing

• Contextual Analysis - Add structure to unstructured text using a GUI

• Sentiment Analysis - Extract sentiment from text using a GUI

• Text Miner - Text mining using a GUI or code

Demos and Scripts

• ML_Tables - Concise cheat sheets containing machine learning best practices

• enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

• enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

• enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

• dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

• FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters, and performing inference.

Data Analysis / Data Visualization

• MLlib in Apache Spark - Distributed machine learning library in Spark

• Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch, or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

• simmer - Reduce your data. A unix filter for algebird-powered aggregation.

bull PredictionIO - PredictionIO a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

• adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

• DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

21.25 Swift

General-Purpose Machine Learning

• Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

• BrainCore - The iOS and OS X neural network framework

• swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

• DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems.

• AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

• MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression, and Ridge Regression.

• Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc.

CHAPTER 22

Papers

• Machine Learning

• Deep Learning

– Understanding

– Optimization / Training Techniques

– Unsupervised / Generative Models

– Image Segmentation / Object Detection

– Image / Video

– Natural Language Processing

– Speech / Other

– Reinforcement Learning

– New papers

– Classic Papers

22.1 Machine Learning

Be the first to contribute

22.2 Deep Learning

Forked from terryum's awesome deep learning papers.

22.2.1 Understanding

• Distilling the knowledge in a neural network (2015), G. Hinton et al. [pdf]

• Deep neural networks are easily fooled: High confidence predictions for unrecognizable images (2015), A. Nguyen et al. [pdf]

bull How transferable are features in deep neural networks (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf An astounding baseline for recognition (2014) A Razavian et al [pdf]

• Learning and transferring mid-level image representations using convolutional neural networks (2014), M. Oquab et al. [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

22.2.2 Optimization / Training Techniques

• Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Ioffe and C. Szegedy [pdf]

• Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), K. He et al. [pdf]

bull Dropout A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]

22.2.3 Unsupervised / Generative Models

• Pixel recurrent neural networks (2016), A. Oord et al. [pdf]

• Improved techniques for training GANs (2016), T. Salimans et al. [pdf]

• Unsupervised representation learning with deep convolutional generative adversarial networks (2015), A. Radford et al. [pdf]

bull DRAW A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

22.2.4 Image Segmentation / Object Detection

bull You only look once Unified real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

22.2.5 Image / Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

• Long-term recurrent convolutional networks for visual recognition and description (2015), J. Donahue et al. [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

22.2.6 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

• Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014), K. Cho et al. [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

22.2.7 Speech / Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

• Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups (2012), G. Hinton et al. [pdf]

• Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012), G. Dahl et al. [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

22.2.8 Reinforcement Learning

• End-to-end training of deep visuomotor policies (2016), S. Levine et al. [pdf]

• Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016), S. Levine et al. [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

22.2.9 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

• Batch renormalization: Towards reducing minibatch dependence in batch-normalized models (2017), S. Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22.2.10 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

• Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion (2010), P. Vincent et al. [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

• Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009), H. Lee et al. [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses and more, forked from josephmisiti's awesome machine learning.

• Blogs

– Data Science

– Machine learning

– Math

• Books

– Machine learning

– Deep learning

– Probability & Statistics

– Linear Algebra

• Courses

• Podcasts

• Tutorials

23.1 Blogs

23.1.1 Data Science

• https://jeremykun.com

• http://iamtrask.github.io

• http://blog.explainmydata.com

• http://andrewgelman.com

• http://simplystatistics.org

• http://www.evanmiller.org

• http://jakevdp.github.io

• http://blog.yhat.com

• http://wesmckinney.com

• http://www.overkillanalytics.net

• http://newton.cx/~peter

• http://mbakker7.github.io/exploratory_computing_with_python

• https://sebastianraschka.com/blog/index.html

• http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

• http://colah.github.io

• http://www.thomasdimson.com

• http://blog.smellthedata.com

• https://sebastianraschka.com

• http://dogdogfish.com

• http://www.johnmyleswhite.com

• http://drewconway.com/zia

• http://bugra.github.io

• http://opendata.cern.ch

• https://alexanderetz.com

• http://www.sumsar.net

• https://www.countbayesie.com

• http://blog.kaggle.com

• http://www.danvk.org

• http://hunch.net

• http://www.randalolson.com/blog

• https://www.johndcook.com/blog/r_language_for_programmers

• http://www.dataschool.io

23.1.2 Machine learning

• OpenAI

• Distill

• Andrej Karpathy Blog

• Colah's Blog

bull WildML

bull FastML

bull TheMorningPaper

23.1.3 Math

• http://www.sumsar.net

• http://allendowney.blogspot.ca

• https://healthyalgorithms.com

• https://petewarden.com

• http://mrtz.org/blog

23.2 Books

23.2.1 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

• Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory Inference and Learning Algorithms

bull Gaussian Processes for Machine Learning

• Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

• Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and SVN Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch

bull Bayesian Modeling Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning Neural and Statistical Classification

• Bayesian Reasoning and Machine Learning - Book + Matlab ToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

23.2.2 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

• NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

23.2.3 Probability & Statistics

bull Think Stats - Book + Python Code

bull From Algorithms to Z-Scores - Book

bull The Art of R Programming

bull Introduction to statistical thought

bull Basic Probability Theory

bull Introduction to probability - By Dartmouth College

bull Principle of Uncertainty

• Probability & Statistics Cookbook

bull Advanced Data Analysis From An Elementary Point of View

bull Introduction to Probability - Book and course by MIT

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction - Book

bull An Introduction to Statistical Learning with Applications in R - Book

bull Learning Statistics Using R

bull Introduction to Probability and Statistics Using R - Book

bull Advanced R Programming - Book

bull Practical Regression and Anova using R - Book

bull R practicals - Book

bull The R Inferno - Book

23.2.4 Linear Algebra

bull Linear Algebra Done Wrong

bull Linear Algebra Theory and Applications

bull Convex Optimization

bull Applied Numerical Computing

bull Applied Numerical Linear Algebra

23.3 Courses

• CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University

• CS224d: Deep Learning for Natural Language Processing, Stanford University

• Oxford Deep NLP 2017: Deep Learning for Natural Language Processing, University of Oxford

bull Artificial Intelligence (Columbia University) - free

bull Machine Learning (Columbia University) - free

bull Machine Learning (Stanford University) - free

bull Neural Networks for Machine Learning (University of Toronto) - free

• Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach, Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering & Retrieval, Machine Learning: Recommender Systems & Dimensionality Reduction, Machine Learning Capstone: An Intelligent Application with Deep Learning; free

• Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

bull Learning from Data (by Yaser S Abu-Mostafa Caltech) - Lecture videos available

23.4 Podcasts

• The O'Reilly Data Show

bull Partially Derivative

bull The Talking Machines

bull The Data Skeptic

bull Linear Digressions

bull Data Stories

bull Learning Machines 101

bull Not So Standard Deviations

bull TWIMLAI

23.5 Tutorials

Be the first to contribute

CHAPTER 24

Contribute

Become a contributor. Check out our GitHub for more information.

CHAPTER 1

Bash

• Introduction

• Subj_1

1.1 Introduction

xx

1.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 2

Big Data

• Introduction

• Subj_1

https://www.quora.com/What-is-a-Hadoop-ecosystem

2.1 Introduction

The tooling and methodologies for dealing with the phenomenon of big data were originally created in response to the exponential growth in data generation (TODO: insert quote about data generation). When talking about big data, we mean (insert what we mean).

We'll be focusing exclusively on the Apache Hadoop Ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims.

Why It's Important

While a data scientist may not be directly involved in managing big data and performing Extract, Transform, and Load (ETL) tasks, they'll likely be responsible, at a minimum, for querying the sources where this data is stored. Ideally, a data scientist should be familiar with how this data is stored, how to query it, and how to transform it (e.g. Spark, Hadoop, and Hive are three key technologies to be familiar with).

2.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 3

Databases

• Introduction

• Relational

– Types

– Advantages

– Disadvantages

• Non-relational/NoSQL

– Types

– Advantages

– Disadvantages

• Terminology

3.1 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing, and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds there are a variety of database sub-types, which the following sections explore in detail.

Why It's Important

A data scientist will often have to work with a database in order to retrieve the data they need to build their models or to do data analysis. It is therefore important that a data scientist be familiar with the different types of databases and how to query them.

3.2 Relational

A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes) and rows that represent the data within the table. You can think of a column as the header indicating the name of the data being stored, and a row as the actual data points that are stored in the database. Each row will often have a unique identifier associated with it, called a primary key. Relationships between tables are created using foreign keys, which reference the primary key of another table and whose primary function is to join tables together.
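To make primary and foreign keys concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, columns, and rows (authors, books, and their contents) are invented for illustration and are not part of this primer.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Each table has a primary key; books.author_id is a foreign key into authors.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE books ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES authors(id))"
)

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada'), (2, 'Grace')")
conn.execute(
    "INSERT INTO books (id, title, author_id) VALUES"
    " (1, 'Notes', 1), (2, 'Compilers', 2), (3, 'Engines', 1)"
)

# Join the two tables through the foreign key.
rows = conn.execute(
    "SELECT books.title, authors.name"
    " FROM books JOIN authors ON books.author_id = authors.id"
    " ORDER BY books.id"
).fetchall()
print(rows)
```

The JOIN pairs every book with the author row whose primary key matches the book's foreign key, which is the sense in which foreign keys "join tables together".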

To get data out of a relational database you would use SQL, with different relational databases using slightly different versions (dialects) of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a Transaction. Transactions in relational databases represent a unit of work performed against the database.

An important concept with transactions in relational databases is maintaining ACID, which stands for:

• Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails, then the whole transaction fails, leaving the database unchanged.

• Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.

• Isolation: Since transactions may be executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.

• Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
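Atomicity is the easiest of these properties to observe directly. The sketch below, again using Python's sqlite3 module, runs two UPDATE statements as one transaction and shows that when the second one fails, the first is rolled back as well. The accounts table, the transfer helper, and the balances are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move `amount` between accounts as one transaction; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates succeed
failed = transfer(conn, "alice", "bob", 500)  # the debit violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob has already been applied when the debit is rejected, yet the final balances show no trace of it: the whole unit of work was undone.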

3.2.1 Types

While there are many different kinds of relational databases, the most popular are:

• PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.

• Oracle: A multi-model database management system that is produced and marketed by Oracle.

• MySQL: An open-source relational database management system, with a proprietary paid version available for additional functionality.

• Microsoft SQL Server: A relational database management system developed by Microsoft.

• MariaDB: A MySQL-compatible database engine forked from MySQL.

3.2.2 Advantages

• SQL standards are well defined and commonly accepted

• Easy to categorize and store data that can later be queried

• Simple to understand, since the table structure and relationships are intuitive to most users

• Data integrity through strong data typing and validity checks that make sure data falls within an acceptable range
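The validity checks mentioned in the last point can be sketched with column constraints. The example below uses Python's sqlite3 module with a made-up readings table; note that SQLite itself is fairly permissive about column types, so the sketch relies on NOT NULL and CHECK constraints rather than on type enforcement, which other engines apply more strictly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: humidity must be present and lie between 0 and 100.
conn.execute(
    "CREATE TABLE readings ("
    " sensor TEXT NOT NULL,"
    " humidity REAL NOT NULL CHECK (humidity BETWEEN 0 AND 100))"
)


def try_insert(sensor, humidity):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO readings VALUES (?, ?)", (sensor, humidity))
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert("a1", 42.5),   # valid row
    try_insert("a1", 140.0),  # outside the accepted range
    try_insert(None, 10.0),   # missing a required value
]
stored = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(results, stored)
```

Only the first row is stored; the database itself refuses the other two, so invalid data never has to be cleaned up downstream.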

3.2.3 Disadvantages

• Poor performance with unstructured data types due to schema and type constraints

• Can be slow and less scalable compared to NoSQL

• Unable to map certain kinds of data, such as graphs

3.3 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Scalability, multi-structured data, and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

331 Types

bull Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. They can be used to store user session data. Examples of these types of databases include Redis and Amazon Dynamo.

bull Document Store: Similar to a key-value store, but in this case the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.

bull Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. These can be used in content management systems. Examples of these types of databases include Cassandra and Apache HBase.

bull Graph Store: These are based on the Entity-Attribute-Value model. Entities have associated attributes and subsequent values when data is inserted. Nodes store data about each entity, along with the relationships between nodes. Graph stores can be used in applications such as social networks. Examples of these types of databases include Neo4j and ArangoDB.
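
To make the four data models above more concrete, here is a minimal sketch that mimics each one with plain in-memory Python structures. It does not use any real database client; all keys, field names, and the followers_of helper are hypothetical and only illustrate the shape of the data.

```python
# Key-value store: each unique key maps to an opaque value (e.g. session data).
session_store = {"session:42": "user_id=7;cart_items=3"}

# Document store: the value is a structured (or semi-structured) document.
blog_posts = {
    "post:1": {
        "title": "Hello",
        "tags": ["intro"],
        "comments": [{"author": "ann", "text": "Nice post"}],
    }
}

# Column store: row key -> column family -> column -> value.
articles = {
    "article:1": {
        "meta": {"author": "ann", "lang": "en"},
        "body": {"text": "First article body"},
    }
}

# Graph store: nodes hold data about entities, edges hold relationships.
nodes = {"ann": {"kind": "person"}, "bob": {"kind": "person"}}
edges = [("ann", "follows", "bob")]


def followers_of(person):
    """Return every node that has a 'follows' edge pointing at `person`."""
    return [src for src, relation, dst in edges
            if relation == "follows" and dst == person]


print(followers_of("bob"))
```

In a real system each structure would live in a dedicated database; the sketch only shows why, for example, a relationship query is natural in the graph model and awkward in a plain key-value model.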

332 Advantages

bull High availability

bull Schema free or schema-on-read options

bull Ability to rapidly prototype applications

bull Elastic scalability

bull Can store massive amounts of data

333 Disadvantages

bull Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync

bull Less support and maturity in the NoSQL ecosystem

34 Terminology

Query: A query can be thought of as a single action that is taken on a database.

Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as in data stored. There are two main types. Vertical scalability is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc. Horizontal scalability has to do with adding more machines and splitting the work amongst them.

Normalization: This is a technique for organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique for organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
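
Several of the terms above (schema, query, transaction, normalization) can be seen together in a short sketch using Python's built-in sqlite3 module with an in-memory database. The customers and orders tables and their contents are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema (normalized): customer details live in one table,
# and orders reference them by id instead of repeating the name.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders (customer_id, amount) VALUES (1, 20.0), (1, 5.0), (2, 12.5);
""")

# Query: normalized tables are recombined with a JOIN.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(totals)

# Transaction: both inserts succeed together or not at all (atomicity).
try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (2, 99.0)")
        # The next insert violates NOT NULL, so the whole transaction is undone.
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (2, NULL)")
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(order_count)
```

A denormalized design would instead store the customer name directly on each order row, trading redundancy for fewer JOINs.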

CHAPTER 4

Data Engineering

bull Introduction

bull Subj_1

41 Introduction

xxx

42 Subj_1

Code

Some code

print("hello")

References

CHAPTER 5

Data Wrangling

bull Introduction

bull Subj_1

51 Introduction

xxx

52 Subj_1

Code

Some code

print("hello")

References

CHAPTER 6

Data Visualization

bull Introduction

bull Subj_1

61 Introduction

xxx

62 Subj_1

Code

Some code

print("hello")

References

CHAPTER 7

Deep Learning

bull Introduction

bull Subj_1

71 Introduction

xxx

72 Subj_1

Code

Some code

print("hello")

References

CHAPTER 8

Machine Learning

bull Introduction

bull Subj_1

81 Introduction

xxx

82 Subj_1

Code

Some code

print("hello")

References

CHAPTER 9

Python

bull Introduction

bull Subj_1

91 Introduction

xxx

92 Subj_1

Code

Some code

print("hello")

References

CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References

CHAPTER 11

SQL

bull Introduction

bull Subj_1

111 Introduction

xxx

112 Subj_1

Code

Some code

print("hello")

References

CHAPTER 12

Glossary

Definitions of common machine learning terms

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

bull Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in a low average difference.

bull High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold: The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True; otherwise return False.

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories:

bull True Positives: we correctly predicted they do have diabetes

bull True Negatives: we correctly predicted they don't have diabetes

bull False Positives: we incorrectly predicted they do have diabetes (Type I error)

bull False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
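
As a rough illustration of how a classification threshold and the four confusion matrix categories fit together, the sketch below turns predicted probabilities into True/False predictions and counts each category. The function name confusion_counts, the 0.5 threshold, and the sample values are assumptions made for the example.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, TN, FP and FN for binary labels and predicted probabilities."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, prob in zip(y_true, y_prob):
        predicted = prob > threshold  # classification threshold
        if predicted and actual:
            counts["TP"] += 1
        elif predicted and not actual:
            counts["FP"] += 1  # Type I error
        elif not predicted and actual:
            counts["FN"] += 1  # Type II error
        else:
            counts["TN"] += 1
    return counts


# Hypothetical labels (has diabetes?) and model probabilities.
y_true = [True, False, True, False, True]
y_prob = [0.9, 0.6, 0.4, 0.1, 0.7]

counts = confusion_counts(y_true, y_prob)
print(counts)
```

Raising the threshold generally trades false positives for false negatives, which is why the threshold is usually chosen with the cost of each error type in mind.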

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep learning is derived from a machine learning algorithm called the perceptron, or multi-layer perceptron, which is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning is decades old, but today we have far more data and cheap computing power, which make it powerful enough to achieve state-of-the-art accuracy. This family of algorithms is also known as artificial neural networks. Deep learning is much more than the traditional artificial neural network, but it was highly influenced by machine learning's neural networks and perceptron networks.

Dimension: In machine learning and data science, dimension means something different than in physics. Here, the dimension of data is the number of features in your data set. For example, in an object detection application, the flattened image size and color channels (e.g. 28 x 28 x 3) give the features of the input set. In house price prediction, if house size is the only feature in the data set, we call it 1-dimensional data.

Epoch: An epoch is one complete pass of the algorithm over the entire data set. The number of epochs is the number of times the algorithm sees the entire data set.

Extrapolation: Making predictions outside the range of a dataset. E.g. My dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. "Color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection: Feature selection is the process of selecting relevant features from a data set for creating a machine learning model.

Feature Vector: A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters: Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of the model. The depth of trees in a decision tree and the number of hidden layers in a neural network are examples of hyperparameters.

Induction: A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. We keep observing X, so we infer that Y must be True.

Instance: A data point, row, or sample in a dataset. Another term for observation.

Learning Rate: The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate we can confidently move in the direction of the negative gradient, since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
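
The trade-off described for the learning rate can be seen in a small sketch that runs gradient descent on f(x) = x**2, whose lowest point is at x = 0. The function, starting point, step count, and the three learning rates are assumptions chosen only to make the behavior visible.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(x) = x**2 and return the final position."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                  # derivative of x**2
        x = x - learning_rate * gradient  # step against the gradient
    return x


small = gradient_descent(0.01)     # precise but slow: still far from 0
moderate = gradient_descent(0.1)   # gets close to the minimum
too_high = gradient_descent(1.1)   # overshoots on every step and diverges

print(small, moderate, too_high)
```

With the same number of steps, the smallest rate has barely moved, the moderate rate is near the bottom, and the largest rate ends up further from the minimum than where it started.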

Loss: Loss = true value (from the data set) - predicted value (from the ML model). The lower the loss, the better the model (unless the model has overfitted to the training data). The loss is calculated on the training and validation sets, and its interpretation is how well the model is doing on these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation sets.
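
In practice the per-example differences are usually squared or made absolute before being combined, so that positive and negative errors do not cancel out. The sketch below uses mean squared error as one common choice; it is an assumption for illustration, not the only possible loss, and the sample values are made up.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


y_true = [3.0, 5.0, 2.0]
good_predictions = [2.5, 5.0, 2.5]
bad_predictions = [0.0, 9.0, 6.0]

good_loss = mean_squared_error(y_true, good_predictions)
bad_loss = mean_squared_error(y_true, bad_predictions)
print(good_loss, bad_loss)  # the better predictions give the lower loss
```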

Machine Learning: Contribute a definition!

Model: A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks: Contribute a definition!

Normalization: Contribute a definition!

Null Accuracy: Baseline accuracy that can be achieved by always predicting the most frequent class ("B has the highest frequency, so let's guess B every time").
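
A minimal sketch of computing null accuracy, assuming the labels are available as a simple Python list (the labels themselves are made up):

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


labels = ["B", "A", "B", "B", "C", "B", "A", "B"]
baseline = null_accuracy(labels)
print(baseline)
```

Any model worth keeping should beat this baseline on the same data.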

Observation: A data point, row, or sample in a dataset. Another term for instance.

Overfitting: Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set but poorly on your test set (or new real-world data).

Parameters: Be the first to contribute!

Precision: In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{TruePositives}{TruePositives + FalsePositives}$$

Recall: Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

$$R = \frac{TruePositives}{TruePositives + FalseNegatives}$$

Recall vs Precision: Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

bull Precision is the % of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren't helpful, however, because we missed 9 other brain tumors. We were super precise when we tried, but we didn't try hard enough.

bull Recall, or Sensitivity, provides another lens with which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
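
The brain scan numbers above can be checked directly: 100 images, 10 with tumors, and a single True guess that was correct, which gives 1 true positive, 0 false positives, and 9 false negatives. The helper function names below are chosen for this sketch.

```python
def precision(true_positives, false_positives):
    """Of everything predicted positive, how much was actually positive."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Of everything actually positive, how much did we catch."""
    return true_positives / (true_positives + false_negatives)


true_positives = 1    # the one tumor we flagged
false_positives = 0   # we never flagged a healthy scan
false_negatives = 9   # tumors we missed

p = precision(true_positives, false_positives)
r = recall(true_positives, false_negatives)
print(p, r)
```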

Regression: Predicting a continuous output (e.g. price, sales).

Regularization: Contribute a definition!

Reinforcement Learning: Training a model to maximize a reward via iterative trial and error.

Segmentation: Contribute a definition!

Specificity: In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{TrueNegatives}{TrueNegatives + FalsePositives}$$
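
To see how specificity can be gamed, the sketch below predicts "negative" for every observation in a hypothetical sample of 10 positives and 90 negatives. Specificity comes out perfect even though not a single positive case is caught. The data and helper names are assumptions for illustration.

```python
def specificity(true_negatives, false_positives):
    """Of everything actually negative, how much was predicted negative."""
    return true_negatives / (true_negatives + false_positives)


def recall(true_positives, false_negatives):
    """Of everything actually positive, how much was predicted positive."""
    return true_positives / (true_positives + false_negatives)


y_true = [True] * 10 + [False] * 90
y_pred = [False] * 100  # predict everything as negative

pairs = list(zip(y_true, y_pred))
tp = sum(1 for actual, predicted in pairs if actual and predicted)
fn = sum(1 for actual, predicted in pairs if actual and not predicted)
tn = sum(1 for actual, predicted in pairs if not actual and not predicted)
fp = sum(1 for actual, predicted in pairs if not actual and predicted)

gamed_specificity = specificity(tn, fp)
gamed_recall = recall(tp, fn)
print(gamed_specificity, gamed_recall)
```

This is why specificity is usually reported alongside recall rather than on its own.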

Supervised Learning: Training a model using a labeled dataset.

Test Set: A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set: A set of observations used to generate machine learning models.

Transfer Learning: Contribute a definition!

Type 1 Error: False Positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and the company hires them, but they are actually a poor fit.

Type 2 Error: False Negatives. The candidate was great, but the company passed on them.

Underfitting: Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem: A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning: Training a model to find patterns in an unlabeled dataset (e.g. clustering).

Validation Set: A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.
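
The rule of thumb in the validation set entry (training error falling while validation error rises) can be sketched as a simple check over per-epoch error values. The function name and the error curves are hypothetical; real training loops often use a more forgiving rule, such as waiting several epochs before stopping.

```python
def first_overfitting_epoch(train_errors, val_errors):
    """Return the first epoch where training error fell but validation error rose.

    Epochs are counted from 0. Returns None if no such epoch exists.
    """
    for epoch in range(1, len(train_errors)):
        train_improved = train_errors[epoch] < train_errors[epoch - 1]
        val_worsened = val_errors[epoch] > val_errors[epoch - 1]
        if train_improved and val_worsened:
            return epoch
    return None


# Hypothetical error curves, one value per epoch.
train_errors = [0.90, 0.60, 0.40, 0.30, 0.20]
val_errors = [0.95, 0.70, 0.55, 0.60, 0.70]

stop_at = first_overfitting_epoch(train_errors, val_errors)
print(stop_at)
```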

Variance: How tightly packed are your predictions for a particular observation relative to each other?

bull Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

bull High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.
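
As a rough sketch of the bias and variance metrics in this glossary, suppose we have several predictions for the same observation (for example, from models trained on different training sets). Bias is measured here as the average prediction minus the true value, and variance as the spread of the predictions around their own average. The numbers are made up for illustration.

```python
from statistics import mean, pvariance


def bias_and_variance(predictions, true_value):
    """Return (bias, variance) for repeated predictions of one observation."""
    bias = mean(predictions) - true_value  # average difference from the truth
    variance = pvariance(predictions)      # spread of predictions around their mean
    return bias, variance


true_value = 10.0
underfit_like = [6.0, 6.1, 5.9, 6.0]    # tightly packed but far from the truth
overfit_like = [7.0, 13.0, 8.0, 12.0]   # centered on the truth but widely spread

underfit_bias, underfit_variance = bias_and_variance(underfit_like, true_value)
overfit_bias, overfit_variance = bias_and_variance(overfit_like, true_value)
print(underfit_bias, underfit_variance)
print(overfit_bias, overfit_variance)
```

The first set shows high bias with low variance, the second low bias with high variance.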

References

CHAPTER 13

Business

bull Introduction

bull Subj_1

131 Introduction

xxx

132 Subj_1

Code

Some code

print("hello")

References

CHAPTER 14

Ethics

bull Introduction

bull Subj_1

141 Introduction

xxx

142 Subj_1

xxx

References

CHAPTER 15

Mastering The Data Science Interview

bull Introduction

bull The Coding Challenge

bull The HR Screen

bull The Technical Call

bull The Take Home Project

bull The Onsite

bull The Offer And Negotiation

151 Introduction

In 2012, Harvard Business Review announced that data scientist would be the sexiest job of the 21st century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is that most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can vary hugely. This is partly because, as an industry, there still isn't an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it's important to determine what flavor of data science appeals to you. Based on your response, what you study and what questions you'll be asked will vary. Despite the differences between these types, they generally follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

152 The Coding Challenge

Coding challenges can range from a simple FizzBuzz question to more complicated problems, like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank and Coderbyte, or even on internal company solutions.

More often than not, you'll be provided with written test cases that will tell you if you've passed or failed a question. These will typically consider both correctness and complexity (i.e. how long it took to run your code). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.
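
As a small illustration of writing your own test cases, here is one possible FizzBuzz solution together with a few hand-written checks. The exact rules and expected output format of a real challenge may differ; this sketch assumes the common version (multiples of 3 give "Fizz", multiples of 5 give "Buzz", multiples of both give "FizzBuzz").

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


# Hand-written test cases covering each branch, including an edge case
# (15 is a multiple of both 3 and 5).
assert fizzbuzz(1) == "1"
assert fizzbuzz(3) == "Fizz"
assert fizzbuzz(5) == "Buzz"
assert fizzbuzz(15) == "FizzBuzz"

print([fizzbuzz(i) for i in range(1, 16)])
```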

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the 'correct' solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally when, even after passing all the test cases, you don't get to the next stage in the interview process.

Preparation

1. Practice questions on LeetCode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first. When you hit a snag, move to a simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5-10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

153 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

154 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving dynamic programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
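
The "brute force first, then optimize" approach can be sketched on a hypothetical interview question: given a list of numbers, is there a pair that adds up to a target? The problem and function names are assumptions made for this example, not taken from any specific interview.

```python
def has_pair_with_sum_brute_force(numbers, target):
    """First pass: check every pair. Simple and correct, but O(n^2) time."""
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if numbers[i] + numbers[j] == target:
                return True
    return False


def has_pair_with_sum(numbers, target):
    """Optimized pass: remember values seen so far. O(n) time, O(n) extra space."""
    seen = set()
    for number in numbers:
        if target - number in seen:
            return True
        seen.add(number)
    return False


numbers = [3, 8, 11, 5]
print(has_pair_with_sum_brute_force(numbers, 16))  # 11 + 5
print(has_pair_with_sum(numbers, 16))
```

Keeping the brute-force version around is also useful as a reference to test the optimized version against.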

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking The Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

155 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied to what you'll be doing once you start working. They can either occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test your ability to work with ambiguity (e.g. Here's a dataset, find some insights and pitch them to business stakeholders) or focus on a more concrete deliverable (e.g. Here's some data, build a classifier).

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis (EDA). Luckily, there are some automated EDA solutions, such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution in customer revenue).
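
As a very rough sketch of a first EDA pass, the code below summarizes a single numeric column using only the Python standard library and flags a possible heavy tail when the mean is much larger than the median. The customer_revenue values, the summarize helper, and the "3 times the median" rule are all assumptions for illustration; in practice you would more likely reach for a data analysis library and look at plots.

```python
from statistics import mean, median


def summarize(values):
    """Basic summary of a numeric column that may contain missing values (None)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": median(present),
        "mean": mean(present),
        "max": max(present),
    }


# Hypothetical revenue per customer, with one missing value and two very large customers.
customer_revenue = [12, 15, 9, None, 14, 11, 950, 13, 10, 4800]

summary = summarize(customer_revenue)
# A mean far above the median hints at a heavy right tail worth investigating.
heavy_tail = summary["mean"] > 3 * summary["median"]
print(summary, heavy_tail)
```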

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or work out by looking at the answers without the questions on this GitHub repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite. The rationale is that, despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more LeetCode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

156 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which typically evaluates your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions, such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioral interview questions such as 'Tell me about yourself', 'Why this company?', and 'Tell me about a time you had to deal with a difficult colleague'.

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desk working on their assigned projects, so try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500 company, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

157 The Offer And Negotiation

Negotiating for many people may seem uncomfortable especially for those without previous industry experienceHowever the reality is that negotiating has almost no downside (as long as yoursquore polite about it) and lots of upside

Typically companies will inform you that theyrsquore planning on giving you an offer over the phone At this point it maybe tempting to commit and accept the offer on the spot Instead you should convey your excitement about the offerand ask that they give you some time to discuss it with your significant other or friend You can also be up front andtell them yoursquore still in the interview loop with a couple of other companies and that yoursquoll get back to them shortlySometimes these offers come with deadlines however these are often quite arbitrary and can be pushed by a simplerequest on your part

Your ability to negotiate ultimately rests on a variety of factors but the biggest one is optionality If you have twogreat offers in hand itrsquos much easier to negotiate because you have the optionality to walk away

When yoursquore negotiating there are various levers you can pull The three main ones are your base salary stock optionsand signingrelocation bonus Every company has a different policy which means some levers may be easier to pullthan others Generally speaking signingrelocation is the easiest to negotiate followed by stock options and then basesalary So if yoursquore in a weaker position ask for a higher signingrelocation bonus However if yoursquore in a strongposition it may be in your best interest to increase your base salary The reason being that not only will it act as ahigher multiplier when you get raises but it will also have an effect on company benefits such as 401k matching andemployee stock purchase plans That said each situation is different so make sure to reprioritize what you negotiate asnecessary

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight on what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.

CHAPTER 16

Learning How To Learn

bull Introduction

bull Upgrading Your Learning Toolbox

bull Common Learning Traps

bull Steps For Effective Learning

16.1 Introduction

More than any other skill, the ability to learn and retain information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall: This strategy consists of quizzing yourself on topics you've been learning, with the most common way of doing this being flashcards. One study showed a retention difference of 46% between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. A recommended spaced repetition schedule is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
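The spaced repetition schedule above can be sketched in a few lines of Python. This is a minimal illustration, not a full flashcard system: it assumes the intervals are counted from the day you first studied the material (the text doesn't say whether they are cumulative), and the function name and example date are hypothetical.

```python
from datetime import date, timedelta

# Review offsets in days, taken from the schedule in the text.
REVIEW_OFFSETS_DAYS = (1, 3, 7, 21)


def review_dates(first_studied, offsets=REVIEW_OFFSETS_DAYS):
    """Return the dates on which a topic should be reviewed.

    Assumption: each offset is measured from `first_studied`,
    not from the previous review.
    """
    return [first_studied + timedelta(days=days) for days in offsets]


for review in review_dates(date(2019, 3, 19)):
    print(review.isoformat())
```

If you'd rather measure each interval from the previous review, accumulate the offsets before adding them to the start date.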

Chunking: This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, the computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving: To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type left students unable to tell the difference between problems later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another. For example, with Chemistry and Biology, you'll be able to note both similarities and differences.

Deliberate Practice: This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that deliberate practice is easier in some subjects than in others, the gold standard of deliberate practice generally consists of the following:

bull A feedback loop in the form of a competition or a test

bull A teacher that can guide you

bull Uniquely tailored practice regimen

bull Outside of your comfort zone

bull Has a concrete goal or performance

bull When practicing you need to concentrate fully

bull Builds or modifies skills

bull Has a known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher: An experienced teacher can mean the difference between breaking through and breaking down. As such, it's important to consider the following when looking for a teacher:

bull They are someone who can keep you up to date on specific progress

bull Provides practice exercises

bull Directs attention to what aspects of learning you should be paying attention to

bull Helps develop correct mental representations

bull Provides feedback on what errors you're making

bull Gives you an aspirational model for what good performance looks like

Dynamic Testing: As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro technique: This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes, during which you focus on one single thing. Once the timer is up, you take a 5- or 10-minute break before starting again.
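As a rough illustration of the focus/break rhythm described above, the sketch below lays out a Pomodoro session as a list of time blocks. It only plans the session rather than running a real timer, and the function name, start time, and cycle count are hypothetical choices made for the example.

```python
from datetime import datetime, timedelta


def pomodoro_schedule(start, cycles, focus_minutes=25, break_minutes=5):
    """Return (label, begin, end) tuples alternating focus and break blocks."""
    blocks = []
    current = start
    for _ in range(cycles):
        for label, minutes in (("focus", focus_minutes), ("break", break_minutes)):
            end = current + timedelta(minutes=minutes)
            blocks.append((label, current, end))
            current = end
    return blocks


# Two 25/5 cycles starting at 09:00 (hypothetical start time).
for label, begin, end in pomodoro_schedule(datetime(2019, 3, 19, 9, 0), cycles=2):
    print(f"{label:5} {begin:%H:%M} - {end:%H:%M}")
```

Passing focus_minutes=50 and break_minutes=10 gives the longer variant mentioned in the text.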

16.3 Common Learning Traps

Authority Trap: The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.

Confirmation Bias: Only seeking out knowledge that confirms existing beliefs and disregarding counterevidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect: Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung: Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion: Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them, and then, when tested, fail to get any of the answers correct. Thus, having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking: Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy, agitated, or upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

bull https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7

bull https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn

bull https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

bull https://www.coursera.org/learn/learning-how-to-learn

CHAPTER 17

Communication

bull Introduction

bull Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 18

Product

bull Introduction

bull Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References

CHAPTER 19

Stakeholder Management

bull Introduction

bull Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

bull Agriculture

bull Art

bull Biology

bull Chemistry/Materials Science

bull Climate/Weather

bull Complex Networks

bull Computer Networks

bull Data Challenges

bull Earth Science

bull Economics

bull Education

bull Energy

bull Finance

bull GIS

bull Government

bull Healthcare

bull Image Processing

bull Machine Learning

bull Museums

bull Music

bull Natural Language

bull Neuroscience

bull Physics

bull Psychology/Cognition

bull Public Domains

bull Search Engines

bull Social Networks

bull Social Sciences

bull Software

bull Sports

bull Time Series

bull Transportation

20.1 Agriculture

bull US Department of Agriculture's PLANTS Database

bull US Department of Agriculture's Nutrient Database

20.2 Art

bull Google's Quick Draw Sketch Dataset

20.3 Biology

bull 1000 Genomes

bull American Gut (Microbiome Project)

bull Broad Bioimage Benchmark Collection (BBBC)

bull Broad Cancer Cell Line Encyclopedia (CCLE)

bull Cell Image Library

bull Complete Genomics Public Data

bull EBI ArrayExpress

bull EBI Protein Data Bank in Europe

bull Electron Microscopy Pilot Image Archive (EMPIAR)

bull ENCODE project

bull Ensembl Genomes

bull Gene Expression Omnibus (GEO)

bull Gene Ontology (GO)

bull Global Biotic Interactions (GloBI)

bull Harvard Medical School (HMS) LINCS Project

bull Human Genome Diversity Project

bull Human Microbiome Project (HMP)

bull ICOS PSP Benchmark

bull International HapMap Project

bull Journal of Cell Biology DataViewer

bull MIT Cancer Genomics Data

bull NCBI Proteins

bull NCBI Taxonomy

bull NCI Genomic Data Commons

bull NIH Microarray data or FTP (see FTP link on RAW)

bull OpenSNP genotypes data

bull Pathguid - Protein-Protein Interactions Catalog

bull Protein Data Bank

bull Psychiatric Genomics Consortium

bull PubChem Project

bull PubGene (now Coremine Medical)

bull Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

bull Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

bull Sequence Read Archive (SRA)

bull Stanford Microarray Data

bull Stowers Institute Original Data Repository

bull Systems Science of Biological Dynamics (SSBD) Database

bull The Cancer Genome Atlas (TCGA) available via Broad GDAC

bull The Catalogue of Life

bull The Personal Genome Project or PGP

bull UCSC Public Data

bull UniGene

bull Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

bull NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

bull Open Quantum Materials Database

bull Citrination Public Datasets

bull Khazana Project

20.5 Climate/Weather

bull Actuaries Climate Index

bull Australian Weather

bull Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system

bull Brazilian Weather - Historical data (In Portuguese)

bull Canadian Meteorological Centre

bull Climate Data from UEA (updated monthly)

bull European Climate Assessment & Dataset

bull Global Climate Data Since 1929

bull NASA Global Imagery Browse Services

bull NOAA Bering Sea Climate

bull NOAA Climate Datasets

bull NOAA Realtime Weather Models

bull NOAA SURFRAD Meteorology and Radiation Datasets

bull The World Bank Open Data Resources for Climate Change

bull UEA Climatic Research Unit

bull WorldClim - Global Climate Data

bull WU Historical Weather Worldwide

20.6 Complex Networks

bull AMiner Citation Network Dataset

bull CrossRef DOI URLs

bull DBLP Citation dataset

bull DIMACS Road Networks Collection

bull NBER Patent Citations

bull Network Repository with Interactive Exploratory Analysis Tools

bull NIST complex networks data collection

bull Protein-protein interaction network

bull PyPI and Maven Dependency Network

bull Scopus Citation Database

bull Small Network Data

bull Stanford GraphBase (Steven Skiena)

bull Stanford Large Network Dataset Collection

bull Stanford Longitudinal Network Data Sources

bull The Koblenz Network Collection

bull The Laboratory for Web Algorithmics (UNIMI)

bull The Nexus Network Repository

bull UCI Network Data Repository

bull UFL sparse matrix collection

bull WSU Graph Database

20.7 Computer Networks

bull 3.5B Web Pages from CommonCrawl 2012

bull 53.5B Web clicks of 100K users in Indiana Univ.

bull CAIDA Internet Datasets

bull ClueWeb09 - 1B web pages

bull ClueWeb12 - 733M web pages

bull CommonCrawl Web Data over 7 years

bull CRAWDAD Wireless datasets from Dartmouth Univ

bull Criteo click-through data

bull OONI Open Observatory of Network Interference - Internet censorship data

bull Open Mobile Data by MobiPerf

bull Rapid7 Sonar Internet Scans

bull UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

bull Bruteforce Database

bull Challenges in Machine Learning

bull CrowdANALYTIX dataX

bull D4D Challenge of Orange

bull DrivenData Competitions for Social Good

bull ICWSM Data Challenge (since 2009)

bull Kaggle Competition Data

bull KDD Cup by Tencent 2012

bull Localytics Data Visualization Challenge

bull Netflix Prize

bull Space Apps Challenge

bull Telecom Italia Big Data Challenge

bull TravisTorrent Dataset - MSR'2017 Mining Challenge

bull Yelp Dataset Challenge

20.9 Earth Science

bull AQUASTAT - Global water resources and uses

bull BODC - marine data of ~22K vars

bull Earth Models

bull EOSDIS - NASA's earth observing system data

bull Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3

bull Marinexplore - Open Oceanographic Data

bull Smithsonian Institution Global Volcano and Eruption Database

bull USGS Earthquake Archives

20.10 Economics

bull American Economic Association (AEA)

bull EconData from UMD

bull Economic Freedom of the World Data

bull Historical MacroEconomic Statistics

bull International Economics Database and various data tools

bull International Trade Statistics

bull Internet Product Code Database

bull Joint External Debt Data Hub

bull Jon Haveman International Trade Data Links

bull OpenCorporates Database of Companies in the World

bull Our World in Data

bull SciencesPo World Trade Gravity Datasets

bull The Atlas of Economic Complexity

bull The Center for International Data

bull The Observatory of Economic Complexity

bull UN Commodity Trade Statistics

bull UN Human Development Reports

20.11 Education

bull College Scorecard Data

bull Student Data from Free Code Camp

20.12 Energy

bull AMPds

bull BLUEd

bull COMBED

bull Dataport

bull DRED

bull ECO

bull EIA

bull HES - Household Electricity Study UK

bull HFED

bull iAWE

bull PLAID - the Plug Load Appliance Identification Dataset

bull REDD

bull Tracebase

bull UK-DALE - UK Domestic Appliance-Level Electricity

bull WHITED

20.13 Finance

bull CBOE Futures Exchange

bull Google Finance

bull Google Trends

bull NASDAQ

bull NYSE Market Data (see FTP link on RAW)

bull OANDA

bull OSU Financial data

bull Quandl

bull St Louis Federal

bull Yahoo Finance

20.14 GIS

bull ArcGIS Open Data portal

bull Cambridge MA US GIS data on GitHub

bull Factual Global Location Data

bull Geo Spatial Data from ASU

bull Geo Wiki Project - Citizen-driven Environmental Monitoring

bull GeoFabrik - OSM data extracted to a variety of formats and areas

bull GeoNames Worldwide

bull Global Administrative Areas Database (GADM)

bull Homeland Infrastructure Foundation-Level Data

bull Landsat 8 on AWS

bull List of all countries in all languages

bull National Weather Service GIS Data Portal

bull Natural Earth - vectors and rasters of the world

bull OpenAddresses

bull OpenStreetMap (OSM)

bull Pleiades - Gazetteer and graph of ancient places

bull Reverse Geocoder using OSM data & additional high-resolution data files

bull TIGER/Line - US boundaries and roads

bull TwoFishes - Foursquare's coarse geocoder

bull TZ Timezones shapefiles

bull UN Environmental Data

bull World boundaries from the US Department of State

bull World countries in multiple formats

20.15 Government

bull A list of cities and countries contributed by community

bull Open Data for Africa

bull OpenDataSoft's list of 1,600 open data

20.16 Healthcare

bull EHDP Large Health Data Sets

bull Gapminder World demographic databases

bull Medicare Coverage Database (MCD) US

bull Medicare Data Engine of medicare.gov Data

bull Medicare Data File

bull MeSH, the vocabulary thesaurus used for indexing articles for PubMed

bull Number of Ebola Cases and Deaths in Affected Countries (2014)

bull Open-ODS (structure of the UK NHS)

bull OpenPaymentsData Healthcare financial relationship data

bull The Cancer Genome Atlas project (TCGA) and BigQuery table

bull World Health Organization Global Health Observatory

20.17 Image Processing

bull 10k US Adult Faces Database

bull 2GB of Photos of Cats or Archive version

bull Adience Unfiltered faces for gender and age classification

bull Affective Image Classification

bull Animals with attributes

bull Caltech Pedestrian Detection Benchmark

bull Chars74K dataset Character Recognition in Natural Images (both English and Kannada are available)

bull Face Recognition Benchmark

bull GDXray X-ray images for X-ray testing and Computer Vision

bull ImageNet (in WordNet hierarchy)

bull Indoor Scene Recognition

bull International Affective Picture System, UFL

bull Massive Visual Memory Stimuli, MIT

bull MNIST database of handwritten digits, near 1 million examples

bull Several Shape-from-Silhouette Datasets

bull Stanford Dogs Dataset

bull SUN database MIT

bull The Action Similarity Labeling (ASLAN) Challenge

bull The Oxford-IIIT Pet Dataset

bull Violent-Flows - Crowd Violence Non-violence Database and benchmark

bull Visual genome

bull YouTube Faces Database

20.18 Machine Learning

bull Context-aware data sets from five domains

bull Delve Datasets for classification and regression (Univ of Toronto)

bull Discogs Monthly Data

bull eBay Online Auctions (2012)

bull IMDb Database

bull Keel Repository for classification, regression, and time series

bull Labeled Faces in the Wild (LFW)

bull Lending Club Loan Data

bull Machine Learning Data Set Repository

bull Million Song Dataset

bull More Song Datasets

bull MovieLens Data Sets

bull New Yorker caption contest ratings

bull RDataMining - "R and Data Mining" ebook data

bull Registered Meteorites on Earth

bull Restaurants Health Score Data in San Francisco

bull UCI Machine Learning Repository

bull Yahoo Ratings and Classification Data

bull Youtube 8m

20.19 Museums

bull Canada Science and Technology Museums Corporation's Open Data

bull Cooper-Hewitt's Collection Database

bull Minneapolis Institute of Arts metadata

bull Natural History Museum (London) Data Portal

bull Rijksmuseum Historical Art Collection

bull Tate Collection metadata

bull The Getty vocabularies

20.20 Music

bull Nottingham Folk Songs

bull Bach 10

20.21 Natural Language

bull Automatic Keyphrase Extraction

bull Blogger Corpus

bull CLiPS Stylometry Investigation Corpus

bull ClueWeb09 FACC

bull ClueWeb12 FACC

bull DBpedia - 4.58M things with 583M facts

bull Flickr Personal Taxonomies

bull Freebase.com of people, places, and things

bull Google Books Ngrams (2.2TB)

bull Google MC-AFP generated based on the public available Gigaword dataset using Paragraph Vectors

bull Google Web 5gram (1TB, 2006)

bull Gutenberg eBooks List

bull Hansards text chunks of Canadian Parliament

bull Machine Comprehension Test (MCTest) of text from Microsoft Research

bull Machine Translation of European languages

bull Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

bull Multi-Domain Sentiment Dataset (version 2.0)

bull Open Multilingual Wordnet

bull Personae Corpus

bull SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

bull SMS Spam Collection in English

bull Universal Dependencies

bull USENET postings corpus of 2005~2011

bull Webhose - News/Blogs in multiple languages

bull Wikidata - Wikipedia databases

bull Wikipedia Links data - 40 Million Entities in Context

bull WordNet databases and tools

20.22 Neuroscience

bull Allen Institute Datasets

bull Brain Catalogue

bull Brainomics

bull CodeNeuro Datasets

bull Collaborative Research in Computational Neuroscience (CRCNS)

bull FCP-INDI

bull Human Connectome Project

bull NDAR

bull NeuroData

bull Neuroelectro

bull NIMH Data Archive

bull OASIS

bull OpenfMRI

bull Study Forrest

20.23 Physics

bull CERN Open Data Portal

bull Crystallography Open Database

bull NASA Exoplanet Archive

bull NSSDC (NASA) data of 550 space spacecraft

bull Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

bull OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

bull Amazon

bull Archive-it from Internet Archive

bull Archive.org Datasets

bull CMU JASA data archive

bull CMU StatLab collections

bull DataWorld

bull Data360

bull Datamob.org

bull Google

bull Infochimps

bull KDNuggets Data Collections

bull Microsoft Azure Data Market Free DataSets

bull Microsoft Data Science for Research

bull Numbray

bull Open Library Data Dumps

bull Reddit Datasets

bull RevolutionAnalytics Collection

bull Sample R data sets

bull Stats4Stem R data sets

bull StatSci.org

bull The Washington Post List

bull UCLA SOCR data collection

bull UFO Reports

bull Wikileaks 9/11 pager intercepts

bull Yahoo Webscope

20.26 Search Engines

bull Academic Torrents of data sharing from UMB

bull Datahub.io

bull DataMarket (Qlik)

bull Harvard Dataverse Network of scientific data

bull ICPSR (UMICH)

bull Institute of Education Sciences

bull National Technical Reports Library

bull Open Data Certificates (beta)

bull OpenDataNetwork - A search engine of all Socrata powered data portals

bull Statista.com - Statistics and studies

bull Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

bull 72 hours gamergate Twitter Scrape

bull Ancestry.com Forum Dataset over 10 years

bull Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

bull CMU Enron Email of 150 users

bull EDRM Enron EMail of 151 users hosted on S3

bull Facebook Data Scrape (2005)

bull Facebook Social Networks from LAW (since 2007)

bull Foursquare from UMN/Sarwat (2013)

bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull Youtube Video Social Graph in 2007 and 2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive

bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (NG-DAIN)

bull Open Crime and Policing Data in England, Wales, and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1 from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database

bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams, and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks, port calls, and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport, airline, and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome machine learning.

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python

bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF).

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms.

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox.

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit. HTK is a portable toolkit for building and manipulating hidden Markov models.

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java, and MATLAB interfaces and supports Windows, Linux, Android, and Mac OS.

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality, with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK - A unified deep-learning toolkit by Microsoft Research that describes neural networks as a series of computational steps via a directed graph

bull CUDA - A fast C++/CUDA implementation of convolutional neural networks [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework, with fewer than 1000 lines of core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state-of-the-art machine learning easy to work with and integrate into existing applications.

bull Distributed Machine Learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library that works well with networks whose structure changes for every training instance. Written in C++ with bindings in Python.

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and can process data in batch, online, and distributed modes.

bull LightGBM - A gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs.

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization, and storage.

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP.

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC - A fast parallel implementation of Connectionist Temporal Classification (CTC) on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data and other Natural Language Processing tasks

bull CRFsuite - An implementation of Conditional Random Fields (CRFs) for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user-defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source C++ machine learning library designed for real-time gesture recognition

214 Common Lisp

General-Purpose Machine Learning

bull mgl - Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

215 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression, and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based, R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

216 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer - A stemming implementation in Elixir

217 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang
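
For readers unfamiliar with the Map Reduce model that Disco implements, the following is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. It is plain Python, not Disco's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

In a real distributed system the map and reduce calls run on different machines and the shuffle moves data across the network; the sketch only shows the data flow.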

218 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression

General-Purpose Machine Learning

bull gago - Multi-population, flexible, parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go

bull go-ml - Linear / logistic regression, neural networks, collaborative filtering, and Gaussian multivariate distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in Go / golang

bull gobrain - Neural networks written in Go

bull GoNN - GoNN is an implementation of neural networks in Go, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for the MXNet c_predict_api to do inference with a pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go / golang

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

219 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

2110 Java

Natural Language Processing

bull Cortical.io - An API performing complex NLP operations as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford English Tokenizer - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Tokens Regex

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text, starting with seed sets, using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK - ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA

bull Apache cTAKES - An open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license.

bull CogcompNLP - NLP libraries developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc., and illinois-edison, a library for feature extraction from illinois-core-utilities data structures, along with many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark, or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch, or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka, with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams, with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - Statistics, data mining, and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML - Flexible, scalable machine learning (ML) language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application.

Speech Recognition

bull CMU Sphinx - Open source toolkit for speech recognition; a speech recognition library based purely on Java

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis, and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

2111 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in JavaScript and CoffeeScript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction, and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3D plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - Customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - Library written in vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnetjs - ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means, and agglomerative clustering

bull Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js

bull Kmeans.js - Simple Javascript implementation of the k-means algorithm, for Node.js and the browser

bull LDA.js - LDA topic modeling for Node.js

bull Learning.js - Javascript implementation of logistic regression / C4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration; just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for Node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for Node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has advantages on large datasets and with multi-threaded training.

bull kalman - Kalman filter for Javascript

bull shaman - Node.js library with support for both simple and multiple linear regression

bull ml.js - Machine learning and numerical analysis tools for Node.js and the browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

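bull simple-statistics - A JavaScript implementation of descriptive, regression, and inference statistics, designed to work in browsers as well as in Node.js

bull regression-js - A JavaScript library containing a collection of least squares fitting methods for finding a trend in a set of data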

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance
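
To make "great circle distance" concrete, here is a small sketch of the haversine formula, one common way to compute it. It is written in Python and is not the GreatCircle library's code; the function name `great_circle_km` and the mean Earth radius constant are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; real libraries may use another value


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Two antipodal points on the equator are half the circumference apart
print(great_circle_km(0.0, 0.0, 0.0, 180.0))
```

The formula treats the Earth as a perfect sphere, so results can differ slightly from ellipsoidal methods.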

2112 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression, so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models - A Julia package for fitting mixed-effects models

bull Simple MCMC - Basic MCMC sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba - Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata SAS and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

2113 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy.

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version.

bull graph - Graph package for Torch

bull randomkit - Numpy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - Framework for Torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for the nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns.

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua and then optimizing them with OpenGM.

bull spaghetti - Spaghetti (sparse linear) module for torch7 by MichaelMathieu

bull LuaSHKit - A Lua wrapper around the locality-sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images.

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos.

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms.

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules.

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

2114 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform, designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning, with PDFs, lectures, and code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB.

Data Analysis / Data Visualization

bull matlab_gbl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

2115 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross platform wrapper of OpenCV which can be compiled in Mono to run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT, and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing, and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models, and general algorithms such as Ransac, Cross-validation, and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.

bull DiffSharp - An automatic differentiation (AD) library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation, or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering, and everyday use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.

bull Sho - An environment built to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

2116 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples with a trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift.

bull BPN-NeuralNetwork This network can be used in products recommendation user behavior analysis datamining and data analysis

bull Multi-Perceptron-NeuralNetwork and designed unlimited-hidden-layers

bull KRHebbian-Algorithm in neural network of Machine Learning

bull KRKmeans-Algorithm - It implemented K-Means the clustering and classification algorithm It could be usedin data mining and image compression

bull KRFuzzyCMeans-Algorithm the fuzzy clustering classification algorithm on Machine Learning It could beused in data mining and image compression
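
Several entries above implement K-Means. As a rough illustration of what that algorithm does, here is a minimal sketch in plain Python. It is not taken from any of the listed libraries; the function name `kmeans` and the choice to have the caller pass the initial centroids are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update.

    `centroids` holds the caller-supplied starting positions (an
    assumption of this sketch; libraries usually pick them randomly).
    Returns the final centroids and one cluster label per point.
    """
    centroids = [tuple(c) for c in centroids]
    k = len(centroids)
    for _ in range(max_iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    labels = [min(range(k), key=lambda i: squared_distance(p, centroids[i])) for p in points]
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

The result depends on the starting centroids, which is why real implementations typically run several random restarts.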

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression
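
For readers unfamiliar with predictions built from a linear regression, the idea can be sketched in a few lines of Python. This is only an illustration of ordinary least squares for one input variable, not the code of any library listed here; `fit_line` and `predict` are hypothetical names.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at a new input value."""
    return slope * x + intercept


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(predict(4, slope, intercept))
```

The fit is undefined when all `xs` are identical (the variance term is zero), so a real implementation would guard against that case.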

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others

bull Quepy - A python framework to transform natural language questions to queries in a database query language

bull TextBlob - Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, and GIZA++ alignments

bull python-ucto

bull python-frog

bull python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - higher-level NLP built on spaCy

bull stanford-corenlp-python - Python wrapper for Stanford CoreNLP
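
Several of the string-matching entries above (Distance, editdistance, jellyfish) compute Hamming or Levenshtein distances. The following is a minimal pure-Python sketch of both measures, written for illustration only; the function names are assumptions and do not mirror the API of any listed package.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b` (row-by-row dynamic
    programming, keeping only the previous row in memory)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

The dedicated libraries are usually implemented in C and are much faster; this version is only meant to show what is being computed.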

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning

bull machine learning - Automated build consisting of a web-interface and a set of programmatic interfaces, with storage in a NoSQL datastore

bull XGBoost Library

bull Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - Machine learning library implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 - A Machine Learning library based on Theano

bull keras - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks - Implementation of image-to-image (pix2pix) translation from the paper by Isola et al [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Focuses on the Bayesian nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing GPU-meta-programming code generating array oriented optimizing math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for my book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning

bull pgmpy - A python library for working with Probabilistic Graphical Models

bull DIGITS - A web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - an IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides better user experience

bull rgf_python Library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more; it's easy to use and easy to learn for beginners

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics science and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Compiler to LLVM aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual, intuitive, and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has a Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg–Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in python without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspel - raspell is an interface binding for ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for ruby, especially for plotting roc curves into svg files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community based data collection, packed in gem. Get list of pretty much anything (stop words, countries, non words) in txt, json or hash. Demo/Search for a list

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license

bull rustlearn - a machine learning framework featuring logistic regression, support vector machines, decision trees and random forests

bull rusty-machine - a pure-rust machine learning library

bull leaf - open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz Regularization for semiparametric additive hazards regression

bull arules - arules Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf Big Random Forests Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta A wrapper algorithm for all-relevant feature selection

bull bst - bst Gradient Boosting

bull C50 - C50 C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training Unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn Classification, regression, feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist Rule- and Instance-Based Regression Modeling

bull e1071 - e1071 Misc Functions of the Department of Statistics, TU Wien

bull earth - earth Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn Data sets, functions and examples from the book "The Elements of Statistical Learning, Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models

bull forecastHybrid - forecastHybrid Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc Flexible procedures for clustering

bull frbs - frbs Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS Boosting Methods for GAMLSS

bull gbm - gbm Generalized Boosted Regression Models

bull glmnet - glmnet Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deeplearning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred Improved Predictors

bull kernlab - kernlab Kernel-based Machine Learning Lab

bull klaR - klaR Classification and visualization

bull lars - lars Least Angle Regression, Lasso and Forward Stagewise

bull lasso2 - lasso2 L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree Mapping, pruning, and graphing tree models

bull mboost - mboost Model-Based Boosting

bull medley - medley Blending regression models using a greedy stepwise approach

bull mlr - mlr Machine Learning in R

bull mvpart - mvpart Multivariate partitioning

bull ncvreg - ncvreg Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree Oblique Trees for Classification Data

bull pamr - pamr Pam prediction analysis for microarrays

bull party - party A Laboratory for Recursive Partytioning

bull partykit - partykit A Toolkit for Recursive Partytioning

bull penalized - Penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM Feature Selection SVM using penalty functions

bull quantregForest - quantregForest Quantile Regression Forests

bull randomForest - randomForest Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle Graphical user interface for data mining in R

bull rda - rda Shrunken Centroids Regularized Discriminant Analysis

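bull rdetools - rdetools Relevant Dimension Estimation (RDE) in Feature Spaces

bull REEMtree - REEMtree Regression Trees with Random Effects for Longitudinal (Panel) Data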

bull relaxo - relaxo Relaxed Lasso

bull rgenoud - rgenoud R version of GENetic Optimization Using Derivatives

bull rgp - rgp R genetic programming framework

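bull Rmalschains - Rmalschains Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R

bull rminer - rminer Simpler use of data mining methods (e.g. NN and SVM) in classification and regression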

bull ROCR - ROCR Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart Recursive Partitioning and Regression Trees

bull RPMM - RPMM Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka RWeka interface

bull RXshrink - RXshrink Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath The SVM Path algorithm

bull tgp - tgp Bayesian treed Gaussian process models

bull tree - tree Classification and regression trees

bull varSelRF - varSelRF Variable selection using random forests

bull XGBoostR Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis / Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis / Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE! Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference

Data Analysis / Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data. A unix filter for algebird-powered aggregation

bull PredictionIO - PredictionIO, a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe - Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

21.25 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression, and Ridge Regression

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization Training Techniques

ndash Unsupervised Generative Models

ndash Image Segmentation Object Detection

ndash Image Video

ndash Natural Language Processing

ndash Speech Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

22.1 Machine Learning

Be the first to contribute

22.2 Deep Learning

Forked from terryum's awesome deep learning papers

22.2.1 Understanding

bull Distilling the knowledge in a neural network (2015) G Hinton et al [pdf]

bull Deep neural networks are easily fooled: High confidence predictions for unrecognizable images (2015) A Nguyen et al [pdf]

bull How transferable are features in deep neural networks? (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf An astounding baseline for recognition (2014) A Razavian et al [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014) M Oquab et al [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

22.2.2 Optimization Training Techniques

bull Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015) S Ioffe and C Szegedy [pdf]

bull Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015) K He et al [pdf]

bull Dropout A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]
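
To give a concrete sense of one technique from this list, the update rule from the Adam paper can be sketched in plain Python for a single scalar parameter. This is an illustrative sketch, not reference code from the paper: the function names are hypothetical, and the learning rate of 0.1 is chosen for the toy problem (the paper's suggested default is 0.001).

```python
import math


def adam_step(x, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter x at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # correct the bias toward zero
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


def minimize(grad_fn, x0, steps):
    """Run `steps` Adam updates starting from x0."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        x, m, v = adam_step(x, grad_fn(x), m, v, t)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
print(minimize(lambda x: 2 * (x - 3), x0=0.0, steps=20))
```

Because of the bias correction, the very first step has magnitude close to `lr` regardless of the gradient's scale, which is one reason Adam is relatively insensitive to how the loss is scaled.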

22.2.3 Unsupervised Generative Models

bull Pixel recurrent neural networks (2016) A Oord et al [pdf]

bull Improved techniques for training GANs (2016) T Salimans et al [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015) A Radford et al [pdf]

bull DRAW A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

2224 Image Segmentation Object Detection

bull You only look once Unified real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]


bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

2225 Image Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

2226 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]


bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

2227 Speech Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

2228 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

2229 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]


bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22210 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009) H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]


CHAPTER 23

Other Content

Books, blogs, courses, and more, forked from josephmisiti's awesome machine learning

bull Blogs

ndash Data Science

ndash Machine learning

ndash Math

bull Books

ndash Machine learning

ndash Deep learning

ndash Probability & Statistics

ndash Linear Algebra

bull Courses

bull Podcasts

bull Tutorials

231 Blogs

2311 Data Science

bull https://jeremykun.com

bull http://iamtrask.github.io

bull http://blog.explainmydata.com


bull http://andrewgelman.com

bull http://simplystatistics.org

bull http://www.evanmiller.org

bull http://jakevdp.github.io

bull http://blog.yhat.com

bull http://wesmckinney.com

bull http://www.overkillanalytics.net

bull http://newton.cx/~peter

bull http://mbakker7.github.io/exploratory_computing_with_python

bull https://sebastianraschka.com/blog/index.html

bull http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

bull http://colah.github.io

bull http://www.thomasdimson.com

bull http://blog.smellthedata.com

bull https://sebastianraschka.com

bull http://dogdogfish.com

bull http://www.johnmyleswhite.com

bull http://drewconway.com/zia

bull http://bugra.github.io

bull http://opendata.cern.ch

bull https://alexanderetz.com

bull http://www.sumsar.net

bull https://www.countbayesie.com

bull http://blog.kaggle.com

bull http://www.danvk.org

bull http://hunch.net

bull http://www.randalolson.com/blog

bull https://www.johndcook.com/blog/r_language_for_programmers

bull http://www.dataschool.io

2312 Machine learning

bull OpenAI

bull Distill

bull Andrej Karpathy Blog

bull Colah's Blog

bull WildML


bull FastML

bull TheMorningPaper

2313 Math

bull http://www.sumsar.net

bull http://allendowney.blogspot.ca

bull https://healthyalgorithms.com

bull https://petewarden.com

bull http://mrtz.org/blog

232 Books

2321 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory Inference and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and SVN Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch


bull Bayesian Modeling Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning Book+MatlabToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

2322 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

bull NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

2323 Probability & Statistics

bull Think Stats - Book + Python Code

bull From Algorithms to Z-Scores - Book

bull The Art of R Programming

bull Introduction to statistical thought

bull Basic Probability Theory

bull Introduction to probability - By Dartmouth College

bull Principle of Uncertainty

bull Probability & Statistics Cookbook

bull Advanced Data Analysis From An Elementary Point of View

bull Introduction to Probability - Book and course by MIT

bull The Elements of Statistical Learning Data Mining Inference and Prediction - Book

bull An Introduction to Statistical Learning with Applications in R - Book

bull Learning Statistics Using R

bull Introduction to Probability and Statistics Using R - Book

bull Advanced R Programming - Book

bull Practical Regression and Anova using R - Book

bull R practicals - Book

bull The R Inferno - Book


2324 Linear Algebra

bull Linear Algebra Done Wrong

bull Linear Algebra Theory and Applications

bull Convex Optimization

bull Applied Numerical Computing

bull Applied Numerical Linear Algebra

233 Courses

bull CS231n Convolutional Neural Networks for Visual Recognition Stanford University

bull CS224d Deep Learning for Natural Language Processing Stanford University

bull Oxford Deep NLP 2017 Deep Learning for Natural Language Processing University of Oxford

bull Artificial Intelligence (Columbia University) - free

bull Machine Learning (Columbia University) - free

bull Machine Learning (Stanford University) - free

bull Neural Networks for Machine Learning (University of Toronto) - free

bull Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach, Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering & Retrieval, Machine Learning: Recommender Systems & Dimensionality Reduction, Machine Learning Capstone: An Intelligent Application with Deep Learning; free

bull Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

bull Learning from Data (by Yaser S Abu-Mostafa Caltech) - Lecture videos available

234 Podcasts

bull The O'Reilly Data Show

bull Partially Derivative

bull The Talking Machines

bull The Data Skeptic

bull Linear Digressions

bull Data Stories

bull Learning Machines 101

bull Not So Standard Deviations

bull TWIMLAI


235 Tutorials

Be the first to contribute


CHAPTER 24

Contribute

Become a contributor! Check out our GitHub for more information.


CHAPTER 1

Bash

bull Introduction

bull Subj_1

11 Introduction

xx

12 Subj_1

Code

Some code

print("hello")

References


CHAPTER 2

Big Data

bull Introduction

bull Subj_1

https://www.quora.com/What-is-a-Hadoop-ecosystem

21 Introduction

The tooling and methodologies for dealing with the phenomenon of big data were originally created in response to the exponential growth in data generation (TODO: INSERT quote about data generation). When talking about big data, we mean (insert what we mean).

We'll be focusing exclusively on the Apache Hadoop ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims.


Why It's Important

While a data scientist may not be directly involved in managing big data and performing Extract, Transform, and Load (ETL) tasks, they'll likely be responsible, at a minimum, for querying the sources where this data is stored. Ideally, a data scientist should be familiar with how this data is stored, how to query it, and how to transform it (e.g. Spark, Hadoop, and Hive are three key technologies to be familiar with).

22 Subj_1

Code

Some code

print("hello")

References


CHAPTER 3

Databases

bull Introduction

bull Relational

ndash Types

ndash Advantages

ndash Disadvantages

bull Non-relational/NoSQL

ndash Types

ndash Advantages

ndash Disadvantages

bull Terminology

31 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing, and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds there are a variety of database sub-types, which the following sections explore in detail.

Why It's Important

Oftentimes a data scientist will have to work with a database in order to retrieve the data they need to build their models or to do data analysis. Thus, it's important that a data scientist familiarize themselves with the different types of databases and how to query them.


32 Relational

A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes) and rows that represent the data within the table. You can think of a column as the header indicating the name of the data being stored, and a row as the actual data points that are stored in the database. Oftentimes these rows will have a unique identifier associated with them, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.

To get data out of a relational database you would use SQL, with different relational databases using slightly different versions of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a Transaction. Transactions in relational databases represent a unit of work performed against the database.

An important concept with transactions in relational databases is maintaining ACID, which stands for:

bull Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails, then the whole transaction fails, leaving the database unchanged.

bull Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.

bull Isolation: Since transactions are often executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.

bull Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
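As a rough illustration of atomicity, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The accounts table, its columns, and the transfer helper are hypothetical names chosen for this example, and SQLite stands in for whichever relational database you actually use.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move `amount` between two accounts as a single transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back
        # every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # second update violates CHECK
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the first UPDATE has already run when the second one violates the constraint, yet the rollback leaves both balances as they were after the successful transfer.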

321 Types

While there are many different kinds of relational databases the most popular are

bull PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.

bull Oracle: A multi-model database management system that is produced and marketed by Oracle.

bull MySQL: An open-source relational database management system with a proprietary paid version available for additional functionality.

bull Microsoft SQL Server: A relational database management system developed by Microsoft.

bull MariaDB: A MySQL-compatible database engine forked from MySQL.

322 Advantages

bull SQL standards are well defined and commonly accepted

bull Easy to categorize and store data that can later be queried

bull Simple to understand since the table structure and relationships are intuitive to most users

bull Data integrity through strong data typing and validity checks to make sure that data falls within an acceptable range

323 Disadvantages

bull Poor performance with unstructured data types due to schema and type constraints


bull Can be slow and not scalable compared to NoSQL

bull Unable to map certain kinds of data such as graphs

33 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Additionally, scalability, multi-structured data, and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

331 Types

bull Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to store user session data; examples include Redis and Amazon Dynamo.

bull Document Store: Similar to a key-value store, except that the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.

bull Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. These can be used in content management systems; examples include Cassandra and Apache HBase.

bull Graph Store: These are based on the Entity - Attribute - Value model. Entities will have associated attributes and subsequent values when data is inserted. Nodes store data about each entity along with the relationships between nodes. Graph stores can be used in applications such as social networks; examples include Neo4j and ArangoDB.
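To make the difference between the first two store types more concrete, here is a toy sketch in plain Python. It does not use any real database client: the dictionary, the list of documents, and the find_by_tag helper are all hypothetical stand-ins that only mimic the shape of the data.

```python
import json

# Key-value store: the store only sees an opaque value behind each key.
kv_store = {}
kv_store["session:42"] = json.dumps({"user": "ada", "cart": [3, 7]})

# Document store: each value is a structured document whose fields
# can be inspected and queried.
documents = [
    {"_id": 1, "title": "Intro to SQL", "tags": ["databases"]},
    {"_id": 2, "title": "Intro to Spark", "tags": ["big data"]},
]


def find_by_tag(docs, tag):
    """Return the documents whose 'tags' field contains `tag`."""
    return [d for d in docs if tag in d.get("tags", [])]


session = json.loads(kv_store["session:42"])  # lookup is by key only
matches = find_by_tag(documents, "databases")  # lookup is by field content
print(session["user"], [d["title"] for d in matches])
```

The key-value lookup needs the exact key, while the document lookup filters on what is inside each document, which is the practical distinction between the two models.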

332 Advantages

bull High availability

bull Schema free or schema-on-read options

bull Ability to rapidly prototype applications

bull Elastic scalability

bull Can store massive amounts of data

333 Disadvantages

bull Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync

bull Less support and maturity in the NoSQL ecosystem

34 Terminology

Query A query can be thought of as a single action that is taken on a database


Transaction A transaction is a sequence of queries that make up a single unit of work performed against a database

ACID Atomicity Consistency Isolation Durability

Schema A schema is the structure of a database

Scalability Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as data stored. The two main types are vertical scalability, which is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc., and horizontal scalability, which has to do with adding more machines and splitting the work amongst them.

Normalization This is a technique of organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization This is a technique of organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
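The trade-off between the last two terms can be sketched with Python's built-in sqlite3 module. The customers, orders, and orders_flat tables below are hypothetical examples and not part of any schema described in this primer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: customer details are stored once and referenced by key.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Ada', 'London');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0);

    -- Denormalized: customer details are repeated on every order row.
    CREATE TABLE orders_flat (
        id INTEGER PRIMARY KEY, customer_name TEXT, city TEXT, total REAL
    );
    INSERT INTO orders_flat VALUES
        (10, 'Ada', 'London', 25.0), (11, 'Ada', 'London', 40.0);
""")

# The normalized layout needs a JOIN to rebuild each order's full picture.
joined = conn.execute(
    "SELECT o.id, c.name, o.total FROM orders AS o "
    "JOIN customers AS c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()

# The denormalized layout answers the same question from a single table.
flat = conn.execute(
    "SELECT id, customer_name, total FROM orders_flat ORDER BY id"
).fetchall()
print(joined == flat)
```

Both queries return the same rows. The normalized version stores the customer's name and city once, while the flat version avoids the JOIN at the cost of repeating them on every order.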

References

bull https://dzone.com/articles/the-types-of-modern-databases

bull https://www.mongodb.com/nosql-explained

bull https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect

bull https://opensourceforu.com/2017/05/different-types-nosql-databases


CHAPTER 4

Data Engineering

bull Introduction

bull Subj_1

41 Introduction

xxx

42 Subj_1

Code

Some code

print("hello")

References


CHAPTER 5

Data Wrangling

bull Introduction

bull Subj_1

51 Introduction

xxx

52 Subj_1

Code

Some code

print("hello")

References


CHAPTER 6

Data Visualization

bull Introduction

bull Subj_1

61 Introduction

xxx

62 Subj_1

Code

Some code

print("hello")

References


CHAPTER 7

Deep Learning

bull Introduction

bull Subj_1

71 Introduction

xxx

72 Subj_1

Code

Some code

print("hello")

References


CHAPTER 8

Machine Learning

bull Introduction

bull Subj_1

81 Introduction

xxx

82 Subj_1

Code

Some code

print("hello")

References


CHAPTER 9

Python

bull Introduction

bull Subj_1

91 Introduction

xxx

92 Subj_1

Code

Some code

print("hello")

References


CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References


CHAPTER 11

SQL

bull Introduction

bull Subj_1

111 Introduction

xxx

112 Subj_1

Code

Some code

print("hello")

References


CHAPTER 12

Glossary

Definitions of common machine learning terms

Accuracy Percentage of correct predictions made by the model

Algorithm A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric What is the average difference between your predictions and the correct value for that observation?

bull Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in a low average difference.

bull High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.
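The two cases above can be sketched in a few lines of Python. The mean_error helper and the numbers are hypothetical and only illustrate the average signed difference described in this entry.

```python
def mean_error(predictions, targets):
    """Average signed difference between predictions and true values."""
    return sum(p - t for p, t in zip(predictions, targets)) / len(targets)


targets = [10.0, 20.0, 30.0, 40.0]
balanced = [12.0, 18.0, 33.0, 37.0]  # errors of +2, -2, +3, -3 cancel out
shifted = [15.0, 25.0, 35.0, 45.0]   # every prediction is 5 too high

bias_balanced = mean_error(balanced, targets)
bias_shifted = mean_error(shifted, targets)
print(bias_balanced, bias_shifted)
```

The balanced predictions are never exactly right yet their average difference is zero, which is why low bias alone does not prove that a model is accurate.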

Bias term Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise return False.
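A minimal sketch of applying a threshold, assuming the model already outputs a probability. The classify function and the probabilities are hypothetical.

```python
def classify(probability, threshold=0.5):
    """Return True when the predicted probability exceeds the threshold."""
    return probability > threshold


probabilities = [0.10, 0.50, 0.51, 0.90]
labels = [classify(p) for p in probabilities]
strict_labels = [classify(p, threshold=0.8) for p in probabilities]
print(labels)
print(strict_labels)
```

Raising the threshold makes the model more conservative about asserting a positive, which typically trades recall for precision.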

Clustering Unsupervised grouping of data into buckets

Confusion Matrix Table that describes the performance of a classification model by grouping predictions into 4 categories:

bull True Positives: we correctly predicted they do have diabetes

bull True Negatives: we correctly predicted they don't have diabetes

bull False Positives: we incorrectly predicted they do have diabetes (Type I error)


bull False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
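The four categories can be counted directly from paired labels. The sketch below uses only the standard library; the confusion_counts helper and the sample labels are hypothetical.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for paired boolean labels."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1  # has diabetes, predicted diabetes
        elif not a and not p:
            counts["TN"] += 1  # no diabetes, predicted no diabetes
        elif p:
            counts["FP"] += 1  # no diabetes, predicted diabetes (Type I)
        else:
            counts["FN"] += 1  # has diabetes, predicted none (Type II)
    return counts


actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
counts = confusion_counts(actual, predicted)
print(dict(counts))
```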

Continuous Variables Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning Deep learning is derived from a machine learning algorithm called the perceptron, or multi-layer perceptron, which has gained more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning is decades old, but today we have far more data and cheap computing power, which make this algorithm powerful enough to achieve state-of-the-art accuracy. In the modern world this algorithm is known as an artificial neural network. Deep learning is much more than a traditional artificial neural network, but it was highly influenced by machine learning's neural networks and perceptron networks.

Dimension In machine learning and data science, dimension means something different than it does in physics. Here, the dimension of the data is the number of features in your dataset. For example, in an object detection application, the flattened image size times the number of color channels (e.g. 28*28*3) is the number of features of the input set. In house price prediction, if house size is (say) the only feature in the dataset, we call it 1-dimensional data.

Epoch An epoch is one complete pass of the algorithm over the entire dataset; the number of epochs is the number of times the algorithm sees the entire dataset.

Extrapolation Making predictions outside the range of a dataset. E.g. My dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. "Color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection Feature selection is the process of selecting relevant features from a dataset for creating a machine learning model.

Feature Vector A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of the model. The depth of trees in a Decision Tree or the number of hidden layers in a Neural Network are examples of hyperparameters.

Induction A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. We keep observing X, so we infer that Y must be True.

Instance A data point, row, or sample in a dataset. Another term for observation.

Learning Rate The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
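The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, whose minimum is at x = 0; the function, starting point, step count, and learning rates are all assumptions picked for illustration.

```python
def gradient_descent(learning_rate, start=10.0, steps=50):
    """Minimize f(x) = x**2 by repeatedly stepping against the gradient."""
    x = start
    for _ in range(steps):
        gradient = 2 * x               # derivative of x**2
        x -= learning_rate * gradient  # the learning rate scales each step
    return x


moderate = gradient_descent(0.1)    # ends close to the minimum at 0
tiny = gradient_descent(0.001)      # has barely moved after 50 steps
too_high = gradient_descent(1.1)    # overshoots further on every step
print(moderate, tiny, too_high)
```

With the moderate rate each step shrinks x by a constant factor, the tiny rate needs far more steps to get anywhere, and the rate above 1 makes every step land further from the minimum than the last.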

Loss Loss = true value (from the dataset) - predicted value (from the ML model). The lower the loss, the better the model (unless the model has overfitted to the training data). The loss is calculated on the training and validation sets, and its interpretation is how well the model is doing on these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation set.
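As one possible concrete form of this summation, the sketch below adds up squared differences. Squaring the error is an assumption made here (it is one common choice, not something this entry prescribes), and the helper name and numbers are hypothetical.

```python
def squared_error_loss(targets, predictions):
    """Sum of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions))


# Hypothetical true values and model outputs for two small sets.
train_loss = squared_error_loss([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
val_loss = squared_error_loss([4.0, 6.0], [2.0, 9.0])
print(train_loss, val_loss)
```

A validation loss that is much higher than the training loss, as in these made-up numbers, is the kind of gap that can hint at overfitting.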

Machine Learning Contribute a definition

Model A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks Contribute a definition

Normalization Contribute a definition


Null Accuracy Baseline accuracy that can be achieved by always predicting the most frequent class ("B has the highest frequency, so let's guess B every time").
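A short sketch of computing this baseline with the standard library. The null_accuracy helper and the labels are hypothetical.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


labels = ["B", "A", "B", "C", "B", "B", "A", "B"]
baseline = null_accuracy(labels)  # "B" appears 5 times out of 8
print(baseline)
```

Any trained model should beat this number before its accuracy is considered meaningful.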

Observation A data point, row, or sample in a dataset. Another term for instance.

Overfitting Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set, but poorly on your test set (or new real-world data).

Parameters Be the first to contribute

Precision In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

$$R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

Recall vs Precision Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the percentage of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren't helpful, however, if the set actually contains 10 tumors, because we missed the other 9. We were super precise when we tried, but we didn't try hard enough.

• Recall, or Sensitivity, provides another lens with which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
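The brain scan scenario (100 images, 10 with tumors, a single correct True guess) can be checked with a few lines of Python. The label lists and helper names are constructed for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, fp, fn


# 100 scans: the first 10 have a tumor, the model flags only the first one.
y_true = [True] * 10 + [False] * 90
y_pred = [True] + [False] * 99

tp, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)  # 1.0 0.1
```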

Regression Predicting a continuous output (e.g. price, sales).

Regularization Contribute a definition

Reinforcement Learning Training a model to maximize a reward via iterative trial and error

Segmentation Contribute a definition

Specificity In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$$
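To show how predicting everything as negative games specificity, here is a small sketch. The data (10 positives, 90 negatives) and the function name are assumptions for illustration.

```python
def specificity(y_true, y_pred):
    """True negatives divided by all actual negatives."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tn / (tn + fp)


y_true = [True] * 10 + [False] * 90
always_negative = [False] * 100          # a "model" that never predicts Yes

score = specificity(y_true, always_negative)
caught = sum(1 for t, p in zip(y_true, always_negative) if t and p)
print(score, caught)  # 1.0 0
```

Specificity is perfect even though none of the 10 positive cases were caught, which is why it is usually read alongside recall.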

Supervised Learning Training a model using a labeled dataset

Test Set A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set A set of observations used to generate machine learning models

Transfer Learning Contribute a definition

Type 1 Error False positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and they hire him, but he is actually bad.

Type 2 Error False negatives. The candidate was great, but the company passed on him.

Underfitting Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning Training a model to find patterns in an unlabeled dataset (eg clustering)

Validation Set A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.
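One common way to act on this signal is early stopping. The sketch below is a simplified version under stated assumptions: the validation losses are made-up numbers, and the `patience` parameter (how many non-improving epochs to tolerate) is an illustrative choice.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the best epoch, stopping once the validation
    loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error keeps rising: stop training
    return best_epoch


val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
best = early_stopping_epoch(val_losses)
print(best)  # 3
```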

Variance How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.

CHAPTER 13

Business

• Introduction

• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

CHAPTER 14

Ethics

• Introduction

• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

CHAPTER 15

Mastering The Data Science Interview

• Introduction

• The Coding Challenge

• The HR Screen

• The Technical Call

• The Take Home Project

• The Onsite

• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that Data Science will be the sexiest job of the 21st Century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is that most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can vary hugely. This is partly because, as an industry, there still isn't an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it's important to determine what flavor of data science appeals to you. Based on your response to that, what you study and what questions you'll be asked will vary. Despite the differences between the types, generally speaking they'll follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple Fizzbuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.
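For readers who haven't met it, a Fizzbuzz question usually asks you to print the numbers from 1 to n, replacing multiples of 3 with "Fizz", multiples of 5 with "Buzz", and multiples of both with "FizzBuzz". The sketch below is one common Python formulation; the exact wording varies between companies, and the function name is an illustrative choice.

```python
def fizzbuzz(n):
    """Return the FizzBuzz string for a single positive integer n."""
    if n % 15 == 0:          # divisible by both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


for i in range(1, 16):
    print(fizzbuzz(i))
```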

More often than not, you'll be provided with written test cases that will tell you if you've passed or failed a question. This will typically consider both correctness and complexity (i.e. how long did it take to run your code). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the 'correct' solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally when, even after passing all the test cases, you don't get to the next stage in the interview process.

Preparation

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first; when you hit a snag, move to a simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5 to 10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving Dynamic Programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking The Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied to what you'll be doing once you start working. They can either occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g. here's a dataset, find some insights, and pitch them to business stakeholders) or focus on a more concrete deliverable (e.g. here's some data, build a classifier).

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis (EDA). Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution in customer revenue).

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or find (as answers without the questions) on this Github repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite. The rationale is that, despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more Leetcode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which typically evaluates your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioural interview questions such as 'Tell me about yourself', 'Why this company?', and 'Tell me about a time you had to deal with a difficult colleague'.

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desk working on their assigned projects, so try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500 company, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially for those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point, it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed back by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the option to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, the signing/relocation bonus is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401(k) matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight on what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.

CHAPTER 16

Learning How To Learn

• Introduction

• Upgrading Your Learning Toolbox

• Common Learning Traps

• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall This strategy consists of quizzing yourself on topics you've been learning, with the most common way of doing this being flashcards. One study showed a retention difference of 46% between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spaced repetition is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
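The review schedule can be turned into calendar dates with a few lines of Python. This sketch assumes the 1, 3, 7 and 21 day intervals are each counted from the day you first studied the material (the text does not say whether they are offsets or gaps between reviews), and the start date is arbitrary.

```python
from datetime import date, timedelta

# Assumed interpretation: days after the first study session.
REVIEW_OFFSETS_DAYS = (1, 3, 7, 21)


def review_dates(first_study_date):
    """Return the dates on which the material should be reviewed."""
    return [first_study_date + timedelta(days=d) for d in REVIEW_OFFSETS_DAYS]


schedule = review_dates(date(2019, 3, 19))
for review_day in schedule:
    print(review_day.isoformat())
```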

Chunking This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, the computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type led to students not being able to tell the difference between problems later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined the technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example Chemistry and Biology, where you'll be able to note both similarities and differences.

Deliberate Practice This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that some subjects lend themselves to deliberate practice more easily than others, the gold standard of deliberate practice generally consists of the following:

• A feedback loop in the form of a competition or a test

• A teacher who can guide you

• A uniquely tailored practice regimen

• Work outside of your comfort zone

• A concrete goal or performance

• Full concentration while practicing

• Building or modifying skills

• A known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher An experienced teacher can mean the difference between breaking through or breaking down. As such, it's important to look for a teacher who:

• Can keep you up to date on specific progress

• Provides practice exercises

• Directs your attention to the aspects of learning you should be focusing on

• Helps develop correct mental representations

• Provides feedback on what errors you're making

• Gives you an aspirational model for what good performance looks like

Dynamic Testing As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro technique This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes, during which you focus on one single thing. Once the timer is up, you have a 5 or 10-minute break before starting again.

16.3 Common Learning Traps

Authority Trap The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.

Confirmation Bias Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning Kruger Effect Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them, then when tested you fail to get any of the answers correct. Thus, having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy, agitated, or upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

• https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7

• https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn

• https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

• https://www.coursera.org/learn/learning-how-to-learn

CHAPTER 17

Communication

• Introduction

• Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

CHAPTER 18

Product

• Introduction

• Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

CHAPTER 19

Stakeholder Management

• Introduction

• Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

CHAPTER 20

Datasets

Public datasets in vision nlp and more forked from caesar0301rsquos awesome datasets wiki

bull Agriculture

bull Art

bull Biology

bull Chemistry/Materials Science

bull Climate/Weather

bull Complex Networks

bull Computer Networks

bull Data Challenges

bull Earth Science

bull Economics

bull Education

bull Energy

bull Finance

bull GIS

bull Government

bull Healthcare

bull Image Processing

bull Machine Learning

bull Museums


bull Music

bull Natural Language

bull Neuroscience

bull Physics

bull Psychology/Cognition

bull Public Domains

bull Search Engines

bull Social Networks

bull Social Sciences

bull Software

bull Sports

bull Time Series

bull Transportation

20.1 Agriculture

bull US Department of Agriculture's PLANTS Database

bull US Department of Agriculture's Nutrient Database

20.2 Art

bull Google's Quick Draw Sketch Dataset

20.3 Biology

bull 1000 Genomes

bull American Gut (Microbiome Project)

bull Broad Bioimage Benchmark Collection (BBBC)

bull Broad Cancer Cell Line Encyclopedia (CCLE)

bull Cell Image Library

bull Complete Genomics Public Data

bull EBI ArrayExpress

bull EBI Protein Data Bank in Europe

bull Electron Microscopy Pilot Image Archive (EMPIAR)

bull ENCODE project

bull Ensembl Genomes


bull Gene Expression Omnibus (GEO)

bull Gene Ontology (GO)

bull Global Biotic Interactions (GloBI)

bull Harvard Medical School (HMS) LINCS Project

bull Human Genome Diversity Project

bull Human Microbiome Project (HMP)

bull ICOS PSP Benchmark

bull International HapMap Project

bull Journal of Cell Biology DataViewer

bull MIT Cancer Genomics Data

bull NCBI Proteins

bull NCBI Taxonomy

bull NCI Genomic Data Commons

bull NIH Microarray data or FTP (see FTP link on RAW)

bull OpenSNP genotypes data

bull Pathguide - Protein-Protein Interactions Catalog

bull Protein Data Bank

bull Psychiatric Genomics Consortium

bull PubChem Project

bull PubGene (now Coremine Medical)

bull Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

bull Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

bull Sequence Read Archive (SRA)

bull Stanford Microarray Data

bull Stowers Institute Original Data Repository

bull Systems Science of Biological Dynamics (SSBD) Database

bull The Cancer Genome Atlas (TCGA) available via Broad GDAC

bull The Catalogue of Life

bull The Personal Genome Project or PGP

bull UCSC Public Data

bull UniGene

bull Universal Protein Resource (UniProt)


20.4 Chemistry/Materials Science

bull NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

bull Open Quantum Materials Database

bull Citrination Public Datasets

bull Khazana Project

20.5 Climate/Weather

bull Actuaries Climate Index

bull Australian Weather

bull Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system

bull Brazilian Weather - Historical data (In Portuguese)

bull Canadian Meteorological Centre

bull Climate Data from UEA (updated monthly)

bull European Climate Assessment & Dataset

bull Global Climate Data Since 1929

bull NASA Global Imagery Browse Services

bull NOAA Bering Sea Climate

bull NOAA Climate Datasets

bull NOAA Realtime Weather Models

bull NOAA SURFRAD Meteorology and Radiation Datasets

bull The World Bank Open Data Resources for Climate Change

bull UEA Climatic Research Unit

bull WorldClim - Global Climate Data

bull WU Historical Weather Worldwide

20.6 Complex Networks

bull AMiner Citation Network Dataset

bull CrossRef DOI URLs

bull DBLP Citation dataset

bull DIMACS Road Networks Collection

bull NBER Patent Citations

bull Network Repository with Interactive Exploratory Analysis Tools

bull NIST complex networks data collection

bull Protein-protein interaction network


bull PyPI and Maven Dependency Network

bull Scopus Citation Database

bull Small Network Data

bull Stanford GraphBase (Steven Skiena)

bull Stanford Large Network Dataset Collection

bull Stanford Longitudinal Network Data Sources

bull The Koblenz Network Collection

bull The Laboratory for Web Algorithmics (UNIMI)

bull The Nexus Network Repository

bull UCI Network Data Repository

bull UFL sparse matrix collection

bull WSU Graph Database

20.7 Computer Networks

bull 3.5B Web Pages from CommonCrawl 2012

bull 53.5B Web clicks of 100K users in Indiana Univ.

bull CAIDA Internet Datasets

bull ClueWeb09 - 1B web pages

bull ClueWeb12 - 733M web pages

bull CommonCrawl Web Data over 7 years

bull CRAWDAD Wireless datasets from Dartmouth Univ.

bull Criteo click-through data

bull OONI Open Observatory of Network Interference - Internet censorship data

bull Open Mobile Data by MobiPerf

bull Rapid7 Sonar Internet Scans

bull UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

bull Bruteforce Database

bull Challenges in Machine Learning

bull CrowdANALYTIX dataX

bull D4D Challenge of Orange

bull DrivenData Competitions for Social Good

bull ICWSM Data Challenge (since 2009)

bull Kaggle Competition Data


bull KDD Cup by Tencent 2012

bull Localytics Data Visualization Challenge

bull Netflix Prize

bull Space Apps Challenge

bull Telecom Italia Big Data Challenge

bull TravisTorrent Dataset - MSR'2017 Mining Challenge

bull Yelp Dataset Challenge

20.9 Earth Science

bull AQUASTAT - Global water resources and uses

bull BODC - marine data of ~22K vars

bull Earth Models

bull EOSDIS - NASA's earth observing system data

bull Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3

bull Marinexplore - Open Oceanographic Data

bull Smithsonian Institution Global Volcano and Eruption Database

bull USGS Earthquake Archives

20.10 Economics

bull American Economic Association (AEA)

bull EconData from UMD

bull Economic Freedom of the World Data

bull Historical MacroEconomic Statistics

bull International Economics Database and various data tools

bull International Trade Statistics

bull Internet Product Code Database

bull Joint External Debt Data Hub

bull Jon Haveman International Trade Data Links

bull OpenCorporates Database of Companies in the World

bull Our World in Data

bull SciencesPo World Trade Gravity Datasets

bull The Atlas of Economic Complexity

bull The Center for International Data

bull The Observatory of Economic Complexity

bull UN Commodity Trade Statistics


bull UN Human Development Reports

20.11 Education

bull College Scorecard Data

bull Student Data from Free Code Camp

20.12 Energy

bull AMPds

bull BLUEd

bull COMBED

bull Dataport

bull DRED

bull ECO

bull EIA

bull HES - Household Electricity Study, UK

bull HFED

bull iAWE

bull PLAID - the Plug Load Appliance Identification Dataset

bull REDD

bull Tracebase

bull UK-DALE - UK Domestic Appliance-Level Electricity

bull WHITED

20.13 Finance

bull CBOE Futures Exchange

bull Google Finance

bull Google Trends

bull NASDAQ

bull NYSE Market Data (see FTP link on RAW)

bull OANDA

bull OSU Financial data

bull Quandl

bull St Louis Federal

bull Yahoo Finance


20.14 GIS

bull ArcGIS Open Data portal

bull Cambridge, MA, US, GIS data on GitHub

bull Factual Global Location Data

bull Geo Spatial Data from ASU

bull Geo Wiki Project - Citizen-driven Environmental Monitoring

bull GeoFabrik - OSM data extracted to a variety of formats and areas

bull GeoNames Worldwide

bull Global Administrative Areas Database (GADM)

bull Homeland Infrastructure Foundation-Level Data

bull Landsat 8 on AWS

bull List of all countries in all languages

bull National Weather Service GIS Data Portal

bull Natural Earth - vectors and rasters of the world

bull OpenAddresses

bull OpenStreetMap (OSM)

bull Pleiades - Gazetteer and graph of ancient places

bull Reverse Geocoder using OSM data & additional high-resolution data files

bull TIGER/Line - US boundaries and roads

bull TwoFishes - Foursquare's coarse geocoder

bull TZ Timezones shapefile

bull UN Environmental Data

bull World boundaries from the US Department of State

bull World countries in multiple formats

20.15 Government

bull A list of cities and countries contributed by community

bull Open Data for Africa

bull OpenDataSoft's list of 1,600 open data

20.16 Healthcare

bull EHDP Large Health Data Sets

bull Gapminder World demographic databases

bull Medicare Coverage Database (MCD), US


bull Medicare Data Engine of medicare.gov Data

bull Medicare Data File

bull MeSH, the vocabulary thesaurus used for indexing articles for PubMed

bull Number of Ebola Cases and Deaths in Affected Countries (2014)

bull Open-ODS (structure of the UK NHS)

bull OpenPaymentsData Healthcare financial relationship data

bull The Cancer Genome Atlas project (TCGA) and BigQuery table

bull World Health Organization Global Health Observatory

20.17 Image Processing

bull 10k US Adult Faces Database

bull 2GB of Photos of Cats or Archive version

bull Adience Unfiltered faces for gender and age classification

bull Affective Image Classification

bull Animals with attributes

bull Caltech Pedestrian Detection Benchmark

bull Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)

bull Face Recognition Benchmark

bull GDXray: X-ray images for X-ray testing and Computer Vision

bull ImageNet (in WordNet hierarchy)

bull Indoor Scene Recognition

bull International Affective Picture System, UFL

bull Massive Visual Memory Stimuli, MIT

bull MNIST database of handwritten digits, 70,000 examples

bull Several Shape-from-Silhouette Datasets

bull Stanford Dogs Dataset

bull SUN database, MIT

bull The Action Similarity Labeling (ASLAN) Challenge

bull The Oxford-IIIT Pet Dataset

bull Violent-Flows - Crowd Violence / Non-violence Database and benchmark

bull Visual genome

bull YouTube Faces Database


20.18 Machine Learning

bull Context-aware data sets from five domains

bull Delve Datasets for classification and regression (Univ of Toronto)

bull Discogs Monthly Data

bull eBay Online Auctions (2012)

bull IMDb Database

bull Keel Repository for classification, regression, and time series

bull Labeled Faces in the Wild (LFW)

bull Lending Club Loan Data

bull Machine Learning Data Set Repository

bull Million Song Dataset

bull More Song Datasets

bull MovieLens Data Sets

bull New Yorker caption contest ratings

bull RDataMining - "R and Data Mining" ebook data

bull Registered Meteorites on Earth

bull Restaurants Health Score Data in San Francisco

bull UCI Machine Learning Repository

bull Yahoo Ratings and Classification Data

bull YouTube-8M

20.19 Museums

bull Canada Science and Technology Museums Corporation's Open Data

bull Cooper-Hewitt's Collection Database

bull Minneapolis Institute of Arts metadata

bull Natural History Museum (London) Data Portal

bull Rijksmuseum Historical Art Collection

bull Tate Collection metadata

bull The Getty vocabularies

20.20 Music

bull Nottingham Folk Songs

bull Bach 10


20.21 Natural Language

bull Automatic Keyphrase Extraction

bull Blogger Corpus

bull CLiPS Stylometry Investigation Corpus

bull ClueWeb09 FACC

bull ClueWeb12 FACC

bull DBpedia - 4.58M things with 583M facts

bull Flickr Personal Taxonomies

bull Freebase.com of people, places, and things

bull Google Books Ngrams (2.2TB)

bull Google MC-AFP, generated based on the publicly available Gigaword dataset using Paragraph Vectors

bull Google Web 5gram (1TB, 2006)

bull Gutenberg eBooks List

bull Hansards text chunks of Canadian Parliament

bull Machine Comprehension Test (MCTest) of text from Microsoft Research

bull Machine Translation of European languages

bull Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

bull Multi-Domain Sentiment Dataset (version 2.0)

bull Open Multilingual Wordnet

bull Personae Corpus

bull SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

bull SMS Spam Collection in English

bull Universal Dependencies

bull USENET postings corpus of 2005~2011

bull Webhose - News/Blogs in multiple languages

bull Wikidata - Wikipedia databases

bull Wikipedia Links data - 40 Million Entities in Context

bull WordNet databases and tools

20.22 Neuroscience

bull Allen Institute Datasets

bull Brain Catalogue

bull Brainomics

bull CodeNeuro Datasets

bull Collaborative Research in Computational Neuroscience (CRCNS)


bull FCP-INDI

bull Human Connectome Project

bull NDAR

bull NeuroData

bull Neuroelectro

bull NIMH Data Archive

bull OASIS

bull OpenfMRI

bull Study Forrest

20.23 Physics

bull CERN Open Data Portal

bull Crystallography Open Database

bull NASA Exoplanet Archive

bull NSSDC (NASA) data of 550 spacecraft

bull Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

bull OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

bull Amazon

bull Archive-it from Internet Archive

bull Archive.org Datasets

bull CMU JASA data archive

bull CMU StatLab collections

bull Data.World

bull Data360

bull Datamob.org

bull Google

bull Infochimps

bull KDNuggets Data Collections

bull Microsoft Azure Data Market Free DataSets

bull Microsoft Data Science for Research


bull Numbray

bull Open Library Data Dumps

bull Reddit Datasets

bull RevolutionAnalytics Collection

bull Sample R data sets

bull Stats4Stem R data sets

bull StatSci.org

bull The Washington Post List

bull UCLA SOCR data collection

bull UFO Reports

bull Wikileaks 9/11 pager intercepts

bull Yahoo Webscope

20.26 Search Engines

bull Academic Torrents of data sharing from UMB

bull Datahub.io

bull DataMarket (Qlik)

bull Harvard Dataverse Network of scientific data

bull ICPSR (UMICH)

bull Institute of Education Sciences

bull National Technical Reports Library

bull Open Data Certificates (beta)

bull OpenDataNetwork - A search engine of all Socrata powered data portals

bull Statista.com - statistics and studies

bull Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

bull 72 hours #gamergate Twitter Scrape

bull Ancestry.com Forum Dataset over 10 years

bull Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

bull CMU Enron Email of 150 users

bull EDRM Enron EMail of 151 users hosted on S3

bull Facebook Data Scrape (2005)

bull Facebook Social Networks from LAW (since 2007)

bull Foursquare from UMN/Sarwat (2013)


bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull Youtube Video Social Graph in 2007/2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive


bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (ND-GAIN)

bull Open Crime and Policing Data in England, Wales, and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1 from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database


bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams, and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks, port calls, and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport, airline, and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007


CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome machine learning.

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python


bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL
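For readers unfamiliar with the technique named in this entry, the following is a minimal Python sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is not a port of naive-apl; the class name, the bag-of-words input format, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()

    def fit(self, docs, labels):
        for words, label in zip(docs, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log P(class) + sum of log P(word | class), Laplace-smoothed
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Hypothetical toy corpus: each document is a list of tokens.
    docs = [["free", "prize", "now"], ["meeting", "at", "noon"],
            ["free", "offer"], ["lunch", "meeting"]]
    labels = ["spam", "ham", "spam", "ham"]
    model = NaiveBayes().fit(docs, labels)
    print(model.predict(["free", "prize"]))  # prints spam
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.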

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF)

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms
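Collaborative filtering, mentioned in the Recommender entry above, can be illustrated with a short user-based sketch in Python. This is not the API of either library listed; the function names, the nested-dict rating format, and the sample ratings are assumptions for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {item: rating} vectors (missing ratings count as 0)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by the similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


if __name__ == "__main__":
    # Hypothetical ratings on a 1-5 scale.
    ratings = {
        "ann": {"A": 5, "B": 4},
        "bob": {"A": 5, "B": 4, "C": 5},
        "cat": {"A": 1, "D": 3},
    }
    print(recommend(ratings, "ann"))  # prints ['C', 'D']
```

Each unseen item gets a predicted rating equal to the weighted average of other users' ratings, with more similar users contributing more.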

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit. HTK is a portable toolkit for building and manipulating hidden Markov models

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java, and MATLAB interfaces and supports Windows, Linux, Android, and Mac OS


bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph

bull CUDA - This is a fast C++/CUDA implementation of convolutional neural networks [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework with less than 1000 lines core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state of the art machine learning easy to work with and integrate into existing applications

bull Distributed Machine learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that change for every training instance. Written in C++ with bindings in Python

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and allows processing data in batch, online, and distributed modes

bull LightGBM - framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization, and storage

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP


bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ for segmenting/labeling sequential data & other Natural Language Processing tasks

bull CRFsuite for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format
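Several entries in this list work with n-grams. As a rough illustration of what an n-gram is, here is a minimal Python sketch that extracts and counts them; it is unrelated to the APIs of the libraries above, and the function names and whitespace tokenization are assumptions made for the example.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count how often each n-gram occurs in the token sequence."""
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    text = "to be or not to be"
    tokens = text.split()  # naive whitespace tokenization, for illustration only
    print(ngram_counts(tokens, 2).most_common(1))  # prints [(('to', 'be'), 2)]
```

A sequence shorter than n simply yields no n-grams, so callers do not need a special case for short inputs.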

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit. GRT is a cross-platform, open-source C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library


21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression, and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer stemming implementation in Elixir


21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression

General-Purpose Machine Learning

bull gago - Multi-population, flexible, parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering, and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript, and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go


21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - A suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Cortical.io as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon, and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System

bull Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text, starting with seed sets and using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK components in Java and is built on top of Apache UIMA


bull Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license

bull CogcompNLP developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark, or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch, or reactive web services

bull Neuroph - A lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka, with a specialization for real-time large-scale machine learning

bull Samoa - A framework that includes distributed machine learning for data streams, with an interface to plug in different stream processing platforms

bull RankLib - A library of learning-to-rank algorithms

bull rapaio - Statistics, data mining, and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML - A machine learning (ML) language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application.

Speech Recognition

bull CMU Sphinx - Open source toolkit for speech recognition; a speech recognition library written purely in Java

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis, and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 JavaScript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in JavaScript and CoffeeScript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction, and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull Chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3D plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - Customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - Library written in vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull ConvNetJS - A JavaScript library for training deep learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in JavaScript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in JavaScript for Node.js and the browser

bull Decision Trees - Node.js implementation of a decision tree using the ID3 algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means, and agglomerative clustering

bull Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js

bull Kmeans.js - Simple JavaScript implementation of the k-means algorithm, for Node.js and the browser

bull LDA.js - LDA topic modeling for Node.js

bull Learning.js - JavaScript implementation of logistic regression/C4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration; just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for Node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for Node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has an advantage on large datasets and supports multi-threaded training.

bull kalman - Kalman filter for JavaScript

bull shaman - Node.js library with support for both simple and multiple linear regression

bull mljs - Machine learning and numerical analysis tools for Node.js and the browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, JavaScript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

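bull simple-statistics - A JavaScript implementation of descriptive, regression, and inference statistics that works in browsers as well as in Node.js

bull regression-js - A JavaScript library containing a collection of least squares fitting methods for finding a trend in a set of data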

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance
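
Great circle distance, as computed by libraries such as GreatCircle, is commonly obtained with the haversine formula. The following is a minimal sketch in Python, not taken from any library listed above: the function name `great_circle_km` is hypothetical, and the Earth is assumed to be a sphere with a mean radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 so floating-point rounding cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Because the formula treats the Earth as a perfect sphere, results can differ slightly from ellipsoidal geodesic calculations.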

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models - A Julia package for fitting (statistical) mixed-effects models

bull Simple MCMC - Basic MCMC sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba - Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, JavaScript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata SAS and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy.

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version.

bull graph - Graph package for Torch

bull randomkit - NumPy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - Framework for Torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for the nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns.

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua, and then optimizing them with OpenGM.

bull spaghetti - Spaghetti (sparse linear) module for torch7 by @MichaelMathieu

bull LuaSHKit - A Lua wrapper around the locality-sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images.

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos.

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms.

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules.

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform, designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning w/ PDF, lectures, code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB.

Data Analysis / Data Visualization

bull matlab_gbl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross platform wrapper of OpenCV which can be compiled in Mono to run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.

bull DiffSharp - An automatic differentiation (AD) library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and every day use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.

bull Sho - An interactive environment for data analysis and scientific computing that connects scripts (in IronPython) with compiled code (in .NET) to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples using a trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift.

bull BPN-NeuralNetwork - A back propagation neural network (BPN). This network can be used in product recommendation, user behavior analysis, data mining and data analysis.

bull Multi-Perceptron-NeuralNetwork - A multi-perceptron neural network based on back propagation, designed to support an unlimited number of hidden layers

bull KRHebbian-Algorithm - A self-learning algorithm that adjusts the weights in a neural network

bull KRKmeans-Algorithm - An implementation of K-Means, the clustering and classification algorithm. It can be used in data mining and image compression.

bull KRFuzzyCMeans-Algorithm - An implementation of Fuzzy C-Means, the fuzzy clustering and classification algorithm. It can be used in data mining and image compression.
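
Several entries above implement K-Means clustering. The idea can be sketched in a few lines of plain Python (Lloyd's algorithm). This is an illustrative sketch only, not the code of any listed library: the function name `kmeans` is hypothetical, and seeding the centroids with the first k points is an assumption made to keep the example deterministic.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Assumption for this sketch: the first k points seed the centroids.
    Real implementations usually pick better starting points.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels
```

Calling `kmeans(points, 2)` returns the final centroids together with one cluster label per input point.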

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library.

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.

bull Quepy - A Python framework to transform natural language questions to queries in a database query language

bull TextBlob - Provides a consistent API for common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email spam filtering built on top of NLTK

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl (https://github.com/proycon/pynlpl) - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments.

bull python-ucto

bull python-frog

bull python-zpar (https://github.com/EducationalTestingService/python-zpar) - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to a C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - Higher-level NLP built on spaCy

bull stanford-corenlp-python (https://github.com/dasmith/stanford-corenlp-python) - Python wrapper for Stanford CoreNLP
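
Several of the string-matching entries above (Distance, editdistance, jellyfish) compute edit distances. As a rough illustration of what those metrics measure, here is a minimal pure-Python sketch of Hamming and Levenshtein distance. The function names are hypothetical and this is not the API of any listed package; the listed libraries are typically much faster.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]
```

The Levenshtein routine keeps only two rows of the dynamic-programming table, so memory use grows with the length of the second string rather than the product of both lengths.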

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code, and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning.

bull machine learning (https://github.com/jeff1evesque/machine-learning) - Automated build consisting of a web-interface and a set of programmatic-interface APIs; generated models are stored into a NoSQL datastore

bull XGBoost Library

bull Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch, or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library.

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A library with various machine learning models implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 (https://github.com/lisa-lab/pylearn2) - A Machine Learning library based on Theano

bull keras (https://github.com/fchollet/keras) - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain.

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks (https://github.com/williamFalcon/pix2pix-keras) - Implementation of image to image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library that lets Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

bull Pebl - Python Environment for Bayesian Learning

bull Theano - An optimizing math compiler in Python for array-oriented computation, with GPU code generation

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit.

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon (https://github.com/NervanaSystems/neon) - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for the book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow, mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning.

bull pgmpy - A Python library for working with Probabilistic Graphical Models

bull DIGITS - A web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, JavaScript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides a better user experience.

bull rgf_python Library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more; it's easy to use and easy to learn for beginners

Data Analysis Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics science and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Python JIT compiler to LLVM aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual, intuitive, and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has a Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in Python without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspel - raspell is an interface binding for ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists, and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community based data collection, packed in gem. Get list of pretty much anything (stop words, countries, non words) in txt, json or hash. Demo/Search for a list

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license

bull rustlearn - a machine learning framework featuring logistic regression, support vector machines, decision trees and random forests

bull rusty-machine - a pure-rust machine learning library

bull leaf - open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz: Regularization for semiparametric additive hazards regression

bull arules - arules: Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso: Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm: Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta: A wrapper algorithm for all-relevant feature selection

bull bst - bst: Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training: Unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn: Classification, regression, feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist: Rule- and Instance-Based Regression Modeling

bull e1071 TU Wien

bull earth - earth: Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning, Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree: Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc: Flexible procedures for clustering

bull frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS

bull gbm - gbm: Generalized Boosted Regression Models

bull glmnet - glmnet: Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso: Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg: Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deeplearning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda: Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred: Improved Predictors

bull kernlab - kernlab: Kernel-based Machine Learning Lab

bull klaR - klaR: Classification and visualization

bull lars - lars: Least Angle Regression, Lasso and Forward Stagewise

bull lasso2 - lasso2: L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg: Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree: Mapping, pruning, and graphing tree models

bull mboost - mboost: Model-Based Boosting

bull medley - medley: Blending regression models, using a greedy stepwise approach

bull mlr - mlr: Machine Learning in R

bull mvpart - mvpart: Multivariate partitioning

bull ncvreg - ncvreg: Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree: Oblique Trees for Classification Data

bull pamr - pamr: Pam: prediction analysis for microarrays

bull party - party: A Laboratory for Recursive Partytioning

bull partykit - partykit: A Toolkit for Recursive Partytioning

bull penalized - Penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA: Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM: Feature Selection SVM using penalty functions

bull quantregForest - quantregForest: Quantile Regression Forests

bull randomForest - randomForest: Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle: Graphical user interface for data mining in R

bull rda - rda: Shrunken Centroids Regularized Discriminant Analysis

bull rdetools in Feature Spaces

bull REEMtree Data

bull relaxo - relaxo: Relaxed Lasso

bull rgenoud - rgenoud: R version of GENetic Optimization Using Derivatives

bull rgp - rgp: R genetic programming framework

bull Rmalschains in R

bull rminer in classification and regression

bull ROCR - ROCR: Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets: Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart: Recursive Partitioning and Regression Trees

bull RPMM - RPMM: Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka: R/Weka interface

bull RXshrink - RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda: Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA: Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath: the SVM Path algorithm

bull tgp - tgp: Bayesian treed Gaussian process models

bull tree - tree: Classification and regression trees

bull varSelRF - varSelRF: Variable selection using random forests

bull XGBoostR Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R.

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis / Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis / Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE! Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment, including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters, and performing inference.

Data Analysis / Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data. A unix filter for algebird-powered aggregation

bull PredictionIO - PredictionIO, a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe - Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

21.25 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems.

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression, and Ridge Regression.

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc.

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization / Training Techniques

ndash Unsupervised / Generative Models

ndash Image Segmentation / Object Detection

ndash Image / Video

ndash Natural Language Processing

ndash Speech / Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

22.1 Machine Learning

Be the first to contribute

22.2 Deep Learning

Forked from terryum's awesome deep learning papers.

22.2.1 Understanding

bull Distilling the knowledge in a neural network (2015), G. Hinton et al. [pdf]

bull Deep neural networks are easily fooled: High confidence predictions for unrecognizable images (2015), A. Nguyen et al. [pdf]

bull How transferable are features in deep neural networks? (2014), J. Yosinski et al. [pdf]

bull CNN features off-the-Shelf: An astounding baseline for recognition (2014), A. Razavian et al. [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014), M. Oquab et al. [pdf]

bull Visualizing and understanding convolutional networks (2014), M. Zeiler and R. Fergus [pdf]

bull Decaf: A deep convolutional activation feature for generic visual recognition (2014), J. Donahue et al. [pdf]

22.2.2 Optimization / Training Techniques

bull Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Ioffe and C. Szegedy [pdf]

bull Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), K. He et al. [pdf]

bull Dropout: A simple way to prevent neural networks from overfitting (2014), N. Srivastava et al. [pdf]

bull Adam: A method for stochastic optimization (2014), D. Kingma and J. Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012), G. Hinton et al. [pdf]

bull Random search for hyper-parameter optimization (2012), J. Bergstra and Y. Bengio [pdf]

22.2.3 Unsupervised / Generative Models

bull Pixel recurrent neural networks (2016), A. Oord et al. [pdf]

bull Improved techniques for training GANs (2016), T. Salimans et al. [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015), A. Radford et al. [pdf]

bull DRAW: A recurrent neural network for image generation (2015), K. Gregor et al. [pdf]

bull Generative adversarial nets (2014), I. Goodfellow et al. [pdf]

bull Auto-encoding variational Bayes (2013), D. Kingma and M. Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013), Q. Le et al. [pdf]

22.2.4 Image Segmentation / Object Detection

bull You only look once: Unified, real-time object detection (2016), J. Redmon et al. [pdf]

bull Fully convolutional networks for semantic segmentation (2015), J. Long et al. [pdf]

bull Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015), S. Ren et al. [pdf]

bull Fast R-CNN (2015), R. Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014), R. Girshick et al. [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs, L. Chen et al. [pdf]

bull Learning hierarchical features for scene labeling (2013), C. Farabet et al. [pdf]

22.2.5 Image / Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016), C. Dong et al. [pdf]

bull A neural algorithm of artistic style (2015), L. Gatys et al. [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015), A. Karpathy and L. Fei-Fei [pdf]

bull Show, attend and tell: Neural image caption generation with visual attention (2015), K. Xu et al. [pdf]

bull Show and tell: A neural image caption generator (2015), O. Vinyals et al. [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015), J. Donahue et al. [pdf]

bull VQA: Visual question answering (2015), S. Antol et al. [pdf]

bull DeepFace: Closing the gap to human-level performance in face verification (2014), Y. Taigman et al. [pdf]

bull Large-scale video classification with convolutional neural networks (2014), A. Karpathy et al. [pdf]

bull DeepPose: Human pose estimation via deep neural networks (2014), A. Toshev and C. Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014), K. Simonyan et al. [pdf]

bull 3D convolutional neural networks for human action recognition (2013), S. Ji et al. [pdf]

22.2.6 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016), G. Lample et al. [pdf]

bull Exploring the limits of language modeling (2016), R. Jozefowicz et al. [pdf]

bull Teaching machines to read and comprehend (2015), K. Hermann et al. [pdf]

bull Effective approaches to attention-based neural machine translation (2015), M. Luong et al. [pdf]

bull Conditional random fields as recurrent neural networks (2015), S. Zheng and S. Jayasumana [pdf]

bull Memory networks (2014), J. Weston et al. [pdf]

bull Neural turing machines (2014), A. Graves et al. [pdf]

bull Neural machine translation by jointly learning to align and translate (2014), D. Bahdanau et al. [pdf]

bull Sequence to sequence learning with neural networks (2014), I. Sutskever et al. [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014), K. Cho et al. [pdf]

bull A convolutional neural network for modeling sentences (2014), N. Kalchbrenner et al. [pdf]

bull Convolutional neural networks for sentence classification (2014), Y. Kim [pdf]

bull Glove: Global vectors for word representation (2014), J. Pennington et al. [pdf]

bull Distributed representations of sentences and documents (2014), Q. Le and T. Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013), T. Mikolov et al. [pdf]

bull Efficient estimation of word representations in vector space (2013), T. Mikolov et al. [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013), R. Socher et al. [pdf]

bull Generating sequences with recurrent neural networks (2013), A. Graves [pdf]

22.2.7 Speech / Other

bull End-to-end attention-based large vocabulary speech recognition (2016), D. Bahdanau et al. [pdf]

bull Deep speech 2: End-to-end speech recognition in English and Mandarin (2015), D. Amodei et al. [pdf]

bull Speech recognition with deep recurrent neural networks (2013), A. Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups (2012), G. Hinton et al. [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012), G. Dahl et al. [pdf]

bull Acoustic modeling using deep belief networks (2012), A. Mohamed et al. [pdf]

22.2.8 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016), S. Levine et al. [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016), S. Levine et al. [pdf]

bull Asynchronous methods for deep reinforcement learning (2016), V. Mnih et al. [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016), H. Hasselt et al. [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016), D. Silver et al. [pdf]

bull Continuous control with deep reinforcement learning (2015), T. Lillicrap et al. [pdf]

bull Human-level control through deep reinforcement learning (2015), V. Mnih et al. [pdf]

bull Deep learning for detecting robotic grasps (2015), I. Lenz et al. [pdf]

bull Playing atari with deep reinforcement learning (2013), V. Mnih et al. [pdf]

22.2.9 New papers

bull Deep Photo Style Transfer (2017), F. Luan et al. [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017), T. Salimans et al. [pdf]

bull Deformable Convolutional Networks (2017), J. Dai et al. [pdf]

bull Mask R-CNN (2017), K. He et al. [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017), T. Kim et al. [pdf]

bull Deep voice: Real-time neural text-to-speech (2017), S. Arik et al. [pdf]

bull PixelNet: Representation of the pixels, by the pixels, and for the pixels (2017), A. Bansal et al. [pdf]

bull Batch renormalization: Towards reducing minibatch dependence in batch-normalized models (2017), S. Ioffe [pdf]

bull Wasserstein GAN (2017), M. Arjovsky et al. [pdf]

bull Understanding deep learning requires rethinking generalization (2017), C. Zhang et al. [pdf]

bull Least squares generative adversarial networks (2016), X. Mao et al. [pdf]

22.2.10 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011), A. Coates et al. [pdf]

bull Deep sparse rectifier neural networks (2011), X. Glorot et al. [pdf]

bull Natural language processing (almost) from scratch (2011), R. Collobert et al. [pdf]

bull Recurrent neural network based language model (2010), T. Mikolov et al. [pdf]

bull Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion (2010), P. Vincent et al. [pdf]

bull Learning mid-level features for recognition (2010), Y. Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010), G. Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010), X. Glorot and Y. Bengio [pdf]

bull Why does unsupervised pre-training help deep learning? (2010), D. Erhan et al. [pdf]

bull Learning deep architectures for AI (2009), Y. Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009), H. Lee et al. [pdf]

bull Greedy layer-wise training of deep networks (2007), Y. Bengio et al. [pdf]

bull A fast learning algorithm for deep belief nets (2006), G. Hinton et al. [pdf]

bull Gradient-based learning applied to document recognition (1998), Y. LeCun et al. [pdf]

bull Long short-term memory (1997), S. Hochreiter and J. Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses and more forked from josephmisiti's awesome machine learning.

bull Blogs

ndash Data Science

ndash Machine learning

ndash Math

bull Books

ndash Machine learning

ndash Deep learning

ndash Probability & Statistics

ndash Linear Algebra

bull Courses

bull Podcasts

bull Tutorials

23.1 Blogs

23.1.1 Data Science

bull https://jeremykun.com

bull http://iamtrask.github.io

bull http://blog.explainmydata.com

bull http://andrewgelman.com

bull http://simplystatistics.org

bull http://www.evanmiller.org

bull http://jakevdp.github.io

bull http://blog.yhat.com

bull http://wesmckinney.com

bull http://www.overkillanalytics.net

bull http://newton.cx/~peter

bull http://mbakker7.github.io/exploratory_computing_with_python

bull https://sebastianraschka.com/blog/index.html

bull http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

bull http://colah.github.io

bull http://www.thomasdimson.com

bull http://blog.smellthedata.com

bull https://sebastianraschka.com

bull http://dogdogfish.com

bull http://www.johnmyleswhite.com

bull http://drewconway.com/zia

bull http://bugra.github.io

bull http://opendata.cern.ch

bull https://alexanderetz.com

bull http://www.sumsar.net

bull https://www.countbayesie.com

bull http://blog.kaggle.com

bull http://www.danvk.org

bull http://hunch.net

bull http://www.randalolson.com/blog

bull https://www.johndcook.com/blog/r_language_for_programmers

bull http://www.dataschool.io

23.1.2 Machine learning

bull OpenAI

bull Distill

bull Andrej Karpathy Blog

bull Colah's Blog

bull WildML

bull FastML

bull TheMorningPaper

23.1.3 Math

bull http://www.sumsar.net

bull http://allendowney.blogspot.ca

bull https://healthyalgorithms.com

bull https://petewarden.com

bull http://mrtz.org/blog

23.2 Books

23.2.1 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory, Inference, and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and S.V.N. Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting: principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M. Lynch

bull Bayesian Modeling, Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning, Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning - Book + MatlabToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

23.2.2 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

bull NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

23.2.3 Probability & Statistics

• Think Stats - Book + Python Code
• From Algorithms to Z-Scores - Book
• The Art of R Programming
• Introduction to statistical thought
• Basic Probability Theory
• Introduction to probability - By Dartmouth College
• Principle of Uncertainty
• Probability & Statistics Cookbook
• Advanced Data Analysis From An Elementary Point of View
• Introduction to Probability - Book and course by MIT
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction - Book
• An Introduction to Statistical Learning with Applications in R - Book
• Learning Statistics Using R
• Introduction to Probability and Statistics Using R - Book
• Advanced R Programming - Book
• Practical Regression and Anova using R - Book
• R practicals - Book
• The R Inferno - Book


23.2.4 Linear Algebra

• Linear Algebra Done Wrong
• Linear Algebra, Theory, and Applications
• Convex Optimization
• Applied Numerical Computing
• Applied Numerical Linear Algebra

23.3 Courses

• CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University
• CS224d: Deep Learning for Natural Language Processing, Stanford University
• Oxford Deep NLP 2017: Deep Learning for Natural Language Processing, University of Oxford
• Artificial Intelligence (Columbia University) - free
• Machine Learning (Columbia University) - free
• Machine Learning (Stanford University) - free
• Neural Networks for Machine Learning (University of Toronto) - free
• Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach; Machine Learning: Regression; Machine Learning: Classification; Machine Learning: Clustering & Retrieval; Machine Learning: Recommender Systems & Dimensionality Reduction; Machine Learning Capstone: An Intelligent Application with Deep Learning - free
• Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings
• Learning from Data (by Yaser S. Abu-Mostafa, Caltech) - Lecture videos available

23.4 Podcasts

• The O’Reilly Data Show
• Partially Derivative
• The Talking Machines
• The Data Skeptic
• Linear Digressions
• Data Stories
• Learning Machines 101
• Not So Standard Deviations
• TWIMLAI

23.5 Tutorials

Be the first to contribute!

CHAPTER 24

Contribute

Become a contributor! Check out our GitHub for more information.


CHAPTER 2

Big Data

• Introduction
• Subj_1

https://www.quora.com/What-is-a-Hadoop-ecosystem

2.1 Introduction

The tooling and methodologies for dealing with the phenomenon of big data were originally created in response to the exponential growth in data generation (TODO: INSERT quote about data generation). When talking about big data, we mean (insert what we mean).

We’ll be focusing exclusively on the Apache Hadoop Ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims.

Why It’s Important

While a data scientist may not be directly involved in managing big data and performing Extract, Transform, and Load (ETL) tasks, they’ll likely be responsible, at a minimum, for querying the sources where this data is stored. Ideally, a data scientist should be familiar with how this data is stored, how to query it, and how to transform it (e.g. Spark, Hadoop, and Hive would be three key technologies to be familiar with).

2.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 3

Databases

• Introduction
• Relational
  – Types
  – Advantages
  – Disadvantages
• Non-relational/NoSQL
  – Types
  – Advantages
  – Disadvantages
• Terminology

3.1 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing, and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds of databases are a variety of other database sub-types, which the following sections will explore in detail.

Why It’s Important

Oftentimes, a data scientist will have to work with a database in order to retrieve the data they need to build their models or to do data analysis. Thus, it’s important that a data scientist familiarize themselves with the different types of databases and how to query them.

3.2 Relational

A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes) with rows that represent the data within the table. You can think of the column as the header indicating the name of the data being stored and the row as the actual data points that are stored in the database. Oftentimes these rows will have a unique identifier associated with them, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.
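To make primary and foreign keys concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are invented for illustration; other relational databases would use slightly different SQL.

```python
import sqlite3

# In-memory database; table and column names are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked to

# Each table has a primary key; orders.customer_id is a foreign key into customers.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)

# Join the two tables through the foreign key and total the orders per customer.
totals = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers AS c"
    " JOIN orders AS o ON o.customer_id = c.id"
    " GROUP BY c.id ORDER BY c.id"
).fetchall()
print(totals)  # [('Ada', 65.0), ('Grace', 15.0)]
```

The join condition `o.customer_id = c.id` is where the foreign key does its work: it tells the database which order rows belong to which customer row.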

To get data out of a relational database you would use SQL, with different relational databases using slightly different versions of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a Transaction. Transactions in relational databases represent a unit of work performed against the database.

An important concept with transactions in relational databases is maintaining ACID, which stands for:

• Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails, then the whole transaction fails, leaving the database unchanged.
• Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.
• Isolation: Since transactions are executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.
• Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
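Atomicity is the easiest of these properties to see in action. The sketch below uses Python's built-in sqlite3 module with a made-up accounts table: a transfer is written as two updates, the second one violates a constraint, and the first one is rolled back with it. The table, the names, and the transfer helper are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint makes any update that would overdraw an account fail.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Alice only has 100, so the second UPDATE fails and the credit to Bob is undone too.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, `ok` is False and both balances are unchanged, which is the "whole transaction fails, leaving the database unchanged" behaviour described above.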

3.2.1 Types

While there are many different kinds of relational databases, the most popular are:

• PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.
• Oracle: A multi-model database management system that is produced and marketed by Oracle.
• MySQL: An open-source relational database management system with a proprietary paid version available for additional functionality.
• Microsoft SQL Server: A relational database management system developed by Microsoft.
• MariaDB: A MySQL-compatible database engine forked from MySQL.

3.2.2 Advantages

• SQL standards are well defined and commonly accepted.
• Easy to categorize and store data that can later be queried.
• Simple to understand, since the table structure and relationships are intuitive to most users.
• Data integrity through strong data typing and validity checks that make sure data falls within an acceptable range.

3.2.3 Disadvantages

• Poor performance with unstructured data types due to schema and type constraints.
• Can be slow and not scalable compared to NoSQL.
• Unable to map certain kinds of data, such as graphs.

3.3 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Additionally, dealing with scalability, multi-structured data, and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

3.3.1 Types

• Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. They can be used to store user session data, and examples of these types of databases include Redis and Amazon Dynamo.
• Document Store: Similar to a key-value store; however, in this case the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.
• Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. These can be used in content management systems, and examples of these types of databases include Cassandra and Apache HBase.
• Graph Store: These are based on the Entity - Attribute - Value model. Entities will have associated attributes and subsequent values when data is inserted. Nodes will store data about each entity along with the relationships between nodes. Graph stores can be used in applications such as social networks, and examples of these types of databases include Neo4j and ArangoDB.
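The four data models above can be compared side by side with plain Python structures. This is only a toy sketch of the shape of the data in each model; it does not use or imitate the client API of Redis, MongoDB, Cassandra, Neo4j, or any other product, and all keys and values are invented.

```python
# Key-value store: an opaque value looked up by a unique key (a dict is a hash table).
session_store = {"session:42": "user_id=7;cart=3"}

# Document store: the value is a structured document whose fields can be inspected.
blog_posts = {
    "post:1": {"title": "Hello", "tags": ["intro"], "author": {"name": "Ada"}},
}

# Column store: values grouped by column family, then by column, then by row key.
column_families = {
    "profile": {"name": {"row1": "Ada", "row2": "Grace"}},
    "stats": {"visits": {"row1": 12, "row2": 3}},
}

# Graph store: nodes hold entity data, edges hold the relationships between nodes.
nodes = {"ada": {"kind": "person"}, "grace": {"kind": "person"}}
edges = [("ada", "FOLLOWS", "grace")]


def followers_of(person):
    """Return every node with a FOLLOWS edge pointing at `person`."""
    return [src for src, relation, dst in edges if relation == "FOLLOWS" and dst == person]


# A document store can filter on fields inside the value; a key-value store cannot.
posts_tagged_intro = [key for key, doc in blog_posts.items() if "intro" in doc["tags"]]

# A column store can read one column for all rows without touching the others.
all_names = list(column_families["profile"]["name"].values())
```

The point of the comparison is which questions each shape answers cheaply: lookup by key, filtering on document fields, scanning a single column, or following relationships.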

3.3.2 Advantages

• High availability
• Schema-free or schema-on-read options
• Ability to rapidly prototype applications
• Elastic scalability
• Can store massive amounts of data

3.3.3 Disadvantages

• Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync.
• Less support and maturity in the NoSQL ecosystem.

3.4 Terminology

Query: A query can be thought of as a single action that is taken on a database.

Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as in data stored. The two main types are vertical scalability, which is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc., and horizontal scalability, which has to do with adding more machines and splitting the work amongst them.

Normalization: This is a technique of organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique of organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
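The trade-off between the last two terms can be sketched with plain Python lists and dictionaries standing in for tables. The column names and rows are invented, and the normalize and denormalize helpers are hypothetical illustrations rather than functions from any database library.

```python
# Denormalized: the customer's city is repeated on every order row.
orders_denormalized = [
    {"order_id": 1, "customer": "Ada", "city": "London", "amount": 25.0},
    {"order_id": 2, "customer": "Ada", "city": "London", "amount": 40.0},
    {"order_id": 3, "customer": "Grace", "city": "New York", "amount": 15.0},
]


def normalize(rows):
    """Split rows into a customers table and an orders table that refers to it by key."""
    customers = {}
    orders = []
    for row in rows:
        customers[row["customer"]] = {"city": row["city"]}  # each city stored once
        orders.append(
            {"order_id": row["order_id"], "customer": row["customer"], "amount": row["amount"]}
        )
    return customers, orders


def denormalize(customers, orders):
    """Combine the two tables back into one wide table, as a JOIN query would."""
    return [{**order, "city": customers[order["customer"]]["city"]} for order in orders]


customers, orders = normalize(orders_denormalized)
```

In the normalized form, changing Ada's city means updating one entry in `customers`; in the denormalized form every one of her order rows would have to change, but reads no longer need the join.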

References

• https://dzone.com/articles/the-types-of-modern-databases
• https://www.mongodb.com/nosql-explained
• https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect
• https://opensourceforu.com/2017/05/different-types-nosql-databases

CHAPTER 4

Data Engineering

• Introduction
• Subj_1

4.1 Introduction

xxx

4.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 5

Data Wrangling

• Introduction
• Subj_1

5.1 Introduction

xxx

5.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 6

Data Visualization

• Introduction
• Subj_1

6.1 Introduction

xxx

6.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 7

Deep Learning

• Introduction
• Subj_1

7.1 Introduction

xxx

7.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 8

Machine Learning

• Introduction
• Subj_1

8.1 Introduction

xxx

8.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 9

Python

• Introduction
• Subj_1

9.1 Introduction

xxx

9.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References


CHAPTER 11

SQL

• Introduction
• Subj_1

11.1 Introduction

xxx

11.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 12

Glossary

Definitions of common machine learning terms.

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

• Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in a low average difference.
• High bias (with low variance) suggests your model may be underfitting and you’re using the wrong architecture for the job.

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn’t matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold: The lowest probability value at which we’re comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise return False.
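As a small illustration of the classification threshold entry, the sketch below turns predicted probabilities into True/False labels. The classify helper and the sample probabilities are made up; the strict "greater than" comparison follows the "> 50%" wording above.

```python
def classify(probabilities, threshold=0.5):
    """Return True for every predicted probability strictly above the threshold."""
    return [p > threshold for p in probabilities]


predicted_probabilities = [0.10, 0.50, 0.51, 0.93]

# With the default 50% threshold, the last two observations are classified positive.
labels = classify(predicted_probabilities)

# Raising the threshold makes the classifier more conservative about saying "positive".
strict_labels = classify(predicted_probabilities, threshold=0.9)
```

Here `labels` is `[False, False, True, True]` and `strict_labels` is `[False, False, False, True]`; where to set the threshold depends on how costly false positives are relative to false negatives.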

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories:

• True Positives: we correctly predicted they do have diabetes.
• True Negatives: we correctly predicted they don’t have diabetes.
• False Positives: we incorrectly predicted they do have diabetes (Type I error).
• False Negatives: we incorrectly predicted they don’t have diabetes (Type II error).
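The four categories can be counted directly from paired lists of actual and predicted labels. This is a minimal sketch; the confusion_counts helper and the eight sample labels are invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1  # predicted positive, actually positive
        elif not guess and not truth:
            counts["TN"] += 1  # predicted negative, actually negative
        elif guess and not truth:
            counts["FP"] += 1  # predicted positive, actually negative (Type I error)
        else:
            counts["FN"] += 1  # predicted negative, actually positive (Type II error)
    return counts


actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, True, False, False, False]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'TN': 3, 'FP': 1, 'FN': 2}
```

The four counts always add up to the number of observations, and metrics such as precision, recall, and specificity are all ratios of these counts.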

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep learning derives from a machine learning algorithm called the perceptron, or multi-layer perceptron, which has gained more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning is decades old, but today we have far more data and cheap computing power, which make it powerful enough to achieve state-of-the-art accuracy. This family of algorithms is known as artificial neural networks. Deep learning is much more than a traditional artificial neural network, but it was heavily influenced by machine learning’s neural networks and perceptron networks.

Dimension: In machine learning and data science, dimension has a different meaning than in physics. Here, the dimension of data is the number of features you have in your data set. For example, in an object detection application, the flattened image size and color channels (e.g. 28*28*3) give the features of the input set. In house price prediction, if house size is (maybe) the only feature in the data set, we call it 1-dimensional data.

Epoch: An epoch describes the number of times the algorithm sees the entire data set; one epoch is one complete pass over it.

Extrapolation: Making predictions outside the range of a dataset. E.g. My dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. “Color is blue” is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection: Feature selection is the process of selecting relevant features from a data set for creating a Machine Learning model.

Feature Vector: A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters: Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of a model. The depth of trees in a Decision Tree or the number of hidden layers in a Neural Network are examples of hyperparameters.

Induction: A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. We keep observing X, so we infer that Y must be True.

Instance: A data point, row, or sample in a dataset. Another term for observation.

Learning Rate: The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
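The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, whose lowest point is at x = 0; the function, the starting point, the step count, and the three learning rates are all assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, start=10.0, steps=50):
    """Minimize f(x) = x**2 by repeatedly stepping against its gradient, 2 * x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x
        x -= learning_rate * gradient  # the learning rate scales every update step
    return x


well_tuned = gradient_descent(learning_rate=0.1)    # ends very close to the minimum
too_low = gradient_descent(learning_rate=0.001)     # still far from 0 after 50 steps
too_high = gradient_descent(learning_rate=1.1)      # overshoots further on each step
```

With these settings, the well-tuned run ends within 0.01 of the minimum, the low learning rate has barely moved from its starting point of 10, and the high learning rate bounces from one side of the minimum to the other with growing magnitude.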

Loss: Loss = true value (from the data set) - predicted value (from the ML model). The lower the loss, the better the model (unless the model has over-fitted to the training data). The loss is calculated on the training and validation sets, and its interpretation is how well the model is doing on these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation sets.

Machine Learning: Contribute a definition!

Model: A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks: Contribute a definition!

Normalization: Contribute a definition!

Null Accuracy: Baseline accuracy that can be achieved by always predicting the most frequent class (“B has the highest frequency, so let’s guess B every time”).
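Null accuracy is straightforward to compute from the labels alone. In this minimal sketch, the null_accuracy helper and the ten sample labels are invented for illustration.

```python
from collections import Counter


def null_accuracy(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


labels = ["B", "A", "B", "C", "B", "A", "B", "B", "C", "B"]
baseline_label, baseline_accuracy = null_accuracy(labels)
print(baseline_label, baseline_accuracy)  # B 0.6
```

Any trained classifier on this data should be compared against the 60% baseline; a model with 62% accuracy is doing little better than always guessing B.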

Observation: A data point, row, or sample in a dataset. Another term for instance.

Overfitting: Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set but poorly on your test set (or new real-world data).

Parameters: Be the first to contribute!

Precision: In the context of binary classification (Yes/No), precision measures the model’s performance at classifying positive observations (i.e. “Yes”). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

\[ P = \frac{TruePositives}{TruePositives + FalsePositives} \]

Recall: Also called sensitivity. In the context of binary classification (Yes/No), recall measures how “sensitive” the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we “catch”? We could game this metric by always classifying observations as positive.

\[ R = \frac{TruePositives}{TruePositives + FalseNegatives} \]

Recall vs Precision: Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the % of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren’t helpful, however, because we missed 10 brain tumors. We were super precise when we tried, but we didn’t try hard enough.
• Recall, or Sensitivity, provides another lens through which to view how good our model is. Again, let’s say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
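The brain scan example can be worked through numerically. The sketch below takes the second scenario above (100 images, 10 with tumors, 1 flagged and correct) and also the opposite extreme of flagging every image; the function names are illustrative.

```python
def precision(true_positives, false_positives):
    """Share of positive predictions that were correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of actual positives that the model caught."""
    return true_positives / (true_positives + false_negatives)


# Cautious model: flags 1 scan, which really has a tumor, and misses the other 9.
cautious_precision = precision(true_positives=1, false_positives=0)   # 1.0
cautious_recall = recall(true_positives=1, false_negatives=9)         # 0.1

# "Always positive" model: flags all 100 scans, so it catches all 10 tumors
# but also raises 90 false alarms.
eager_precision = precision(true_positives=10, false_positives=90)    # 0.1
eager_recall = recall(true_positives=10, false_negatives=0)           # 1.0
```

Each extreme games one metric at the expense of the other, which is why the two are usually reported together.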

Regression: Predicting a continuous output (e.g. price, sales).

Regularization: Contribute a definition!

Reinforcement Learning: Training a model to maximize a reward via iterative trial and error.

Segmentation: Contribute a definition!

Specificity: In the context of binary classification (Yes/No), specificity measures the model’s performance at classifying negative observations (i.e. “No”). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

\[ S = \frac{TrueNegatives}{TrueNegatives + FalsePositives} \]

Supervised Learning: Training a model using a labeled dataset.

Test Set: A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set: A set of observations used to generate machine learning models.

Transfer Learning: Contribute a definition!

Type 1 Error: False Positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and they hire him, but he is actually bad.

Type 2 Error: False Negatives. The candidate was great, but the company passed on him.

Underfitting: Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem: A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can’t expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning: Training a model to find patterns in an unlabeled dataset (e.g. clustering).

Validation Set: A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.
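One common way to act on the validation set entry is early stopping: keep training while the validation error improves, and stop once it has failed to improve for a few epochs. The sketch below is an illustration under stated assumptions; the early_stopping_epoch helper, the patience parameter, and the loss values are invented and do not come from a particular library.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch with the best validation loss, stopping the scan once the
    loss has failed to improve for `patience` consecutive epochs."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error keeps rising: likely overfitting
    return best_epoch


# Training loss keeps falling, but validation loss turns around after epoch 3.
training_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
validation_losses = [0.95, 0.70, 0.58, 0.55, 0.57, 0.61, 0.66]
stop_at = early_stopping_epoch(validation_losses)
```

With these numbers `stop_at` is 3: the parameters from epoch 3 generalize best, even though the training loss is still improving afterwards.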

Variance: How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.
• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.
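The bias and variance metrics in this glossary can both be computed from a set of predictions made for the same observation, for example by models trained on different samples. This is a minimal sketch using Python's statistics module; the true value and the two sets of predictions are invented.

```python
from statistics import mean, pvariance


def bias_and_variance(predictions, true_value):
    """Bias: average difference from the true value. Variance: spread of the predictions."""
    bias = mean(predictions) - true_value
    variance = pvariance(predictions)
    return bias, variance


true_value = 10.0

# Tightly packed but consistently too high: high bias, low variance (underfitting pattern).
consistent_but_off = bias_and_variance([14.0, 14.5, 13.5, 14.0], true_value)

# Centred on the truth but widely scattered: low bias, high variance (overfitting pattern).
scattered_but_centred = bias_and_variance([6.0, 14.0, 8.0, 12.0], true_value)
```

The first result is `(4.0, 0.125)` and the second is `(0.0, 10.0)`. The second case also shows why a low average difference alone does not mean the predictions are good, since the errors above and below the true value cancel out.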

References


CHAPTER 13

Business

• Introduction
• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 14

Ethics

• Introduction
• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

References

CHAPTER 15

Mastering The Data Science Interview

• Introduction
• The Coding Challenge
• The HR Screen
• The Technical Call
• The Take Home Project
• The Onsite
• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that Data Science will be the sexiest job of the 21st Century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is that most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can have huge variations. This is partly because, as an industry, there still isn’t an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it’s important to determine what flavor of data science appeals to you. Based on your response to that, what you study and what questions you’ll be asked will vary. Despite the differences in the types, generally speaking, they’ll follow a similar interview loop, although the particular questions asked may vary. In this article, we’ll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you’re looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple Fizzbuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.

More often than not, you’ll be provided with written test cases that will tell you if you’ve passed or failed a question. This will typically consider both correctness and complexity (i.e. how long did it take to run your code). If you’re not provided with tests, it’s a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you’ll be tested on.
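As an illustration of the simplest kind of challenge mentioned above, here is a sketch of FizzBuzz together with a few self-written test cases. It assumes the usual statement of the problem (multiples of 3 give "Fizz", multiples of 5 give "Buzz", multiples of both give "FizzBuzz"); the exact wording varies between challenges, and the run_tests helper is a hypothetical example of writing your own tests.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:  # divisible by both 3 and 5, so check this case first
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


def run_tests():
    """Self-written test cases covering each branch plus a larger multiple of 15."""
    cases = {1: "1", 3: "Fizz", 5: "Buzz", 7: "7", 15: "FizzBuzz", 30: "FizzBuzz"}
    for n, expected in cases.items():
        actual = fizzbuzz(n)
        assert actual == expected, f"fizzbuzz({n}) returned {actual!r}, expected {expected!r}"
    return len(cases)


passed = run_tests()
```

Writing the cases before submitting catches the classic mistake of testing for 3 or 5 before 15, which would make 15 come out as "Fizz".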

When you’re doing a coding challenge, it’s important to keep in mind that companies aren’t always looking for the ‘correct’ solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don’t take it personally when, even after passing all the test cases, you didn’t get to the next stage in the interview process.

Preparation

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.
2. Review Brilliant for math and statistics questions.
3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.
2. Start with the hardest problem first; when you hit a snag, move to the simpler problem before returning to the harder one.
3. Focus on passing all the test cases first, then worry about improving complexity and readability.
4. If you’re done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.
5. It’s okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5-10 hours to complete. Unless you’re desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you’ll be given a few minutes to ask questions of your own.

Keep in mind that the person you’re speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person’s experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can’t answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.
2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.
3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.
2. Keep your resume in clear view.
3. Find a quiet space to take the interview. If that’s not possible, reschedule the interview.
4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night’s basketball game, let them.
5. Don’t bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

154 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally, you may be asked to write code in a Google doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving Dynamic Programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
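As a rough illustration of the brute-force-then-optimize workflow, here is a minimal Python sketch. The problem (does any pair of numbers in a list sum to a target?) and both function names are hypothetical, chosen only for illustration rather than taken from any particular company's interview.

```python
def has_pair_brute_force(nums, target):
    """First pass: check every pair. Simple to explain, O(n^2) time."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return True
    return False


def has_pair_optimized(nums, target):
    """Second pass: remember values seen so far in a set. O(n) time."""
    seen = set()
    for value in nums:
        # If the complement was seen earlier, the two values form a valid pair.
        if target - value in seen:
            return True
        seen.add(value)
    return False


if __name__ == "__main__":
    sample = [3, 8, 1, 5]
    print(has_pair_brute_force(sample, 9), has_pair_optimized(sample, 9))
```

Writing the slow version first gives you a reference to check the faster one against, and it gives the interviewer something concrete to react to while you talk through the optimization.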

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking The Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.
2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.
3. Look at Glassdoor to get some insight into the type of questions that may come up.
4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.
2. Practice mock technical calls with a friend or use a platform like interviewing.io.
3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied with what you'll be doing once you start working. They can either occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g. Here's a dataset, find some insights and pitch to business stakeholders) or focus on a more concrete deliverable (e.g. Here's some data, build a classifier).

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis. Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution with customer revenue).
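A minimal sketch of the kind of peculiarity check described above, assuming the revenue column arrives as a plain list of numbers. The data here is synthetic (Pareto draws standing in for a hypothetical customer revenue column), and the helper names `top_share` and `skew_summary` are illustrative, not part of SpeedML or any other library.

```python
import random
import statistics


def top_share(values, fraction=0.1):
    """Share of the total contributed by the largest `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


def skew_summary(values):
    """Mean-to-median ratio; values well above 1 hint at a heavy right tail."""
    return statistics.mean(values) / statistics.median(values)


# Hypothetical stand-in for a take-home revenue column (synthetic Pareto draws).
random.seed(0)
revenue = [random.paretovariate(1.2) for _ in range(10_000)]

print(f"top 10% of customers -> {top_share(revenue):.0%} of revenue")
print(f"mean / median = {skew_summary(revenue):.2f}")
```

If a small slice of customers accounts for most of the revenue, or the mean sits far above the median, that is worth calling out in your write-up, since it changes which summary statistics and models are appropriate.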

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or find by looking at the answers without the questions on this GitHub repo.
2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.
2. A good take-home can often offset any poor performance at an onsite. The rationale is that despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more Leetcode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.
3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.
4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which typically evaluates your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.
2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.
3. Have stories prepared for common behavioural interview questions such as 'Tell me about yourself', 'Why this company?', and 'Tell me about a time you had to deal with a difficult colleague'.
4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desk working on their assigned projects, so try your best to make it a pleasant experience for your interviewer.
2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.
3. Take advantage of bathroom and water breaks to recompose yourself.
4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.
5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially for those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point, it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the optionality to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, signing/relocation is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401k matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.
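To make the "higher multiplier" point concrete, here is a small arithmetic sketch. All figures (salaries, bonus, a flat 3% annual raise, a four-year horizon) are made-up assumptions, and the function name `cumulative_pay` is hypothetical. Benefits such as 401k matching, equity, and taxes are deliberately left out.

```python
def cumulative_pay(base, signing_bonus, annual_raise, years):
    """Total cash over `years`; the raise compounds on base salary only."""
    total = signing_bonus  # one-time payment, never compounds
    salary = base
    for _ in range(years):
        total += salary
        salary *= 1 + annual_raise  # next year's raise builds on this year's base
    return total


# Hypothetical offers: one with a signing bonus, one with a higher base instead.
offer_bonus = cumulative_pay(base=100_000, signing_bonus=10_000, annual_raise=0.03, years=4)
offer_base = cumulative_pay(base=105_000, signing_bonus=0, annual_raise=0.03, years=4)

print(f"higher signing bonus: {offer_bonus:,.0f}")
print(f"higher base salary:   {offer_base:,.0f}")
```

Under these assumptions the higher base comes out ahead over four years, because every raise is applied to the larger number. With a shorter horizon or a larger bonus the comparison can flip, so rerun the numbers for your own situation.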

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.
2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.
3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.
4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight on what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.

CHAPTER 16

Learning How To Learn

• Introduction
• Upgrading Your Learning Toolbox
• Common Learning Traps
• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall: This strategy consists of quizzing yourself on topics you've been learning, with the most common way of doing this being flashcards. One study showed a retention difference of 46% between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spaced repetition is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
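The review schedule above can be sketched in a few lines of Python. This assumes the 1, 3, 7, and 21 day marks are counted from the day you first studied the material (the text could also be read as gaps between successive reviews); the names `REVIEW_OFFSETS_DAYS` and `review_dates` are hypothetical.

```python
from datetime import date, timedelta

# Assumption: offsets are measured from the first study date, not between reviews.
REVIEW_OFFSETS_DAYS = (1, 3, 7, 21)


def review_dates(first_studied, offsets=REVIEW_OFFSETS_DAYS):
    """Return the dates on which a flashcard deck should be reviewed."""
    return [first_studied + timedelta(days=offset) for offset in offsets]


if __name__ == "__main__":
    for review_day in review_dates(date.today()):
        print(review_day.isoformat())
```

Flashcard applications typically automate this kind of scheduling for you, but writing the dates into a calendar works just as well when you are starting out.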

Chunking: This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, the computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving: To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type led to students not being able to tell the difference between problems later on, and showed that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined the technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example Chemistry and Biology, where you'll be able to note both similarities and differences.

Deliberate Practice: This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that deliberate practice is easier in some subjects than in others, the gold standard of deliberate practice generally consists of the following:

• A feedback loop in the form of a competition or a test
• A teacher that can guide you
• A uniquely tailored practice regimen
• Tasks outside of your comfort zone
• A concrete goal or performance target
• Full concentration while practicing
• Work that builds or modifies skills
• A known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher: An experienced teacher can mean the difference between breaking through or breaking down. As such, it's important to look for a teacher who does the following:

• Keeps you up to date on your specific progress
• Provides practice exercises
• Directs your attention to the aspects of learning you should be focusing on
• Helps develop correct mental representations
• Provides feedback on what errors you're making
• Gives you an aspirational model for what good performance looks like

Dynamic Testing: As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro technique: This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes, during which you focus on one single thing. Once the timer is up, you have a 5 or 10-minute break before starting again.
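If you'd like to plan a study block ahead of time, the timer pattern can be laid out as a simple schedule. This is a minimal sketch; the function name `pomodoro_schedule` and its parameters are hypothetical, and it only computes start and end times rather than running a live timer.

```python
from datetime import datetime, timedelta


def pomodoro_schedule(start, sessions, focus_minutes=25, break_minutes=5):
    """Return (label, start, end) blocks alternating focus and break periods."""
    blocks = []
    current = start
    for index in range(sessions):
        focus_end = current + timedelta(minutes=focus_minutes)
        blocks.append(("focus", current, focus_end))
        current = focus_end
        # No trailing break after the final focus session.
        if index < sessions - 1:
            break_end = current + timedelta(minutes=break_minutes)
            blocks.append(("break", current, break_end))
            current = break_end
    return blocks


if __name__ == "__main__":
    for label, begins, ends in pomodoro_schedule(datetime.now(), sessions=4):
        print(f"{label:5s} {begins:%H:%M} - {ends:%H:%M}")
```

Passing `focus_minutes=50, break_minutes=10` gives the longer variant mentioned above.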

16.3 Common Learning Traps

Authority Trap: The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.

Confirmation Bias: Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect: Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung: Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion: Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them; then, when tested, you fail to get any of the answers correct. Thus, having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking: Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy/agitated/upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.
2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.
3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.
4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.
5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.
6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

• https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7
• https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn
• https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a
• https://www.coursera.org/learn/learning-how-to-learn

CHAPTER 17

Communication

• Introduction
• Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 18

Product

• Introduction
• Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References

CHAPTER 19

Stakeholder Management

• Introduction
• Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

• Agriculture
• Art
• Biology
• Chemistry/Materials Science
• Climate/Weather
• Complex Networks
• Computer Networks
• Data Challenges
• Earth Science
• Economics
• Education
• Energy
• Finance
• GIS
• Government
• Healthcare
• Image Processing
• Machine Learning
• Museums
• Music
• Natural Language
• Neuroscience
• Physics
• Psychology/Cognition
• Public Domains
• Search Engines
• Social Networks
• Social Sciences
• Software
• Sports
• Time Series
• Transportation

20.1 Agriculture

• US Department of Agriculture's PLANTS Database
• US Department of Agriculture's Nutrient Database

20.2 Art

• Google's Quick Draw Sketch Dataset

20.3 Biology

• 1000 Genomes
• American Gut (Microbiome Project)
• Broad Bioimage Benchmark Collection (BBBC)
• Broad Cancer Cell Line Encyclopedia (CCLE)
• Cell Image Library
• Complete Genomics Public Data
• EBI ArrayExpress
• EBI Protein Data Bank in Europe
• Electron Microscopy Pilot Image Archive (EMPIAR)
• ENCODE project
• Ensembl Genomes
• Gene Expression Omnibus (GEO)
• Gene Ontology (GO)
• Global Biotic Interactions (GloBI)
• Harvard Medical School (HMS) LINCS Project
• Human Genome Diversity Project
• Human Microbiome Project (HMP)
• ICOS PSP Benchmark
• International HapMap Project
• Journal of Cell Biology DataViewer
• MIT Cancer Genomics Data
• NCBI Proteins
• NCBI Taxonomy
• NCI Genomic Data Commons
• NIH Microarray data or FTP (see FTP link on RAW)
• OpenSNP genotypes data
• Pathguide - Protein-Protein Interactions Catalog
• Protein Data Bank
• Psychiatric Genomics Consortium
• PubChem Project
• PubGene (now Coremine Medical)
• Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)
• Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
• Sequence Read Archive (SRA)
• Stanford Microarray Data
• Stowers Institute Original Data Repository
• Systems Science of Biological Dynamics (SSBD) Database
• The Cancer Genome Atlas (TCGA), available via Broad GDAC
• The Catalogue of Life
• The Personal Genome Project or PGP
• UCSC Public Data
• UniGene
• Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

• NIST Computational Chemistry Comparison and Benchmark Database - SRD 101
• Open Quantum Materials Database
• Citrination Public Datasets
• Khazana Project

20.5 Climate/Weather

• Actuaries Climate Index
• Australian Weather
• Aviation Weather Center - Consistent, timely and accurate weather information for the world airspace system
• Brazilian Weather - Historical data (In Portuguese)
• Canadian Meteorological Centre
• Climate Data from UEA (updated monthly)
• European Climate Assessment & Dataset
• Global Climate Data Since 1929
• NASA Global Imagery Browse Services
• NOAA Bering Sea Climate
• NOAA Climate Datasets
• NOAA Realtime Weather Models
• NOAA SURFRAD Meteorology and Radiation Datasets
• The World Bank Open Data Resources for Climate Change
• UEA Climatic Research Unit
• WorldClim - Global Climate Data
• WU Historical Weather Worldwide

20.6 Complex Networks

• AMiner Citation Network Dataset
• CrossRef DOI URLs
• DBLP Citation dataset
• DIMACS Road Networks Collection
• NBER Patent Citations
• Network Repository with Interactive Exploratory Analysis Tools
• NIST complex networks data collection
• Protein-protein interaction network
• PyPI and Maven Dependency Network
• Scopus Citation Database
• Small Network Data
• Stanford GraphBase (Steven Skiena)
• Stanford Large Network Dataset Collection
• Stanford Longitudinal Network Data Sources
• The Koblenz Network Collection
• The Laboratory for Web Algorithmics (UNIMI)
• The Nexus Network Repository
• UCI Network Data Repository
• UFL sparse matrix collection
• WSU Graph Database

20.7 Computer Networks

• 3.5B Web Pages from CommonCrawl 2012
• 53.5B Web clicks of 100K users in Indiana Univ.
• CAIDA Internet Datasets
• ClueWeb09 - 1B web pages
• ClueWeb12 - 733M web pages
• CommonCrawl Web Data over 7 years
• CRAWDAD Wireless datasets from Dartmouth Univ.
• Criteo click-through data
• OONI: Open Observatory of Network Interference - Internet censorship data
• Open Mobile Data by MobiPerf
• Rapid7 Sonar Internet Scans
• UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

• Bruteforce Database
• Challenges in Machine Learning
• CrowdANALYTIX dataX
• D4D Challenge of Orange
• DrivenData Competitions for Social Good
• ICWSM Data Challenge (since 2009)
• Kaggle Competition Data
• KDD Cup by Tencent 2012
• Localytics Data Visualization Challenge
• Netflix Prize
• Space Apps Challenge
• Telecom Italia Big Data Challenge
• TravisTorrent Dataset - MSR'2017 Mining Challenge
• Yelp Dataset Challenge

20.9 Earth Science

• AQUASTAT - Global water resources and uses
• BODC - marine data of ~22K vars
• Earth Models
• EOSDIS - NASA's earth observing system data
• Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3
• Marinexplore - Open Oceanographic Data
• Smithsonian Institution Global Volcano and Eruption Database
• USGS Earthquake Archives

20.10 Economics

• American Economic Association (AEA)
• EconData from UMD
• Economic Freedom of the World Data
• Historical MacroEconomic Statistics
• International Economics Database and various data tools
• International Trade Statistics
• Internet Product Code Database
• Joint External Debt Data Hub
• Jon Haveman International Trade Data Links
• OpenCorporates Database of Companies in the World
• Our World in Data
• SciencesPo World Trade Gravity Datasets
• The Atlas of Economic Complexity
• The Center for International Data
• The Observatory of Economic Complexity
• UN Commodity Trade Statistics
• UN Human Development Reports

20.11 Education

• College Scorecard Data
• Student Data from Free Code Camp

20.12 Energy

• AMPds
• BLUEd
• COMBED
• Dataport
• DRED
• ECO
• EIA
• HES - Household Electricity Study, UK
• HFED
• iAWE
• PLAID - the Plug Load Appliance Identification Dataset
• REDD
• Tracebase
• UK-DALE - UK Domestic Appliance-Level Electricity
• WHITED

20.13 Finance

• CBOE Futures Exchange
• Google Finance
• Google Trends
• NASDAQ
• NYSE Market Data (see FTP link on RAW)
• OANDA
• OSU Financial data
• Quandl
• St. Louis Federal
• Yahoo Finance

20.14 GIS

• ArcGIS Open Data portal
• Cambridge, MA, US, GIS data on GitHub
• Factual Global Location Data
• Geo Spatial Data from ASU
• Geo Wiki Project - Citizen-driven Environmental Monitoring
• GeoFabrik - OSM data extracted to a variety of formats and areas
• GeoNames Worldwide
• Global Administrative Areas Database (GADM)
• Homeland Infrastructure Foundation-Level Data
• Landsat 8 on AWS
• List of all countries in all languages
• National Weather Service GIS Data Portal
• Natural Earth - vectors and rasters of the world
• OpenAddresses
• OpenStreetMap (OSM)
• Pleiades - Gazetteer and graph of ancient places
• Reverse Geocoder using OSM data & additional high-resolution data files
• TIGER/Line - US boundaries and roads
• TwoFishes - Foursquare's coarse geocoder
• TZ Timezones shapefiles
• UN Environmental Data
• World boundaries from the US Department of State
• World countries in multiple formats

20.15 Government

• A list of cities and countries contributed by community
• Open Data for Africa
• OpenDataSoft's list of 1,600 open data

20.16 Healthcare

• EHDP Large Health Data Sets
• Gapminder World demographic databases
• Medicare Coverage Database (MCD), US
• Medicare Data Engine of medicare.gov Data
• Medicare Data File
• MeSH, the vocabulary thesaurus used for indexing articles for PubMed
• Number of Ebola Cases and Deaths in Affected Countries (2014)
• Open-ODS (structure of the UK NHS)
• OpenPaymentsData, Healthcare financial relationship data
• The Cancer Genome Atlas project (TCGA) and BigQuery table
• World Health Organization Global Health Observatory

20.17 Image Processing

• 10k US Adult Faces Database
• 2GB of Photos of Cats or Archive version
• Adience Unfiltered faces for gender and age classification
• Affective Image Classification
• Animals with attributes
• Caltech Pedestrian Detection Benchmark
• Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)
• Face Recognition Benchmark
• GDXray: X-ray images for X-ray testing and Computer Vision
• ImageNet (in WordNet hierarchy)
• Indoor Scene Recognition
• International Affective Picture System, UFL
• Massive Visual Memory Stimuli, MIT
• MNIST database of handwritten digits, near 1 million examples
• Several Shape-from-Silhouette Datasets
• Stanford Dogs Dataset
• SUN database, MIT
• The Action Similarity Labeling (ASLAN) Challenge
• The Oxford-IIIT Pet Dataset
• Violent-Flows - Crowd Violence / Non-violence Database and benchmark
• Visual genome
• YouTube Faces Database

20.18 Machine Learning

• Context-aware data sets from five domains
• Delve Datasets for classification and regression (Univ. of Toronto)
• Discogs Monthly Data
• eBay Online Auctions (2012)
• IMDb Database
• Keel Repository for classification, regression and time series
• Labeled Faces in the Wild (LFW)
• Lending Club Loan Data
• Machine Learning Data Set Repository
• Million Song Dataset
• More Song Datasets
• MovieLens Data Sets
• New Yorker caption contest ratings
• RDataMining - "R and Data Mining" ebook data
• Registered Meteorites on Earth
• Restaurants Health Score Data in San Francisco
• UCI Machine Learning Repository
• Yahoo Ratings and Classification Data
• Youtube 8m

20.19 Museums

• Canada Science and Technology Museums Corporation's Open Data
• Cooper-Hewitt's Collection Database
• Minneapolis Institute of Arts metadata
• Natural History Museum (London) Data Portal
• Rijksmuseum Historical Art Collection
• Tate Collection metadata
• The Getty vocabularies

20.20 Music

• Nottingham Folk Songs
• Bach 10

2021 Natural Language

bull Automatic Keyphrase Extracttion

bull Blogger Corpus

bull CLiPS Stylometry Investigation Corpus

bull ClueWeb09 FACC

bull ClueWeb12 FACC

bull DBpedia - 4.58M things with 583M facts

bull Flickr Personal Taxonomies

bull Freebase.com of people, places, and things

bull Google Books Ngrams (2.2TB)

bull Google MC-AFP, generated from the publicly available Gigaword dataset using Paragraph Vectors

bull Google Web 5gram (1TB, 2006)

bull Gutenberg eBooks List

bull Hansards text chunks of Canadian Parliament

bull Machine Comprehension Test (MCTest) of text from Microsoft Research

bull Machine Translation of European languages

bull Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

bull Multi-Domain Sentiment Dataset (version 2.0)

bull Open Multilingual Wordnet

bull Personae Corpus

bull SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

bull SMS Spam Collection in English

bull Universal Dependencies

bull USENET postings corpus of 2005~2011

bull Webhose - News/Blogs in multiple languages

bull Wikidata - Wikipedia databases

bull Wikipedia Links data - 40 Million Entities in Context

bull WordNet databases and tools

20.22 Neuroscience

bull Allen Institute Datasets

bull Brain Catalogue

bull Brainomics

bull CodeNeuro Datasets

bull Collaborative Research in Computational Neuroscience (CRCNS)

bull FCP-INDI

bull Human Connectome Project

bull NDAR

bull NeuroData

bull Neuroelectro

bull NIMH Data Archive

bull OASIS

bull OpenfMRI

bull Study Forrest

20.23 Physics

bull CERN Open Data Portal

bull Crystallography Open Database

bull NASA Exoplanet Archive

bull NSSDC (NASA) data of 550 spacecraft

bull Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

bull OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

bull Amazon

bull Archive-it from Internet Archive

bull Archive.org Datasets

bull CMU JASA data archive

bull CMU StatLab collections

bull Data.World

bull Data360

bull Datamob.org

bull Google

bull Infochimps

bull KDNuggets Data Collections

bull Microsoft Azure Data Market Free DataSets

bull Microsoft Data Science for Research

bull Numbray

bull Open Library Data Dumps

bull Reddit Datasets

bull RevolutionAnalytics Collection

bull Sample R data sets

bull Stats4Stem R data sets

bull StatSci.org

bull The Washington Post List

bull UCLA SOCR data collection

bull UFO Reports

bull Wikileaks 9/11 pager intercepts

bull Yahoo Webscope

20.26 Search Engines

bull Academic Torrents of data sharing from UMB

bull Datahub.io

bull DataMarket (Qlik)

bull Harvard Dataverse Network of scientific data

bull ICPSR (UMICH)

bull Institute of Education Sciences

bull National Technical Reports Library

bull Open Data Certificates (beta)

bull OpenDataNetwork - A search engine of all Socrata powered data portals

bull Statista.com - statistics and studies

bull Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

bull 72 hours gamergate Twitter Scrape

bull Ancestry.com Forum Dataset over 10 years

bull Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

bull CMU Enron Email of 150 users

bull EDRM Enron EMail of 151 users hosted on S3

bull Facebook Data Scrape (2005)

bull Facebook Social Networks from LAW (since 2007)

bull Foursquare from UMN/Sarwat (2013)

bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull YouTube Video Social Graph in 2007, 2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive

bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (ND-GAIN)

bull Open Crime and Policing Data in England, Wales and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1 from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database

bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks, port calls and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport, airline and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome machine learning

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python

bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF)

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, a modern computer vision library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality, with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph

bull CUDA - This is a fast C++/CUDA implementation of convolutional neural networks [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework, with fewer than 1000 lines of core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state-of-the-art machine learning easy to work with and integrate into existing applications

bull Distributed Machine learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library that works well with networks whose structure changes for every training instance. Written in C++ with bindings in Python

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and can process data in batch, online and distributed modes

bull LightGBM framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization and storage

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC - A fast parallel implementation of Connectionist Temporal Classification (CTC), on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ for segmenting/labeling sequential data & other Natural Language Processing tasks

bull CRFsuite for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user-defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based, R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression
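To make the idea behind an in-memory n-gram index concrete, the following is a minimal Python sketch. It is an illustration only, not the API of go-ngram or of any library listed here, and it omits compression entirely. The names `char_ngrams`, `NGramIndex`, `add` and `search` are hypothetical, and padding with `$` is an assumed convention.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`, padded with '$' (assumed convention)."""
    padded = "$" * (n - 1) + text.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class NGramIndex:
    """Hypothetical in-memory index: each n-gram maps to the ids of the strings containing it."""

    def __init__(self, n=3):
        self.n = n
        self.strings = []                 # id -> original string
        self.postings = defaultdict(set)  # n-gram -> set of string ids

    def add(self, text):
        """Store `text` and register all of its n-grams; returns the new id."""
        doc_id = len(self.strings)
        self.strings.append(text)
        for gram in char_ngrams(text, self.n):
            self.postings[gram].add(doc_id)
        return doc_id

    def search(self, query):
        """Rank stored strings by how many n-grams they share with `query`."""
        counts = defaultdict(int)
        for gram in char_ngrams(query, self.n):
            for doc_id in self.postings.get(gram, ()):
                counts[doc_id] += 1
        # Most shared n-grams first; ties broken by insertion order.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [self.strings[doc_id] for doc_id, _ in ranked]


index = NGramIndex()
for word in ["stemmer", "stemming", "snowball"]:
    index.add(word)
results = index.search("stemer")  # misspelled query still finds close matches
```

Because lookups go through shared n-grams rather than exact keys, a misspelled query such as "stemer" still ranks "stemmer" first.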

General-Purpose Machine Learning

bull gago - Multi-population, flexible, parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Corticalio as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system written in Java

bull Stanford English Tokenizer - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Tokens Regex

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text, starting with seed sets and using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK components in Java and is built on top of Apache UIMA

bull Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license

bull CogcompNLP developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application

Speech Recognition

bull CMU Sphinx - Open Source Toolkit For Speech Recognition, purely based on Java speech recognition library

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in javascript and coffeescript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3d plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - library written on Vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnet.js - ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means and agglomerative clustering

bull Node-fann bindings for Node.js

bull Kmeans.js - Simple Javascript implementation of the k-means algorithm, for node.js and the browser

bull LDA.js - LDA topic modeling for node.js

bull Learning.js - Javascript implementation of logistic regression/c4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration - just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has an advantage on large datasets and multi-threaded training

bull kalman - Kalman filter for Javascript

bull shaman - node.js library with support for both simple and multiple linear regression

bull ml.js - Machine learning and numerical analysis tools for Node.js and the Browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

bull simple-statistics as well as in nodejs

bull regression-js - A javascript library containing a collection of least squares fitting methods for finding a trend in a set of data

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance
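For readers unfamiliar with the term, a great circle distance is the shortest path between two points on the surface of a sphere. The sketch below computes it with the haversine formula in Python. It is an illustration under stated assumptions and not the API of the GreatCircle library: the function name `great_circle_km` is hypothetical, and the Earth is assumed to be a sphere with a mean radius of 6371.0088 km.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical approximation)


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Haversine distance in km between two points given as (lat, lon) in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the equator: from (0, 0) to (0, 90 degrees east).
quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)
```

Under the spherical assumption, `quarter` equals one quarter of the circumference, that is `math.pi / 2 * EARTH_RADIUS_KM`. Real geodesic distances on the ellipsoidal Earth differ slightly.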

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models mixed-effects models

bull Simple MCMC - basic mcmc sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata, SAS, and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library wrapped for Torch Provides and wraps the 180+ specialmathematical functions from the Cephes mathematical library developed by Stephen L Moshier It is usedamong many other places at the heart of SciPy

bull autograd - Autograd automatically differentiates native Torch code Inspired by the original Python version

bull graph - Graph package for Torch

bull randomkit - Numpyrsquos randomkit wrapped for Torch

bull signal - A signal processing toolbox for Torch-7 FFT DCT Hilbert cepstrums stft

bull nn - Neural Network package for Torch

bull torchnet - Framework for Torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's built-in nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn. RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns.

bull optim - An optimization library for Torch. SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp, and more.

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua, and then optimizing them with OpenGM.

bull spaghetti - Module for torch7 by MichaelMathieu

bull LuaSHKit - A Lua wrapper around the locality-sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images.

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos.

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms.

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules.

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - A technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning w/ PDFs, lectures, and code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB.

Data Analysis / Data Visualization

bull matlab_bgl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross platform wrapper of OpenCV which can be compiled in Mono to run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models, and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.

bull DiffSharp - An automatic differentiation library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and every day use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.

bull Sho - An environment to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples by trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift.

bull BPN-NeuralNetwork - This network can be used in product recommendation, user behavior analysis, data mining and data analysis

bull Multi-Perceptron-NeuralNetwork - A multi-perceptron neural network designed with unlimited hidden layers

bull KRHebbian-Algorithm - Hebbian algorithm for neural networks in machine learning

bull KRKmeans-Algorithm - Implements K-Means, the clustering and classification algorithm. It can be used in data mining and image compression.

bull KRFuzzyCMeans-Algorithm - Fuzzy C-Means, the fuzzy clustering / classification algorithm in machine learning. It can be used in data mining and image compression.

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library.

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.

bull Quepy - A python framework to transform natural language questions to queries in a database query language

bull TextBlob - Library for common natural language processing tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, and GIZA++ alignments.

bull python-ucto

bull python-frog

bull python-zpar - Python bindings for ZPar, a statistical part-of-speech-tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - Higher-level NLP built on spaCy

bull stanford-corenlp-python - Python wrapper for Stanford CoreNLP
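Several entries in the list above (Distance, editdistance, jellyfish) compute string distances. As a rough illustration of the quantities such libraries calculate, here is a minimal pure-Python sketch of Hamming and Levenshtein distance. The function names `hamming` and `levenshtein` are hypothetical and are not the API of any package listed here; the listed libraries are likely to be much faster.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))
    print(levenshtein("kitten", "sitting"))
```

The Levenshtein function keeps only two rows of the dynamic-programming table, so memory use grows with the shorter string rather than with the product of both lengths.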

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning.

bull machine learning - Automated build consisting of a web-interface and a set of programmatic-interfaces, backed by a NoSQL datastore

bull XGBoost Library

bull Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch, or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library.

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A library of machine learning models implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 - A Machine Learning library based on Theano

bull keras - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain.

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks - Implementation of image to image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Library for Bayesian hidden Markov and hidden semi-Markov models, focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing GPU-meta-programming code-generating array-oriented optimizing math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit.

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for my book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning.

bull pgmpy - A Python library for working with Probabilistic Graphical Models

bull DIGITS - A web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides better user experience.

bull rgf_python Library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more; it's easy to use and easy to learn for beginners

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Compiler to LLVM aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser.

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual, intuitive, and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in Python without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspel - raspell is an interface binding for ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community based data collection, packed in gem. Get list of pretty much anything (stop words, countries, non words) in txt, json or hash. Demo/Search for a list

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license

bull rustlearn - A machine learning framework featuring logistic regression, support vector machines, decision trees and random forests

bull rusty-machine - a pure-rust machine learning library

bull leaf - Open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz: Regularization for semiparametric additive hazards regression

bull arules - arules: Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso: Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm: Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta: A wrapper algorithm for all-relevant feature selection

bull bst - bst: Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training: Unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn: Classification, regression, feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist: Rule- and Instance-Based Regression Modeling

bull e1071 - e1071: Misc Functions of the Department of Statistics (e1071), TU Wien

bull earth - earth: Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning, Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree: Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc: Flexible procedures for clustering

bull frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS

bull gbm - gbm: Generalized Boosted Regression Models

bull glmnet - glmnet: Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso: Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg: Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deeplearning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda: Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred: Improved Predictors

bull kernlab - kernlab: Kernel-based Machine Learning Lab

bull klaR - klaR: Classification and visualization

bull lars - lars: Least Angle Regression, Lasso and Forward Stagewise

bull lasso2 - lasso2: L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg: Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree: Mapping, pruning, and graphing tree models

bull mboost - mboost: Model-Based Boosting

bull medley - medley: Blending regression models, using a greedy stepwise approach

bull mlr - mlr: Machine Learning in R

bull mvpart - mvpart: Multivariate partitioning

bull ncvreg - ncvreg Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree Oblique Trees for Classification Data

bull pamr - pamr Pam prediction analysis for microarrays

bull party - party A Laboratory for Recursive Partytioning

bull partykit - partykit A Toolkit for Recursive Partytioning

bull penalized penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA Penalized classification using Fisherrsquos linear discriminant

bull penalizedSVM - penalizedSVM Feature Selection SVM using penalty functions

bull quantregForest - quantregForest Quantile Regression Forests

bull randomForest - randomForest Breiman and Cutlerrsquos random forests for classification and regression

bull randomForestSRC

bull rattle - rattle Graphical user interface for data mining in R

bull rda - rda Shrunken Centroids Regularized Discriminant Analysis

bull rdetools in Feature Spaces

bull REEMtree Data

bull relaxo - relaxo Relaxed Lasso

bull rgenoud - rgenoud R version of GENetic Optimization Using Derivatives

bull rgp - rgp R genetic programming framework

bull Rmalschains in R

bull rminer in classification and regression

bull ROCR - ROCR Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart Recursive Partitioning and Regression Trees

bull RPMM - RPMM Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka RWeka interface

bull RXshrink - RXshrink Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath svmpath the SVM Path algorithm

bull tgp - tgp Bayesian treed Gaussian process models

bull tree - tree Classification and regression trees

bull varSelRF - varSelRF Variable selection using random forests

bull XGBoostR Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

2123 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE Includes all SAS packages necessary for data analysis and visualization and includes online SAS courses

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

2124 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference

Data Analysis Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data A unix filter for algebird-powered aggregation

bull PredictionIO - PredictionIO a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

2125 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression and Ridge Regression

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization Training Techniques

ndash Unsupervised Generative Models

ndash Image Segmentation Object Detection

ndash Image Video

ndash Natural Language Processing

ndash Speech Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

221 Machine Learning

Be the first to contribute

222 Deep Learning

Forked from terryum's awesome deep learning papers

2221 Understanding

bull Distilling the knowledge in a neural network (2015) G Hinton et al [pdf]

bull Deep neural networks are easily fooled High confidence predictions for unrecognizable images (2015) A Nguyen et al [pdf]

bull How transferable are features in deep neural networks (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf An astounding baseline for recognition (2014) A Razavian et al [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014) M Oquab et al [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

2222 Optimization Training Techniques

bull Batch normalization Accelerating deep network training by reducing internal covariate shift (2015) S Ioffe and C Szegedy [pdf]

bull Delving deep into rectifiers Surpassing human-level performance on imagenet classification (2015) K He et al [pdf]

bull Dropout A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]

2223 Unsupervised Generative Models

bull Pixel recurrent neural networks (2016) A Oord et al [pdf]

bull Improved techniques for training GANs (2016) T Salimans et al [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015) A Radford et al [pdf]

bull DRAW A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

2224 Image Segmentation Object Detection

bull You only look once Unified real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

2225 Image Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

2226 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

2227 Speech Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

2228 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

2229 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22210 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009) H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses and more, forked from josephmisiti's awesome machine learning

bull Blogs

ndash Data Science

ndash Machine learning

ndash Math

bull Books

ndash Machine learning

ndash Deep learning

ndash Probability & Statistics

ndash Linear Algebra

bull Courses

bull Podcasts

bull Tutorials

231 Blogs

2311 Data Science

bull https://jeremykun.com

bull http://iamtrask.github.io

bull http://blog.explainmydata.com

bull http://andrewgelman.com

bull http://simplystatistics.org

bull http://www.evanmiller.org

bull http://jakevdp.github.io

bull http://blog.yhat.com

bull http://wesmckinney.com

bull http://www.overkillanalytics.net

bull http://newton.cx/~peter

bull http://mbakker7.github.io/exploratory_computing_with_python

bull https://sebastianraschka.com/blog/index.html

bull http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

bull http://colah.github.io

bull http://www.thomasdimson.com

bull http://blog.smellthedata.com

bull https://sebastianraschka.com

bull http://dogdogfish.com

bull http://www.johnmyleswhite.com

bull http://drewconway.com/zia

bull http://bugra.github.io

bull http://opendata.cern.ch

bull https://alexanderetz.com

bull http://www.sumsar.net

bull https://www.countbayesie.com

bull http://blog.kaggle.com

bull http://www.danvk.org

bull http://hunch.net

bull http://www.randalolson.com/blog

bull https://www.johndcook.com/blog/r_language_for_programmers

bull http://www.dataschool.io

2312 Machine learning

bull OpenAI

bull Distill

bull Andrej Karpathy Blog

bull Colah's Blog

bull WildML

bull FastML

bull TheMorningPaper

2313 Math

bull http://www.sumsar.net

bull http://allendowney.blogspot.ca

bull https://healthyalgorithms.com

bull https://petewarden.com

bull http://mrtz.org/blog

232 Books

2321 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory Inference and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and SVN Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch

bull Bayesian Modeling Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning Book+MatlabToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

2322 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

bull NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

2323 Probability & Statistics

bull Think Stats - Book + Python Code

bull From Algorithms to Z-Scores - Book

bull The Art of R Programming

bull Introduction to statistical thought

bull Basic Probability Theory

bull Introduction to probability - By Dartmouth College

bull Principle of Uncertainty

bull Probability & Statistics Cookbook

bull Advanced Data Analysis From An Elementary Point of View

bull Introduction to Probability - Book and course by MIT

bull The Elements of Statistical Learning Data Mining Inference and Prediction - Book

bull An Introduction to Statistical Learning with Applications in R - Book

bull Learning Statistics Using R

bull Introduction to Probability and Statistics Using R - Book

bull Advanced R Programming - Book

bull Practical Regression and Anova using R - Book

bull R practicals - Book

bull The R Inferno - Book

2324 Linear Algebra

bull Linear Algebra Done Wrong

bull Linear Algebra Theory and Applications

bull Convex Optimization

bull Applied Numerical Computing

bull Applied Numerical Linear Algebra

233 Courses

bull CS231n Convolutional Neural Networks for Visual Recognition Stanford University

bull CS224d Deep Learning for Natural Language Processing Stanford University

bull Oxford Deep NLP 2017 Deep Learning for Natural Language Processing University of Oxford

bull Artificial Intelligence (Columbia University) - free

bull Machine Learning (Columbia University) - free

bull Machine Learning (Stanford University) - free

bull Neural Networks for Machine Learning (University of Toronto) - free

bull Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach, Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering & Retrieval, Machine Learning: Recommender Systems & Dimensionality Reduction, Machine Learning Capstone: An Intelligent Application with Deep Learning; free

bull Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

bull Learning from Data (by Yaser S Abu-Mostafa Caltech) - Lecture videos available

234 Podcasts

bull The O'Reilly Data Show

bull Partially Derivative

bull The Talking Machines

bull The Data Skeptic

bull Linear Digressions

bull Data Stories

bull Learning Machines 101

bull Not So Standard Deviations

bull TWIMLAI

235 Tutorials

Be the first to contribute

CHAPTER 24

Contribute

Become a contributor Check out our github for more information

CHAPTER 2

Big Data

bull Introduction

bull Subj_1

https://www.quora.com/What-is-a-Hadoop-ecosystem

21 Introduction

The tooling and methodologies to deal with the phenomenon of big data were originally created in response to the exponential growth in data generation (TODO INSERT quote about data generation). When talking about big data we mean (insert what we mean).

We'll be focusing exclusively on the Apache Hadoop Ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims.

Why It's Important

While a data scientist may not be directly involved in managing big data and performing Extract, Transform and Load (ETL) tasks, they'll likely be responsible, at a minimum, for querying sources where this data is stored. Ideally a data scientist should be familiar with how this data is stored, how to query it and how to transform it (e.g. Spark, Hadoop and Hive would be three key technologies to be familiar with).
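As a rough illustration of the Extract, Transform and Load steps mentioned above, here is a minimal sketch using only the Python standard library. It is a toy under stated assumptions: the CSV text, the `purchases` table and the `to_float` helper are made up for illustration, and an in-memory SQLite database stands in for a real big data store such as Hive.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from files or a message queue.
raw = io.StringIO("customer,amount\nada,10.5\ngrace,n/a\nada,4.5\n")

# Extract: read the raw rows as dictionaries keyed by the header line.
records = list(csv.DictReader(raw))


def to_float(text):
    """Return text as a float, or None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


# Transform: convert amounts and drop rows that cannot be parsed.
clean = [(r["customer"], to_float(r["amount"])) for r in records]
clean = [(customer, amount) for customer, amount in clean if amount is not None]

# Load: write the cleaned rows into a queryable store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", clean)

# The data scientist's side of the job: querying the loaded data.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM purchases GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Tools such as Spark or Hive apply the same three steps, but distribute them across many machines.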

22 Subj_1

Code

Some code

print("hello")

References

CHAPTER 3

Databases

bull Introduction

bull Relational

ndash Types

ndash Advantages

ndash Disadvantages

bull Non-relational/NoSQL

ndash Types

ndash Advantages

ndash Disadvantages

bull Terminology

31 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds there are a variety of database sub-types, which the following sections explore in detail.

Why It's Important

Data scientists often have to work with a database to retrieve the data they need to build their models or to do data analysis. It's therefore important that a data scientist be familiar with the different types of databases and how to query them.

32 Relational

A relational database is a collection of tables and the relationships that exist between these tables. Tables contain columns (table attributes) and rows that represent the data within the table. You can think of a column as the header indicating the name of the data being stored, and a row as the actual data points stored in the database. Rows are usually identified by a unique identifier called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.

To get data out of a relational database you would use SQL, with different relational databases using slightly different dialects of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a transaction. A transaction in a relational database represents a unit of work performed against the database.
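The ideas above (tables, primary keys, foreign keys and SQL queries) can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `authors` and `books` tables and their contents are invented for the example, and SQLite is just one convenient SQL dialect.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.execute("CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(author_id)
    )
    """
)

conn.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(10, "Notes", 1), (11, "Compilers", 2), (12, "Engines", 1)],
)

# The foreign key lets us join each book to its author.
rows = conn.execute(
    """
    SELECT books.title, authors.name
    FROM books
    JOIN authors ON books.author_id = authors.author_id
    ORDER BY books.book_id
    """
).fetchall()
print(rows)

# A row that points at a non-existent author is rejected by the foreign key.
try:
    conn.execute("INSERT INTO books VALUES (13, 'Orphan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```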

An important concept with transactions in relational databases is maintaining ACID, which stands for:

bull Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails then the whole transaction fails, leaving the database unchanged.

bull Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.

bull Isolation: Since transactions are often executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.

bull Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
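Atomicity and consistency are the easiest of these properties to see in a few lines of code. The sketch below again uses `sqlite3`; the `accounts` table, the `transfer` and `balances` helpers and the balance rule are hypothetical, chosen only to show a transaction being rolled back as a whole when one of its queries fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint defines what a valid state is: no negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move amount between two accounts as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)
after_ok = balances(conn)

# The second UPDATE would make alice's balance negative, so the whole
# transaction, including the credit to bob, is rolled back.
overdraft_ok = transfer(conn, "alice", "bob", 500)
after_failed = balances(conn)
print(ok, after_ok, overdraft_ok, after_failed)
```

Isolation and durability are harder to demonstrate in a single-process, in-memory example, so they are not covered here.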

321 Types

While there are many different kinds of relational databases the most popular are

bull PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional

bull Oracle: A multi-model database management system that is produced and marketed by Oracle

bull MySQL: An open-source relational database management system with a proprietary paid version available for additional functionality

bull Microsoft SQL Server: A relational database management system developed by Microsoft

bull MariaDB: A MySQL-compatible database engine forked from MySQL

322 Advantages

bull SQL standards are well defined and commonly accepted

bull Easy to categorize and store data that can later be queried

bull Simple to understand since the table structure and relationships are intuitive to most users

bull Data integrity through strong data typing and validity checks to make sure that data falls within an acceptable range

323 Disadvantages

bull Poor performance with unstructured data types due to schema and type constraints

bull Can be slow and not scalable compared to NoSQL

bull Unable to map certain kinds of data such as graphs

33 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Scalability, multi-structured data and geo-distribution were further reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

331 Types

bull Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to store user session data; examples include Redis and Amazon Dynamo.

bull Document Store: Similar to a key-value store, but the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.

bull Column Store: In this implementation data is stored in cells grouped in columns of data rather than rows of data, with columns being grouped into column families. These can be used in content management systems; examples include Cassandra and Apache HBase.

bull Graph Store: These are based on the Entity - Attribute - Value model. Entities have associated attributes and values when data is inserted. Nodes store data about each entity, along with the relationships between nodes. Graph stores can be used in applications such as social networks; examples include Neo4j and ArangoDB.
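To make the four data models above more concrete, here is a sketch that mimics each one with plain Python data structures. None of this uses a real NoSQL client; the keys, documents and relationships are invented, and real systems add persistence, indexing and distribution on top of these shapes.

```python
# Key-value store: an opaque value looked up by a unique key
# (a Python dict is itself a hash table).
sessions = {"session:42": "user=ada;cart=3"}
session_value = sessions["session:42"]

# Document store: the value is a structured document that can be filtered by field.
posts = {
    "post:1": {"title": "Hello", "tags": ["intro"], "author": {"name": "Ada"}},
    "post:2": {"title": "NoSQL", "tags": ["databases", "intro"], "author": {"name": "Grace"}},
    "post:3": {"title": "Joins", "tags": ["databases"], "author": {"name": "Ada"}},
}
intro_titles = sorted(p["title"] for p in posts.values() if "intro" in p["tags"])

# Column store: values are grouped by column, so aggregating one column
# does not require reading the others.
page_views = {"url": ["/a", "/b", "/a"], "ms": [120, 80, 100]}
mean_ms = sum(page_views["ms"]) / len(page_views["ms"])

# Graph store: nodes plus explicit relationships between them.
follows = {"ada": {"grace"}, "grace": {"linus"}, "linus": set()}
friends_of_friends = {fof for friend in follows["ada"] for fof in follows[friend]}

print(session_value, intro_titles, mean_ms, friends_of_friends)
```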

332 Advantages

bull High availability

bull Schema free or schema-on-read options

bull Ability to rapidly prototype applications

bull Elastic scalability

bull Can store massive amounts of data
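Elastic scalability usually relies on spreading data across several machines. One common way to do this (an assumption here, since the text does not name a technique) is hash-based partitioning, sketched below with in-memory dictionaries standing in for machines. The `shard_for` helper and the user IDs are hypothetical.

```python
import hashlib


def shard_for(key, n_shards):
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used to get the same answer on every machine.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


N_SHARDS = 3
shards = [dict() for _ in range(N_SHARDS)]  # each dict stands in for one machine

for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[shard_for(user_id, N_SHARDS)][user_id] = {"id": user_id}

# A lookup only needs to contact the shard that owns the key.
owner = shard_for("u4", N_SHARDS)
found = shards[owner]["u4"]
print(owner, found, [len(s) for s in shards])
```

A drawback of this simple modulo scheme is that changing the number of shards moves most keys, which is why production systems often use consistent hashing instead.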

333 Disadvantages

bull Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync

bull Less support and maturity in the NoSQL ecosystem
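The "out of sync" risk that comes with eventual consistency can be illustrated with a toy model. The `ToyCluster` class below is entirely hypothetical and much simpler than any real replication protocol: writes land on a primary copy and reach the replicas only when `sync()` runs.

```python
class ToyCluster:
    """Writes go to a primary copy; replicas catch up only when sync() runs."""

    def __init__(self, n_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # writes not yet copied to the replicas

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # Reads are served by a replica, which may lag behind the primary.
        return self.replicas[replica].get(key)

    def sync(self):
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


cluster = ToyCluster()
cluster.write("likes", 1)
stale = cluster.read("likes")  # the replica has not seen the write yet
cluster.sync()
fresh = cluster.read("likes")  # after syncing, the replicas agree with the primary
print(stale, fresh)
```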

34 Terminology

Query A query can be thought of as a single action that is taken on a database

Transaction A transaction is a sequence of queries that make up a single unit of work performed against a database

ACID Atomicity Consistency Isolation Durability

Schema A schema is the structure of a database

Scalability Scalability, where databases are concerned, has to do with how a database handles an increase in transactions as well as in data stored. The two main types are vertical scalability, which adds more capacity to a single machine (additional RAM, CPU, etc.), and horizontal scalability, which adds more machines and splits the work among them.

Normalization This is a technique for organizing tables within a relational database. It involves splitting data up into separate tables to reduce redundancy and improve data integrity.

Denormalization This is a technique for organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
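The terms above can be sketched with Python's built-in `sqlite3` module. The table names, columns, and sample rows are hypothetical, chosen only to show a transaction, a normalized pair of tables, and a denormalized copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Normalized schema: customer details live in one table and
# orders reference them by id, so nothing is stored twice.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "amount REAL)"
)

# Transaction: the inserts below are committed together,
# or rolled back together if any of them raises an error.
with conn:
    conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'London')")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)", [(1, 1, 20.0), (2, 1, 35.5)]
    )

JOIN_QUERY = (
    "SELECT o.id AS order_id, c.name AS name, c.city AS city, o.amount AS amount "
    "FROM orders o JOIN customers c ON c.id = o.customer_id"
)

# Reading the normalized data requires a JOIN.
joined = cur.execute(JOIN_QUERY + " ORDER BY o.id").fetchall()

# Denormalized copy: customer columns are repeated on every order row,
# so reads need no JOIN at the cost of redundant data.
cur.execute("CREATE TABLE orders_denormalized AS " + JOIN_QUERY)
flat = cur.execute(
    "SELECT order_id, name, city, amount FROM orders_denormalized ORDER BY order_id"
).fetchall()
```

Both `joined` and `flat` hold the same rows. The difference is where the cost sits: the normalized schema pays at read time (the JOIN), while the denormalized table pays in storage and in keeping repeated values consistent.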


CHAPTER 4

Data Engineering

bull Introduction

bull Subj_1

4.1 Introduction

xxx

4.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 5

Data Wrangling

bull Introduction

bull Subj_1

5.1 Introduction

xxx

5.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 6

Data Visualization

bull Introduction

bull Subj_1

6.1 Introduction

xxx

6.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 7

Deep Learning

bull Introduction

bull Subj_1

7.1 Introduction

xxx

7.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 8

Machine Learning

bull Introduction

bull Subj_1

8.1 Introduction

xxx

8.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 9

Python

bull Introduction

bull Subj_1

9.1 Introduction

xxx

9.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References


CHAPTER 11

SQL

bull Introduction

bull Subj_1

11.1 Introduction

xxx

11.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 12

Glossary

Definitions of common machine learning terms

Accuracy Percentage of correct predictions made by the model

Algorithm A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric What is the average difference between your predictions and the correct value for that observation?

bull Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in a low average difference.

bull High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.

Bias term Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise return False.

Clustering Unsupervised grouping of data into buckets

Confusion Matrix Table that describes the performance of a classification model by grouping predictions into 4 categories:

bull True Positives: we correctly predicted they do have diabetes

bull True Negatives: we correctly predicted they don't have diabetes

bull False Positives: we incorrectly predicted they do have diabetes (Type I error)

bull False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
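The four categories can be counted directly from paired labels. The sketch below is a minimal illustration in plain Python; the `confusion_counts` helper and the sample labels are hypothetical, with True standing for "has diabetes".

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, and FN for two equal-length lists of booleans."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for is_positive, guessed_positive in zip(actual, predicted):
        if guessed_positive and is_positive:
            counts["TP"] += 1  # correctly predicted positive
        elif not guessed_positive and not is_positive:
            counts["TN"] += 1  # correctly predicted negative
        elif guessed_positive and not is_positive:
            counts["FP"] += 1  # Type I error
        else:
            counts["FN"] += 1  # Type II error
    return counts


# Hypothetical ground truth and model output for six patients.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
counts = confusion_counts(actual, predicted)
```

Here the four counts always sum to the number of observations, which is a quick sanity check when building the table by hand.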

Continuous Variables Variables with a range of possible values defined by a number scale (e.g. sales, lifespan)

Deduction A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. we suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning Deep learning grew out of a machine learning algorithm called the perceptron, or multi-layer perceptron, which is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning is decades old, but today we have more and more data and cheap computing power, which make it powerful enough to achieve state-of-the-art accuracy. This family of algorithms is also known as artificial neural networks. Deep learning is much more than a traditional artificial neural network, but it was heavily influenced by machine learning's neural networks and perceptron networks.

Dimension Dimension in machine learning and data science differs from its meaning in physics. Here, the dimension of data means how many features you have in your dataset. For example, in an object detection application, the flattened image size and color channels (e.g. 28 x 28 x 3) give the number of features of the input; in house price prediction, if house size is the only feature, we call the data 1-dimensional.

Epoch An epoch is one complete pass of the algorithm over the entire dataset. The number of epochs is the number of times the algorithm sees the entire dataset.

Extrapolation Making predictions outside the range of a dataset. E.g. my dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute; "Color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection Feature selection is the process of selecting relevant features from a dataset for creating a machine learning model.

Feature Vector A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of the model. The depth of trees in a decision tree or the number of hidden layers in a neural network are examples of hyperparameters.

Induction A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. we keep observing X, so we infer that Y must be True.

Instance A data point row or sample in a dataset Another term for observation

Learning Rate The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
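The trade-off described above can be seen on a toy problem. This is a minimal sketch, assuming the function being minimized is f(x) = x**2 (gradient 2*x); the starting point, step count, and the three learning rates are arbitrary choices for illustration.

```python
def gradient_descent(learning_rate, start=5.0, steps=50):
    """Minimize f(x) = x**2 by gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x  # derivative of x**2
        x -= learning_rate * gradient  # step against the gradient
    return x


just_right = gradient_descent(0.1)  # shrinks steadily toward the minimum at 0
too_high = gradient_descent(1.1)    # overshoots further on every step
too_low = gradient_descent(0.001)   # barely moves in 50 steps
```

With these settings the moderate rate ends very close to 0, the high rate ends further from the minimum than where it started, and the low rate is still near its starting point.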

Loss Loss = true value (from the dataset) - predicted value (from the ML model). The lower the loss, the better the model (unless the model has overfitted to the training data). The loss is calculated on the training and validation sets, and it indicates how well the model is doing on each of them. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation set.

Machine Learning Contribute a definition

Model A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks Contribute a definition

Normalization Contribute a definition


Null Accuracy Baseline accuracy that can be achieved by always predicting the most frequent class ("B has the highest frequency, so let's guess B every time").

Observation A data point, row, or sample in a dataset. Another term for instance.

Overfitting Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set but poorly on your test set (or new real-world data).

Parameters Be the first to contribute

Precision In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}}$$

Recall Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the truly positive observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

$$R = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}}$$

Recall vs Precision Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

bull Precision is the % of True guesses that were actually correct. If we guess that 1 image out of 100 is True, and that image is actually True, then our precision is 100%. Our results aren't helpful, however, if there were 10 brain tumors and we missed 9 of them. We were super precise when we tried, but we didn't try hard enough.

bull Recall, or sensitivity, provides another lens with which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed that 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
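The brain scan numbers above can be checked with a few lines of Python. This is a minimal sketch; the helper names are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for convenience, not part of the definitions.

```python
def precision(true_positives, false_positives):
    """Share of positive predictions that were actually positive."""
    predicted_positive = true_positives + false_positives
    return true_positives / predicted_positive if predicted_positive else 0.0


def recall(true_positives, false_negatives):
    """Share of actual positives that the model caught."""
    actual_positive = true_positives + false_negatives
    return true_positives / actual_positive if actual_positive else 0.0


# 100 scans, 10 with tumors; the model flags one scan and is right about it.
tp, fp, fn = 1, 0, 9
p = precision(tp, fp)  # every positive guess was correct
r = recall(tp, fn)     # but only 1 of the 10 tumors was caught
```

Reporting both numbers side by side makes it harder to game either metric on its own.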

Regression Predicting a continuous output (e.g. price, sales)

Regularization Contribute a definition

Reinforcement Learning Training a model to maximize a reward via iterative trial and error

Segmentation Contribute a definition

Specificity In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{\text{TrueNegatives}}{\text{TrueNegatives} + \text{FalsePositives}}$$

Supervised Learning Training a model using a labeled dataset

Test Set A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set A set of observations used to generate machine learning models

Transfer Learning Contribute a definition

Type 1 Error False positives. Consider a company optimizing its hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and the company hires them, but they are actually bad.

Type 2 Error False negatives. The candidate was great, but the company passed on them.

Underfitting Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning Training a model to find patterns in an unlabeled dataset (e.g. clustering)

Validation Set A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.

Variance How tightly packed are your predictions for a particular observation relative to each other?

bull Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

bull High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.
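The bias metric and variance entries in this glossary can be illustrated together. The sketch below assumes we have several predictions for one observation (for example, from models trained on different samples); the `bias_and_variance` helper and the numbers are hypothetical.

```python
import statistics


def bias_and_variance(predictions, true_value):
    """Return (average prediction minus true value, spread of the predictions)."""
    bias = statistics.mean(predictions) - true_value
    variance = statistics.pvariance(predictions)  # population variance
    return bias, variance


# Predictions agree with each other but are consistently off: high bias, low variance.
consistent_but_wrong = bias_and_variance([14, 14, 14, 14], true_value=10)

# Predictions average out to the truth but disagree widely: low bias, high variance.
scattered_but_centered = bias_and_variance([6, 14, 8, 12], true_value=10)
```

The second case shows why low bias alone is not reassuring: errors above and below the true value cancel in the average while individual predictions remain far off.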

References


CHAPTER 13

Business

bull Introduction

bull Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 14

Ethics

bull Introduction

bull Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

References


CHAPTER 15

Mastering The Data Science Interview

bull Introduction

bull The Coding Challenge

bull The HR Screen

bull The Technical Call

bull The Take Home Project

bull The Onsite

bull The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that data science would be the sexiest job of the 21st century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is that most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from the social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can vary hugely. This is partly because, as an industry, there still isn't an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.


So before starting to search for a role, it's important to determine which flavor of data science appeals to you. Based on your answer, what you study and what questions you'll be asked will vary. Despite the differences between the types, they generally follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple FizzBuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.


More often than not, you'll be provided with written test cases that tell you whether you've passed or failed a question. These will typically consider both correctness and complexity (i.e. how long your code took to run). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.
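As a small illustration of writing your own tests, here is one possible FizzBuzz solution in Python with a handful of hand-written cases. The exact problem statement varies between platforms; this sketch assumes the common version (multiples of 3 give "Fizz", multiples of 5 give "Buzz", multiples of both give "FizzBuzz"), and the `TEST_CASES` table and `run_tests` helper are hypothetical names.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for a positive integer n."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


# Hand-written cases that exercise every branch at least once.
TEST_CASES = {1: "1", 3: "Fizz", 5: "Buzz", 15: "FizzBuzz", 30: "FizzBuzz"}


def run_tests():
    """Return the inputs whose output does not match the expected value."""
    return [n for n, expected in TEST_CASES.items() if fizzbuzz(n) != expected]


failures = run_tests()
```

An empty `failures` list means every case passed; listing the failing inputs rather than stopping at the first one makes it quicker to see which branch is wrong.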

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the 'correct' solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally if, even after passing all the test cases, you don't get to the next stage in the interview process.

Preparation

1. Practice questions on LeetCode, which has both SQL and traditional data structures/algorithms questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first. When you hit a snag, move to a simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay not to finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5–10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind that the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it is in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as CoderPad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google Doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving dynamic programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
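The "brute force first, then optimize" approach can be sketched on a classic practice problem. The problem itself (find the indices of two numbers that add up to a target) is a hypothetical example chosen here, not one the article names, and both function names are illustrative.

```python
def two_sum_brute_force(nums, target):
    """Check every pair of indices: O(n^2) time, O(1) extra space."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None  # no pair adds up to target


def two_sum_optimized(nums, target):
    """Remember values seen so far in a dict: O(n) time, O(n) extra space."""
    seen = {}  # value -> index where it was seen
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:
            return (i, j)
        seen[value] = j
    return None  # no pair adds up to target
```

Keeping the brute-force version around is useful during the call: it doubles as a reference implementation for checking the optimized one on small inputs.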

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking The Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied to what you'll be doing once you start working. They can occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g. here's a dataset, find some insights and pitch them to business stakeholders) or focus on a more concrete deliverable (e.g. here's some data, build a classifier).

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis (EDA). Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution in customer revenue).
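A first pass at looking for peculiarities does not need much tooling. The sketch below uses only Python's standard library on a made-up revenue column; the data, the `summarize` helper, and the "mean more than twice the median" rule of thumb for flagging a heavy right tail are all assumptions for illustration, not a method the article prescribes.

```python
import statistics

# Hypothetical take-home data: revenue per customer, with one missing value.
revenues = [12.0, 15.0, 9.0, None, 14.0, 11.0, 950.0, 13.0]


def summarize(values):
    """Return basic summary statistics, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "n_missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "max": max(present),
    }


summary = summarize(revenues)

# A mean far above the median hints at a heavy right tail,
# e.g. a handful of customers generating most of the revenue.
heavy_tail = summary["mean"] > 2 * summary["median"]
```

Checks like these are only a starting point; plotting the distribution is usually the next step before drawing any conclusion about its shape.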

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or find by looking at the answers without the questions on this GitHub repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite. The rationale is that, despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more LeetCode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which typically evaluates your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow the pool down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioral interview questions such as 'Tell me about yourself', 'Why this company?', and 'Tell me about a time you had to deal with a difficult colleague'.

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desks working on their assigned projects. So try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500 company, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable to many people, especially those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point, it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed back by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the option to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, the signing/relocation bonus is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401(k) matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.


Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight into what to expect in a data science interview loop.

The process also isn't perfect, and there will be times when you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.

CHAPTER 16

Learning How To Learn

• Introduction
• Upgrading Your Learning Toolbox
• Common Learning Traps
• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall: This strategy consists of quizzing yourself on topics you've been learning, most commonly with flashcards. One study showed a retention difference of 46% between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spacing is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
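The spaced repetition schedule above can be sketched in a few lines. This is only an illustration under one assumption: the 1, 3, 7, and 21 day intervals are counted from the day a card is first studied (the text does not say whether they are counted from the first session or from the previous review). The function names are hypothetical.

```python
from datetime import date, timedelta

# Intervals in days, taken from the text; assumed to be measured from the first study day.
REVIEW_INTERVALS = (1, 3, 7, 21)


def review_dates(first_studied, intervals=REVIEW_INTERVALS):
    """Return the dates on which a flashcard should be reviewed."""
    return [first_studied + timedelta(days=d) for d in intervals]


def due_today(first_studied, today, intervals=REVIEW_INTERVALS):
    """Return True if `today` is one of the scheduled review dates."""
    return today in review_dates(first_studied, intervals)


schedule = review_dates(date(2019, 3, 1))
for day in schedule:
    print(day.isoformat())
```

Flashcard tools typically automate this bookkeeping, but tracking the dates yourself is enough to get started.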

Chunking: This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, the computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving: To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type led to students not being able to tell the difference between problems later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example Chemistry and Biology, where you'll be able to note both similarities and differences.

Deliberate Practice: This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that deliberate practice is easier in some subjects than in others, the gold standard of deliberate practice generally consists of the following:

• A feedback loop in the form of a competition or a test
• A teacher who can guide you
• A uniquely tailored practice regimen
• Work outside of your comfort zone
• A concrete goal or performance to aim for
• Full concentration while practicing
• Building or modifying skills
• A known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher: An experienced teacher can mean the difference between breaking through or breaking down. As such, it's important to look for a teacher who:

• Keeps you up to date on specific progress
• Provides practice exercises
• Directs your attention to the aspects of learning you should be focusing on
• Helps develop correct mental representations
• Provides feedback on what errors you're making
• Gives you an aspirational model for what good performance looks like

Dynamic Testing: As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro Technique: This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes and focus on one single thing until it runs out. Once the timer is up, you take a 5- or 10-minute break before starting again.
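As a rough sketch, a study session can be laid out as alternating focus and break blocks. The function below is a hypothetical helper, not part of any library. It assumes the session ends on a focus block and that a break is only scheduled when another full focus block still fits afterwards.

```python
def pomodoro_schedule(total_minutes, focus=25, rest=5):
    """Split a session into (label, start_minute, end_minute) blocks."""
    blocks = []
    elapsed = 0
    while elapsed + focus <= total_minutes:
        blocks.append(("focus", elapsed, elapsed + focus))
        elapsed += focus
        # Only schedule a break if another full focus block fits after it.
        if elapsed + rest + focus <= total_minutes:
            blocks.append(("break", elapsed, elapsed + rest))
            elapsed += rest
        else:
            break
    return blocks


# A two-hour session with the 25/5 variant.
for label, start, end in pomodoro_schedule(120):
    print(f"{label:5s} {start:3d}-{end:3d} min")
```

Calling `pomodoro_schedule(120, focus=50, rest=10)` gives the longer 50/10 variant mentioned in the text.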

16.3 Common Learning Traps

Authority Trap: The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.

Confirmation Bias: Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect: Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung: Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion: Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them; then, when tested, you fail to get any of the answers correct. Having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking: Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy, agitated, or upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

• https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7
• https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn
• https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a
• https://www.coursera.org/learn/learning-how-to-learn

CHAPTER 17

Communication

• Introduction
• Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 18

Product

• Introduction
• Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References

CHAPTER 19

Stakeholder Management

• Introduction
• Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

• Agriculture
• Art
• Biology
• Chemistry/Materials Science
• Climate/Weather
• Complex Networks
• Computer Networks
• Data Challenges
• Earth Science
• Economics
• Education
• Energy
• Finance
• GIS
• Government
• Healthcare
• Image Processing
• Machine Learning
• Museums
• Music
• Natural Language
• Neuroscience
• Physics
• Psychology/Cognition
• Public Domains
• Search Engines
• Social Networks
• Social Sciences
• Software
• Sports
• Time Series
• Transportation

20.1 Agriculture

• US Department of Agriculture's PLANTS Database
• US Department of Agriculture's Nutrient Database

20.2 Art

• Google's Quick Draw Sketch Dataset

20.3 Biology

• 1000 Genomes
• American Gut (Microbiome Project)
• Broad Bioimage Benchmark Collection (BBBC)
• Broad Cancer Cell Line Encyclopedia (CCLE)
• Cell Image Library
• Complete Genomics Public Data
• EBI ArrayExpress
• EBI Protein Data Bank in Europe
• Electron Microscopy Pilot Image Archive (EMPIAR)
• ENCODE project
• Ensembl Genomes
• Gene Expression Omnibus (GEO)
• Gene Ontology (GO)
• Global Biotic Interactions (GloBI)
• Harvard Medical School (HMS) LINCS Project
• Human Genome Diversity Project
• Human Microbiome Project (HMP)
• ICOS PSP Benchmark
• International HapMap Project
• Journal of Cell Biology DataViewer
• MIT Cancer Genomics Data
• NCBI Proteins
• NCBI Taxonomy
• NCI Genomic Data Commons
• NIH Microarray data or FTP (see FTP link on RAW)
• OpenSNP genotypes data
• Pathguide - Protein-Protein Interactions Catalog
• Protein Data Bank
• Psychiatric Genomics Consortium
• PubChem Project
• PubGene (now Coremine Medical)
• Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)
• Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
• Sequence Read Archive (SRA)
• Stanford Microarray Data
• Stowers Institute Original Data Repository
• Systems Science of Biological Dynamics (SSBD) Database
• The Cancer Genome Atlas (TCGA), available via Broad GDAC
• The Catalogue of Life
• The Personal Genome Project or PGP
• UCSC Public Data
• UniGene
• Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

• NIST Computational Chemistry Comparison and Benchmark Database - SRD 101
• Open Quantum Materials Database
• Citrination Public Datasets
• Khazana Project

20.5 Climate/Weather

• Actuaries Climate Index
• Australian Weather
• Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system
• Brazilian Weather - Historical data (In Portuguese)
• Canadian Meteorological Centre
• Climate Data from UEA (updated monthly)
• European Climate Assessment & Dataset
• Global Climate Data Since 1929
• NASA Global Imagery Browse Services
• NOAA Bering Sea Climate
• NOAA Climate Datasets
• NOAA Realtime Weather Models
• NOAA SURFRAD Meteorology and Radiation Datasets
• The World Bank Open Data Resources for Climate Change
• UEA Climatic Research Unit
• WorldClim - Global Climate Data
• WU Historical Weather Worldwide

20.6 Complex Networks

• AMiner Citation Network Dataset
• CrossRef DOI URLs
• DBLP Citation dataset
• DIMACS Road Networks Collection
• NBER Patent Citations
• Network Repository with Interactive Exploratory Analysis Tools
• NIST complex networks data collection
• Protein-protein interaction network
• PyPI and Maven Dependency Network
• Scopus Citation Database
• Small Network Data
• Stanford GraphBase (Steven Skiena)
• Stanford Large Network Dataset Collection
• Stanford Longitudinal Network Data Sources
• The Koblenz Network Collection
• The Laboratory for Web Algorithmics (UNIMI)
• The Nexus Network Repository
• UCI Network Data Repository
• UFL sparse matrix collection
• WSU Graph Database

20.7 Computer Networks

• 3.5B Web Pages from CommonCrawl 2012
• 53.5B Web clicks of 100K users in Indiana Univ.
• CAIDA Internet Datasets
• ClueWeb09 - 1B web pages
• ClueWeb12 - 733M web pages
• CommonCrawl Web Data over 7 years
• CRAWDAD Wireless datasets from Dartmouth Univ.
• Criteo click-through data
• OONI Open Observatory of Network Interference - Internet censorship data
• Open Mobile Data by MobiPerf
• Rapid7 Sonar Internet Scans
• UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

• Bruteforce Database
• Challenges in Machine Learning
• CrowdANALYTIX dataX
• D4D Challenge of Orange
• DrivenData Competitions for Social Good
• ICWSM Data Challenge (since 2009)
• Kaggle Competition Data
• KDD Cup by Tencent 2012
• Localytics Data Visualization Challenge
• Netflix Prize
• Space Apps Challenge
• Telecom Italia Big Data Challenge
• TravisTorrent Dataset - MSR'2017 Mining Challenge
• Yelp Dataset Challenge

20.9 Earth Science

• AQUASTAT - Global water resources and uses
• BODC - marine data of ~22K vars
• Earth Models
• EOSDIS - NASA's earth observing system data
• Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3
• Marinexplore - Open Oceanographic Data
• Smithsonian Institution Global Volcano and Eruption Database
• USGS Earthquake Archives

20.10 Economics

• American Economic Association (AEA)
• EconData from UMD
• Economic Freedom of the World Data
• Historical MacroEconomic Statistics
• International Economics Database and various data tools
• International Trade Statistics
• Internet Product Code Database
• Joint External Debt Data Hub
• Jon Haveman International Trade Data Links
• OpenCorporates Database of Companies in the World
• Our World in Data
• SciencesPo World Trade Gravity Datasets
• The Atlas of Economic Complexity
• The Center for International Data
• The Observatory of Economic Complexity
• UN Commodity Trade Statistics
• UN Human Development Reports

20.11 Education

• College Scorecard Data
• Student Data from Free Code Camp

20.12 Energy

• AMPds
• BLUEd
• COMBED
• Dataport
• DRED
• ECO
• EIA
• HES - Household Electricity Study, UK
• HFED
• iAWE
• PLAID - the Plug Load Appliance Identification Dataset
• REDD
• Tracebase
• UK-DALE - UK Domestic Appliance-Level Electricity
• WHITED

20.13 Finance

• CBOE Futures Exchange
• Google Finance
• Google Trends
• NASDAQ
• NYSE Market Data (see FTP link on RAW)
• OANDA
• OSU Financial data
• Quandl
• St. Louis Federal
• Yahoo Finance

20.14 GIS

• ArcGIS Open Data portal
• Cambridge, MA, US, GIS data on GitHub
• Factual Global Location Data
• Geo Spatial Data from ASU
• Geo Wiki Project - Citizen-driven Environmental Monitoring
• GeoFabrik - OSM data extracted to a variety of formats and areas
• GeoNames Worldwide
• Global Administrative Areas Database (GADM)
• Homeland Infrastructure Foundation-Level Data
• Landsat 8 on AWS
• List of all countries in all languages
• National Weather Service GIS Data Portal
• Natural Earth - vectors and rasters of the world
• OpenAddresses
• OpenStreetMap (OSM)
• Pleiades - Gazetteer and graph of ancient places
• Reverse Geocoder using OSM data & additional high-resolution data files
• TIGER/Line - US boundaries and roads
• TwoFishes - Foursquare's coarse geocoder
• TZ Timezones shapefiles
• UN Environmental Data
• World boundaries from the US Department of State
• World countries in multiple formats

20.15 Government

• A list of cities and countries contributed by community
• Open Data for Africa
• OpenDataSoft's list of 1,600 open data

20.16 Healthcare

• EHDP Large Health Data Sets
• Gapminder World demographic databases
• Medicare Coverage Database (MCD), US
• Medicare Data Engine of medicare.gov Data
• Medicare Data File
• MeSH, the vocabulary thesaurus used for indexing articles for PubMed
• Number of Ebola Cases and Deaths in Affected Countries (2014)
• Open-ODS (structure of the UK NHS)
• OpenPaymentsData, Healthcare financial relationship data
• The Cancer Genome Atlas project (TCGA) and BigQuery table
• World Health Organization Global Health Observatory

20.17 Image Processing

• 10k US Adult Faces Database
• 2GB of Photos of Cats or Archive version
• Adience, Unfiltered faces for gender and age classification
• Affective Image Classification
• Animals with attributes
• Caltech Pedestrian Detection Benchmark
• Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)
• Face Recognition Benchmark
• GDXray, X-ray images for X-ray testing and Computer Vision
• ImageNet (in WordNet hierarchy)
• Indoor Scene Recognition
• International Affective Picture System, UFL
• Massive Visual Memory Stimuli, MIT
• MNIST database of handwritten digits, near 1 million examples
• Several Shape-from-Silhouette Datasets
• Stanford Dogs Dataset
• SUN database, MIT
• The Action Similarity Labeling (ASLAN) Challenge
• The Oxford-IIIT Pet Dataset
• Violent-Flows - Crowd Violence/Non-violence Database and benchmark
• Visual genome
• YouTube Faces Database

20.18 Machine Learning

• Context-aware data sets from five domains
• Delve Datasets for classification and regression (Univ. of Toronto)
• Discogs Monthly Data
• eBay Online Auctions (2012)
• IMDb Database
• Keel Repository for classification, regression, and time series
• Labeled Faces in the Wild (LFW)
• Lending Club Loan Data
• Machine Learning Data Set Repository
• Million Song Dataset
• More Song Datasets
• MovieLens Data Sets
• New Yorker caption contest ratings
• RDataMining - "R and Data Mining" ebook data
• Registered Meteorites on Earth
• Restaurants Health Score Data in San Francisco
• UCI Machine Learning Repository
• Yahoo Ratings and Classification Data
• Youtube 8m

20.19 Museums

• Canada Science and Technology Museums Corporation's Open Data
• Cooper-Hewitt's Collection Database
• Minneapolis Institute of Arts metadata
• Natural History Museum (London) Data Portal
• Rijksmuseum Historical Art Collection
• Tate Collection metadata
• The Getty vocabularies

20.20 Music

• Nottingham Folk Songs
• Bach 10

20.21 Natural Language

• Automatic Keyphrase Extraction
• Blogger Corpus
• CLiPS Stylometry Investigation Corpus
• ClueWeb09 FACC
• ClueWeb12 FACC
• DBpedia - 4.58M things with 583M facts
• Flickr Personal Taxonomies
• Freebase.com of people, places, and things
• Google Books Ngrams (2.2TB)
• Google MC-AFP, generated based on the publicly available Gigaword dataset using Paragraph Vectors
• Google Web 5gram (1TB, 2006)
• Gutenberg eBooks List
• Hansards text chunks of Canadian Parliament
• Machine Comprehension Test (MCTest) of text from Microsoft Research
• Machine Translation of European languages
• Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
• Multi-Domain Sentiment Dataset (version 2.0)
• Open Multilingual Wordnet
• Personae Corpus
• SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
• SMS Spam Collection in English
• Universal Dependencies
• USENET postings corpus of 2005~2011
• Webhose - News/Blogs in multiple languages
• Wikidata - Wikipedia databases
• Wikipedia Links data - 40 Million Entities in Context
• WordNet databases and tools

20.22 Neuroscience

• Allen Institute Datasets
• Brain Catalogue
• Brainomics
• CodeNeuro Datasets
• Collaborative Research in Computational Neuroscience (CRCNS)
• FCP-INDI
• Human Connectome Project
• NDAR
• NeuroData
• Neuroelectro
• NIMH Data Archive
• OASIS
• OpenfMRI
• Study Forrest

20.23 Physics

• CERN Open Data Portal
• Crystallography Open Database
• NASA Exoplanet Archive
• NSSDC (NASA) data of 550 spacecraft
• Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

• OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

• Amazon
• Archive-it from Internet Archive
• Archive.org Datasets
• CMU JASA data archive
• CMU StatLab collections
• Data.World
• Data360
• Datamob.org
• Google
• Infochimps
• KDNuggets Data Collections
• Microsoft Azure Data Market Free DataSets
• Microsoft Data Science for Research
• Numbray
• Open Library Data Dumps
• Reddit Datasets
• RevolutionAnalytics Collection
• Sample R data sets
• Stats4Stem R data sets
• StatSci.org
• The Washington Post List
• UCLA SOCR data collection
• UFO Reports
• Wikileaks 9/11 pager intercepts
• Yahoo Webscope

20.26 Search Engines

• Academic Torrents of data sharing from UMB
• Datahub.io
• DataMarket (Qlik)
• Harvard Dataverse Network of scientific data
• ICPSR (UMICH)
• Institute of Education Sciences
• National Technical Reports Library
• Open Data Certificates (beta)
• OpenDataNetwork - A search engine of all Socrata powered data portals
• Statista.com - statistics and Studies
• Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

• 72 hours gamergate Twitter Scrape
• Ancestry.com Forum Dataset over 10 years
• Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape
• CMU Enron Email of 150 users
• EDRM Enron EMail of 151 users, hosted on S3
• Facebook Data Scrape (2005)
• Facebook Social Networks from LAW (since 2007)
• Foursquare from UMN/Sarwat (2013)
• GitHub Collaboration Archive
• Google Scholar citation relations
• High-Resolution Contact Networks from Wearable Sensors
• Mobile Social Networks from UMASS
• Network Twitter Data
• Reddit Comments
• Skytrax' Air Travel Reviews Dataset
• Social Twitter Data
• SourceForge.net Research Data
• Twitter Data for Online Reputation Management
• Twitter Data for Sentiment Analysis
• Twitter Graph of entire Twitter site
• Twitter Scrape Calufa May 2011
• UNIMI/LAW Social Network Datasets
• Yahoo Graph and Social Data
• Youtube Video Social Graph in 2007/2008

20.28 Social Sciences

• ACLED (Armed Conflict Location & Event Data Project)
• Canadian Legal Information Institute
• Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.
• Correlates of War Project
• Cryptome Conspiracy Theory Items
• Datacards
• European Social Survey
• FBI Hate Crime 2013 - aggregated data
• Fragile States Index
• GDELT Global Events Database
• General Social Survey (GSS) since 1972
• German Social Survey
• Global Religious Futures Project
• Humanitarian Data Exchange
• INFORM Index for Risk Management
• Institute for Demographic Studies
• International Networks Archive
• International Social Survey Program ISSP
• International Studies Compendium Project
• James McGuire Cross National Data
• MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste
• Minnesota Population Center
• MIT Reality Mining Dataset
• Notre Dame Global Adaptation Index (ND-GAIN)
• Open Crime and Policing Data in England, Wales, and Northern Ireland
• Paul Hensel General International Data Page
• PewResearch Internet Survey Project
• PewResearch Society Data Collection
• Political Polarity Data
• StackExchange Data Explorer
• Terrorism Research and Analysis Consortium
• Texas Inmates Executed Since 1984
• Titanic Survival Data Set or on Kaggle
• UCB's Archive of Social Science Data (D-Lab)
• UCLA Social Sciences Data Archive
• UN Civil Society Database
• Universities Worldwide
• UPJOHN for Labor Employment Research
• Uppsala Conflict Data Program
• World Bank Open Data
• WorldPop project - Worldwide human population distributions

20.29 Software

• FLOSSmole data about free, libre, and open source software development

20.30 Sports

• Basketball (NBA/NCAA/Euro) Player Database and Statistics
• Betfair Historical Exchange Data
• Cricsheet Matches (cricket)
• Ergast Formula 1, from 1950 up to date (API)
• Football/Soccer resources (data and APIs)
• Lahman's Baseball Database
• Pinhooker Thoroughbred Bloodstock Sale Data
• Retrosheet Baseball Statistics
• Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams, and Match Charting Project

20.31 Time Series

• Databanks International Cross National Time Series Data Archive
• Hard Drive Failure Rates
• Heart Rate Time Series from MIT
• Time Series Data Library (TSDL) from MU
• UC Riverside Time Series Dataset

20.32 Transportation

• Airlines OD Data 1987-2008
• Bay Area Bike Share Data
• Bike Share Systems (BSS) collection
• GeoLife GPS Trajectory from Microsoft Research
• German train system by Deutsche Bahn
• Hubway Million Rides in MA
• Marine Traffic - ship tracks, port calls, and more
• Montreal BIXI Bike Share
• NYC Taxi Trip Data 2009-
• NYC Taxi Trip Data 2013 (FOIA/FOILed)
• NYC Uber trip data April 2014 to September 2014
• Open Traffic collection
• OpenFlights - airport, airline, and route data
• Philadelphia Bike Share Stations (JSON)
• Plane Crash Database, since 1920
• RITA Airline On-Time Performance data
• RITA/BTS transport data collection (TranStat)
• Toronto Bike Share Stations (XML file)
• Transport for London (TFL)
• Travel Tracker Survey (TTS) for Chicago
• US Bureau of Transportation Statistics (BTS)
• US Domestic Flights 1990 to 2009
• US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome machine learning.

• APL
• C
• C++
• Common Lisp
• Clojure
• Elixir
• Erlang
• Go
• Haskell
• Java
• Javascript
• Julia
• Lua
• Matlab
• .NET
• Objective C
• OCaml
• PHP
• Python
• Ruby
• Rust
• R
• SAS
• Scala
• Swift

21.1 APL

General-Purpose Machine Learning

• naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

• Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.
• Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF).
• Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms.

Computer Vision

• CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library
• VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox.

Speech Recognition

• HTK - The Hidden Markov Model Toolkit. HTK is a portable toolkit for building and manipulating hidden Markov models.

21.3 C++

Computer Vision

• DLib - DLib has C++ and Python interfaces for face detection and training general object detectors.
• EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models.
• OpenCV - OpenCV has C++, C, Python, Java, and MATLAB interfaces and supports Windows, Linux, Android, and Mac OS.
• VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality, with Python bindings.

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness readability and speed in mind [DEEP LEARN-ING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series ofcomputational steps via a directed graph

bull CUDA - This is a fast C++CUDA implementation of convolutional [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework with less than 1000 lines core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11 It makes state of the art machine learningeasy to work with and integrate into existing applications

bull Disrtibuted Machine learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to imbed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUswhich emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that changefor every training instance Written in C++ with bindings in Python

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high-performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and can process data in batch, online, and distributed modes.

bull LightGBM - A framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs.

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionality needed to deal with big data processing, statistical analysis, visualization, and storage.

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP.

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC - A parallel implementation of Connectionist Temporal Classification (CTC) on both CPU and GPU

bull XGBoost - A parallelized, optimized, general-purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ - Conditional Random Fields (CRF) implementation for segmenting/labeling sequential data and other Natural Language Processing tasks

bull CRFsuite - Conditional Random Fields (CRF) implementation for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.
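
The n-grams and skipgrams mentioned in the colibri-core entry above can be illustrated with a minimal Python sketch. This is not colibri-core's API: the function names `ngrams` and `skip_bigrams` are hypothetical, and "skipgram" is taken here to mean an ordered pair of tokens with a bounded number of tokens between them, which is only one of several definitions in use.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_skip):
    """Return ordered token pairs with at most max_skip tokens between them.

    Assumption: this bounded-gap pair definition is chosen for illustration;
    other tools define skipgrams differently.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # The right-hand token may sit up to max_skip + 1 positions further on.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


sentence = "to be or not to be".split()
print(ngrams(sentence, 3))
print(skip_bigrams(sentence, 1))
```

With `max_skip=0` the skip-bigrams reduce to ordinary bigrams, which is a quick sanity check on the gap handling.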

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

Sequence Analysis

bull ToPS - An object-oriented framework that facilitates the integration of probabilistic models for sequences over a user-defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl - Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based, R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer - A stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression

General-Purpose Machine Learning

bull gago - Multi-population flexible parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Cortical.io - An API for performing NLP operations as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Named Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford English Tokenizer - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Tokens Regex

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text, starting with seed sets and using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK - A framework for developing statistical natural language processing (NLP) components in Java, built on top of Apache UIMA

bull Apache cTAKES - An open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license.

bull CogcompNLP - Core NLP libraries developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka, with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams, with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - Statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML - A machine learning (ML) language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application.

Speech Recognition

bull CMU Sphinx - Open source toolkit for speech recognition; a speech recognition library based purely on Java

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in JavaScript and CoffeeScript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chartjs

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3D plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - Customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - Library written in vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnet.js - ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means and agglomerative clustering

bull Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js

bull Kmeans.js - Simple Javascript implementation of the k-means algorithm, for Node.js and the browser

bull LDA.js - LDA topic modeling for Node.js

bull Learning.js - Javascript implementation of logistic regression/c4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration; just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for Node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for Node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has an advantage on large datasets and multi-threaded training.

bull kalman - Kalman filter for Javascript

bull shaman - Node.js library with support for both simple and multiple linear regression

bull ml.js - Machine learning and numerical analysis tools for Node.js and the Browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

bull simple-statistics - Statistics in JavaScript, designed to work in browsers as well as in Node.js

bull regression-js - A javascript library containing a collection of least squares fitting methods for finding a trend in a set of data

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance
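
As an illustration of what a great-circle distance calculation involves, here is a minimal Python sketch using the haversine formula. It is not taken from the GreatCircle library; the function name `great_circle_km` and the default mean Earth radius of 6371 km are assumptions made for this example, and the Earth is treated as a perfect sphere.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points given in decimal degrees.

    Assumption: a spherical Earth with the given mean radius.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


# Two antipodal points on the equator are half a circumference apart.
print(great_circle_km(0.0, 0.0, 0.0, 180.0))
```

The haversine form is chosen here because it stays numerically stable for short distances, where the plain spherical law of cosines loses precision.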

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression, so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models - A Julia package for fitting mixed-effects models

bull Simple MCMC - Basic MCMC sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba - Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata, SAS, and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy.

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version.

bull graph - Graph package for Torch

bull randomkit - Numpy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - Framework for Torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's built-in nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns.

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua and then optimizing them with OpenGM.

bull sphagetti - Spaghetti (sparse linear) module for torch7 by MichaelMathieu

bull LuaSHKit - A Lua wrapper around the locality-sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images.

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos.

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms.

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch Provides SIFT and dSIFT modules

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform, designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object orientated environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning with PDFs, lectures, and code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization, with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB.

Data Analysis / Data Visualization

bull matlab_gbl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross-platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.

bull DiffSharp - An automatic differentiation (AD) library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and everyday use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.

bull Sho - An environment designed to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples using a trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift.

bull BPN-NeuralNetwork - A back-propagation neural network (BPN). This network can be used in product recommendation, user behavior analysis, data mining and data analysis.

bull Multi-Perceptron-NeuralNetwork - A multi-perceptron neural network designed with unlimited hidden layers

bull KRHebbian-Algorithm - A Hebbian learning algorithm for neural networks in machine learning

bull KRKmeans-Algorithm - Implements K-Means, the clustering and classification algorithm. It could be used in data mining and image compression.

bull KRFuzzyCMeans-Algorithm - Implements Fuzzy C-Means, the fuzzy clustering/classification algorithm in machine learning. It could be used in data mining and image compression.
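
To make the K-Means clustering named in the entries above more concrete, here is a minimal sketch of Lloyd's algorithm, written in Python rather than Objective-C. It is not the code of KRKmeans-Algorithm; the function name `kmeans`, the fixed iteration count, and the requirement that the caller supplies the initial centroids are all assumptions made to keep the example short and deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Run a fixed number of Lloyd iterations.

    points and centroids are sequences of equal-length numeric tuples.
    Returns the final centroids and the clusters from the last assignment step.
    """
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
final_centroids, final_clusters = kmeans(points, [(0, 0), (10, 0)])
print(final_centroids)
```

A production implementation would normally pick the initial centroids itself (for example at random) and stop once the assignments no longer change, rather than running a fixed number of iterations.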

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library.

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression
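
Building predictions from a linear regression, as the PredictionBuilder entry describes, can be sketched with an ordinary least-squares line fit. The example below is in Python, not PHP, and does not reproduce PredictionBuilder's API; the function names `fit_line` and `predict` are hypothetical, and only a single input variable is assumed.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(model)              # (2.0, 1.0)
print(predict(model, 5))  # 11.0
```

The sample data lie exactly on the line y = 2x + 1, so the fitted slope and intercept can be checked by hand.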

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.

bull Quepy - A python framework to transform natural language questions to queries in a database query language

bull TextBlob - Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables and GIZA++ alignments.

bull python-ucto

bull python-frog

bull python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to a C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - higher-level NLP built on Spacy

bull stanford-corenlp-python - Python wrapper for Stanford CoreNLP
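Several entries in this list (Distance, editdistance, jellyfish) are built around string distances. To make the two named measures concrete, here is a minimal pure-Python sketch of Hamming and Levenshtein distance. The function names are hypothetical and this is not the interface of any of the listed packages, which are typically much faster.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free when x == y
            ))
        previous = current
    return previous[-1]
```

Only two rows of the dynamic-programming table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.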

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning.

bull machine learning - Automated build consisting of a web-interface and a set of programmatic interfaces, with output stored into a NoSQL datastore

bull XGBoost Library

bull Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library.

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A machine learning library implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 - A Machine Learning library based on Theano

bull keras - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain.

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks - Implementation of image to image (pix2pix) translation from the paper by Isola et al [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness readability and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Focuses on the Bayesian nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing GPU-meta-programming code generating array oriented optimizing math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit.

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for my book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning.

bull pgmpy - A Python library for working with Probabilistic Graphical Models

bull DIGITS - A web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides a better user experience.

bull rgf_python Library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more. It's easy to use and easy to learn for beginners.

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics science and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - A compiler to LLVM aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser.

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing submitting debugging and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual intuitive and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters. Has a Python API.

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in Python, without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspel - raspell is an interface binding for ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for ruby especially for plotting roc curves into svg files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community based data collection, packed in a gem. Get a list of pretty much anything (stop words, countries, non words) in txt, json or hash.

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license

bull rustlearn - a machine learning framework featuring logistic regression, support vector machines, decision trees and random forests

bull rusty-machine - a pure-rust machine learning library

bull leaf - open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz Regularization for semiparametric additive hazards regression

bull arules - arules Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf Big Random Forests Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta A wrapper algorithm for all-relevant feature selection

bull bst - bst Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training Unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn Classification regression feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist Rule- and Instance-Based Regression Modeling

bull e1071 - e1071: Misc Functions of the Department of Statistics, TU Wien

bull earth - earth Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast Timeseries forecasting using ARIMA ETS STLM TBATS and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc Flexible procedures for clustering

bull frbs - frbs Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS Boosting Methods for GAMLSS

bull gbm - gbm Generalized Boosted Regression Models

bull glmnet - glmnet Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deeplearning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred Improved Predictors

bull kernlab - kernlab Kernel-based Machine Learning Lab

bull klaR - klaR Classification and visualization

bull lars - lars Least Angle Regression Lasso and Forward Stagewise

bull lasso2 - lasso2: L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree Mapping pruning and graphing tree models

bull mboost - mboost Model-Based Boosting

bull medley - medley Blending regression models using a greedy stepwise approach

bull mlr - mlr Machine Learning in R

bull mvpart - mvpart Multivariate partitioning

bull ncvreg - ncvreg Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree Oblique Trees for Classification Data

bull pamr - pamr Pam prediction analysis for microarrays

bull party - party A Laboratory for Recursive Partytioning

bull partykit - partykit A Toolkit for Recursive Partytioning

bull penalized - penalized: Penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA: Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM Feature Selection SVM using penalty functions

bull quantregForest - quantregForest Quantile Regression Forests

bull randomForest - randomForest: Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle Graphical user interface for data mining in R

bull rda - rda Shrunken Centroids Regularized Discriminant Analysis

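bull rdetools - rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces

bull REEMtree - REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data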

bull relaxo - relaxo Relaxed Lasso

bull rgenoud - rgenoud R version of GENetic Optimization Using Derivatives

bull rgp - rgp R genetic programming framework

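bull Rmalschains - Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R

bull rminer - rminer: Data mining methods in classification and regression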

bull ROCR - ROCR Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart Recursive Partitioning and Regression Trees

bull RPMM - RPMM Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka RWeka interface

bull RXshrink - RXshrink Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath svmpath the SVM Path algorithm

bull tgp - tgp Bayesian treed Gaussian process models

bull tree - tree Classification and regression trees

bull varSelRF - varSelRF Variable selection using random forests

bull XGBoostR Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R.

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis / Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis / Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE. Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses.

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.

Data Analysis / Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data. A unix filter for algebird-powered aggregation.

bull PredictionIO - PredictionIO, a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe - Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

21.25 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems.

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression, and Ridge Regression.

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc.

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization Training Techniques

ndash Unsupervised Generative Models

ndash Image Segmentation Object Detection

ndash Image Video

ndash Natural Language Processing

ndash Speech Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

22.1 Machine Learning

Be the first to contribute

22.2 Deep Learning

Forked from terryum's awesome deep learning papers

22.2.1 Understanding

bull Distilling the knowledge in a neural network (2015) G Hinton et al [pdf]

bull Deep neural networks are easily fooled High confidence predictions for unrecognizable images (2015) A Nguyen et al [pdf]

bull How transferable are features in deep neural networks (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf An astounding baseline for recognition (2014) A Razavian et al [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014) M Oquab et al [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

22.2.2 Optimization Training Techniques

bull Batch normalization Accelerating deep network training by reducing internal covariate shift (2015) S Ioffe and C Szegedy [pdf]

bull Delving deep into rectifiers Surpassing human-level performance on imagenet classification (2015) K He et al [pdf]

bull Dropout A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]

22.2.3 Unsupervised Generative Models

bull Pixel recurrent neural networks (2016) A Oord et al [pdf]

bull Improved techniques for training GANs (2016) T Salimans et al [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015) A Radford et al [pdf]

bull DRAW A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

22.2.4 Image Segmentation Object Detection

bull You only look once Unified real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

22.2.5 Image Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

22.2.6 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

22.2.7 Speech Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

22.2.8 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

22.2.9 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22.2.10 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009) H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses and more, forked from josephmisiti's awesome machine learning.

• Blogs
  – Data Science
  – Machine learning
  – Math
• Books
  – Machine learning
  – Deep learning
  – Probability & Statistics
  – Linear Algebra
• Courses
• Podcasts
• Tutorials

23.1 Blogs

23.1.1 Data Science

• https://jeremykun.com
• http://iamtrask.github.io
• http://blog.explainmydata.com
• http://andrewgelman.com
• http://simplystatistics.org
• http://www.evanmiller.org
• http://jakevdp.github.io
• http://blog.yhat.com
• http://wesmckinney.com
• http://www.overkillanalytics.net
• http://newton.cx/~peter
• http://mbakker7.github.io/exploratory_computing_with_python
• https://sebastianraschka.com/blog/index.html
• http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
• http://colah.github.io
• http://www.thomasdimson.com
• http://blog.smellthedata.com
• https://sebastianraschka.com
• http://dogdogfish.com
• http://www.johnmyleswhite.com
• http://drewconway.com/zia
• http://bugra.github.io
• http://opendata.cern.ch
• https://alexanderetz.com
• http://www.sumsar.net
• https://www.countbayesie.com
• http://blog.kaggle.com
• http://www.danvk.org
• http://hunch.net
• http://www.randalolson.com/blog
• https://www.johndcook.com/blog/r_language_for_programmers
• http://www.dataschool.io

23.1.2 Machine learning

• OpenAI
• Distill
• Andrej Karpathy Blog
• Colah's Blog
• WildML
• FastML
• TheMorningPaper

23.1.3 Math

• http://www.sumsar.net
• http://allendowney.blogspot.ca
• https://healthyalgorithms.com
• https://petewarden.com
• http://mrtz.org/blog

23.2 Books

23.2.1 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory Inference and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and SVN Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch

bull Bayesian Modeling Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning - Book + MatlabToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

23.2.2 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

bull NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

23.2.3 Probability & Statistics

bull Think Stats - Book + Python Code

bull From Algorithms to Z-Scores - Book

bull The Art of R Programming

bull Introduction to statistical thought

bull Basic Probability Theory

bull Introduction to probability - By Dartmouth College

bull Principle of Uncertainty

bull Probability & Statistics Cookbook

bull Advanced Data Analysis From An Elementary Point of View

bull Introduction to Probability - Book and course by MIT

bull The Elements of Statistical Learning Data Mining Inference and Prediction - Book

bull An Introduction to Statistical Learning with Applications in R - Book

bull Learning Statistics Using R

bull Introduction to Probability and Statistics Using R - Book

bull Advanced R Programming - Book

bull Practical Regression and Anova using R - Book

bull R practicals - Book

bull The R Inferno - Book

23.2.4 Linear Algebra

bull Linear Algebra Done Wrong

bull Linear Algebra Theory and Applications

bull Convex Optimization

bull Applied Numerical Computing

bull Applied Numerical Linear Algebra

23.3 Courses

bull CS231n Convolutional Neural Networks for Visual Recognition Stanford University

bull CS224d Deep Learning for Natural Language Processing Stanford University

bull Oxford Deep NLP 2017 Deep Learning for Natural Language Processing University of Oxford

bull Artificial Intelligence (Columbia University) - free

bull Machine Learning (Columbia University) - free

bull Machine Learning (Stanford University) - free

bull Neural Networks for Machine Learning (University of Toronto) - free

bull Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach, Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering & Retrieval, Machine Learning: Recommender Systems & Dimensionality Reduction, Machine Learning Capstone: An Intelligent Application with Deep Learning; free

bull Machine Learning Course (2014-15 session) (by Nando de Freitas University of Oxford) - Lecture slides and video recordings

bull Learning from Data (by Yaser S Abu-Mostafa Caltech) - Lecture videos available

23.4 Podcasts

bull The O'Reilly Data Show

bull Partially Derivative

bull The Talking Machines

bull The Data Skeptic

bull Linear Digressions

bull Data Stories

bull Learning Machines 101

bull Not So Standard Deviations

bull TWIMLAI

23.5 Tutorials

Be the first to contribute

CHAPTER 24

Contribute

Become a contributor! Check out our GitHub for more information.

Why It's Important

While a data scientist may not be directly involved in managing big data or performing Extract, Transform and Load (ETL) tasks, they will likely be responsible, at a minimum, for querying the sources where this data is stored. Ideally, a data scientist should be familiar with how this data is stored, how to query it and how to transform it (e.g. Spark, Hadoop and Hive are three key technologies to be familiar with).

2.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 3

Databases

• Introduction
• Relational
  – Types
  – Advantages
  – Disadvantages
• Non-relational/NoSQL
  – Types
  – Advantages
  – Disadvantages
• Terminology

3.1 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds there are a variety of database sub-types, which the following sections explore in detail.

Why It's Important

Oftentimes a data scientist will have to work with a database in order to retrieve the data they need to build their models or to do data analysis. Thus it's important that a data scientist familiarize themselves with the different types of databases and how to query them.

3.2 Relational

A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes) and rows that represent the data within the table. You can think of a column as the header indicating the name of the data being stored, and a row as the actual data points that are stored in the database. Oftentimes these rows will have a unique identifier associated with them, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.
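The role of primary and foreign keys can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables and their contents are hypothetical, chosen only for illustration; they do not come from this primer.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0)],
)

# The foreign key orders.customer_id joins each order to its customer row.
rows = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers AS c"
    " JOIN orders AS o ON o.customer_id = c.id"
    " GROUP BY c.id, c.name ORDER BY c.id"
).fetchall()
print(rows)
```

Each customer is stored once and identified by its primary key; the orders table refers to it by that key rather than repeating the customer's details.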

To get data out of a relational database you would use SQL, with different relational databases using slightly different versions of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a transaction. Transactions in relational databases represent a unit of work performed against the database.

An important concept with transactions in relational databases is maintaining ACID, which stands for:

• Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails then the whole transaction fails, leaving the database unchanged.
• Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.
• Isolation: Since transactions are often executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.
• Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
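Atomicity in particular can be sketched with Python's built-in sqlite3 module. The `accounts` table, the `transfer` helper and the balances are hypothetical names made up for this illustration; the sketch only assumes that a CHECK constraint violation raises an error partway through a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The second UPDATE would overdraw alice, so the CHECK constraint fails
# and the first UPDATE (the credit to bob) is rolled back as well.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

Because both updates belong to one transaction, the failed debit also undoes the credit, leaving the database unchanged.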

3.2.1 Types

While there are many different kinds of relational databases, the most popular are:

• PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.
• Oracle: A multi-model database management system that is produced and marketed by Oracle.
• MySQL: An open-source relational database management system, with a proprietary paid version available for additional functionality.
• Microsoft SQL Server: A relational database management system developed by Microsoft.
• MariaDB: A MySQL-compatible database engine forked from MySQL.

3.2.2 Advantages

• SQL standards are well defined and commonly accepted.
• Easy to categorize and store data that can later be queried.
• Simple to understand, since the table structure and relationships are intuitive to most users.
• Data integrity through strong data typing and validity checks that make sure data falls within an acceptable range.

3.2.3 Disadvantages

• Poor performance with unstructured data types due to schema and type constraints.
• Can be slower and harder to scale than NoSQL.
• Poorly suited to mapping certain kinds of data, such as graphs.

3.3 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Scalability, multi-structured data and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

3.3.1 Types

• Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to store user session data; examples include Redis and Amazon Dynamo.
• Document Store: Similar to a key-value store, however in this case the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.
• Column Store: In this implementation data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. These can be used in content management systems; examples include Cassandra and Apache HBase.
• Graph Store: These are based on the Entity - Attribute - Value model. Entities will have associated attributes and subsequent values when data is inserted. Nodes store data about each entity, along with the relationships between nodes. Graph stores can be used in applications such as social networks; examples include Neo4j and ArangoDB.
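The difference between the first two store types can be sketched with plain Python dictionaries. This is a toy analogy only, not how Redis or MongoDB are implemented; the keys, documents and the `find_by_tag` helper are hypothetical.

```python
# Key-value store: an opaque value looked up by a unique key
# (a Python dict is itself a hash table).
sessions = {}
sessions["session:42"] = "user=ada;expires=3600"

# Document store: the value is a structured document, so it can be
# queried by its fields, and documents need not share the same schema.
posts = {
    "post:1": {"title": "Hello", "tags": ["intro"], "author": {"name": "Ada"}},
    "post:2": {"title": "NoSQL notes", "tags": ["databases", "nosql"]},
}


def find_by_tag(documents, tag):
    """Return the keys of all documents whose `tags` field contains `tag`."""
    return sorted(key for key, doc in documents.items() if tag in doc.get("tags", []))


print(sessions["session:42"])
print(find_by_tag(posts, "nosql"))
```

In the key-value case the store knows nothing about the value's contents, whereas the document case allows lookups on fields inside the value.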

3.3.2 Advantages

• High availability
• Schema-free or schema-on-read options
• Ability to rapidly prototype applications
• Elastic scalability
• Can store massive amounts of data

3.3.3 Disadvantages

• Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be temporarily out of sync.
• Less support and maturity in the NoSQL ecosystem.

3.4 Terminology

Query: A query can be thought of as a single action that is taken on a database.

Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as in data stored. The two main types are vertical and horizontal scalability. Vertical scalability is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc. Horizontal scalability has to do with adding more machines and splitting the work amongst them.

Normalization: This is a technique for organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique for organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
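As a rough illustration of the last two terms, the sketch below normalizes a small flat table and then denormalizes it again. It uses plain Python lists and dicts in place of database tables, and all field names and values are hypothetical.

```python
# Denormalized: customer details are repeated on every order row.
orders_flat = [
    {"order_id": 10, "customer": "Ada", "city": "London", "amount": 25.0},
    {"order_id": 11, "customer": "Ada", "city": "London", "amount": 5.0},
    {"order_id": 12, "customer": "Grace", "city": "New York", "amount": 40.0},
]

# Normalize: store each customer once and reference it by key from orders.
customers = {}
orders = []
for row in orders_flat:
    customers[row["customer"]] = {"city": row["city"]}
    orders.append(
        {"order_id": row["order_id"], "customer": row["customer"], "amount": row["amount"]}
    )

# Denormalize: join orders back to customers so reads need no further lookup.
rejoined = [
    {
        "order_id": o["order_id"],
        "customer": o["customer"],
        "city": customers[o["customer"]]["city"],
        "amount": o["amount"],
    }
    for o in orders
]
print(len(customers), len(orders))
```

The normalized form stores each city once, so an update touches one place; the denormalized form trades that for reads that need no join.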

References

• https://dzone.com/articles/the-types-of-modern-databases
• https://www.mongodb.com/nosql-explained
• https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect
• https://opensourceforu.com/2017/05/different-types-nosql-databases

CHAPTER 4

Data Engineering

• Introduction
• Subj_1

4.1 Introduction

xxx

4.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 5

Data Wrangling

• Introduction
• Subj_1

5.1 Introduction

xxx

5.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 6

Data Visualization

• Introduction
• Subj_1

6.1 Introduction

xxx

6.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 7

Deep Learning

• Introduction
• Subj_1

7.1 Introduction

xxx

7.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 8

Machine Learning

• Introduction
• Subj_1

8.1 Introduction

xxx

8.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 9

Python

• Introduction
• Subj_1

9.1 Introduction

xxx

9.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References

CHAPTER 11

SQL

• Introduction
• Subj_1

11.1 Introduction

xxx

11.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 12

Glossary

Definitions of common machine learning terms.

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

• Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in a low average difference.
• High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green or red).

Classification Threshold: The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True; otherwise return False.
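A classification threshold can be sketched in a few lines. The `classify` helper and the probabilities below are hypothetical, not the output of any real model.

```python
def classify(probabilities, threshold=0.5):
    """Return True wherever the predicted probability exceeds the threshold."""
    return [p > threshold for p in probabilities]


predicted = [0.10, 0.45, 0.51, 0.90]  # made-up model outputs
default_labels = classify(predicted)
strict_labels = classify(predicted, threshold=0.8)
print(default_labels)
print(strict_labels)
```

Raising the threshold makes the classifier more conservative about asserting a positive class.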

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories:

• True Positives: we correctly predicted they do have diabetes
• True Negatives: we correctly predicted they don't have diabetes
• False Positives: we incorrectly predicted they do have diabetes (Type I error)
• False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
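The four categories can be counted directly from paired labels. The sketch below assumes boolean labels, and the `actual` and `predicted` lists are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for boolean labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1
        elif guess and not truth:
            counts["FP"] += 1  # Type I error
        elif truth:
            counts["FN"] += 1  # Type II error
        else:
            counts["TN"] += 1
    return counts


actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(counts, accuracy)
```

Accuracy is the share of predictions that land in the two "correct" cells, True Positives and True Negatives.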

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. we suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep Learning is derived from a machine learning algorithm called the perceptron, or multi-layer perceptron, which is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning has been around for decades, but today we have far more data and cheap computing power, which make the algorithm powerful enough to achieve state-of-the-art accuracy. The underlying algorithm is known as an artificial neural network. Deep learning is much more than a traditional artificial neural network, but it was highly influenced by machine learning's neural networks and perceptron networks.

Dimension: Dimension in machine learning and data science differs from its meaning in physics. Here, the dimension of data means how many features you have in your data set. E.g. in an object detection application, the flattened image size and color channels (e.g. 28 × 28 × 3) are the features of the input set. In house price prediction, if house size is the only feature in the data set, we call it 1-dimensional data.

Epoch: One epoch is one complete pass of the algorithm over the entire data set; the number of epochs is the number of times the algorithm sees the entire data set.

Extrapolation: Making predictions outside the range of a dataset. E.g. my dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute; "Color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection: Feature selection is the process of selecting relevant features from a data set for creating a machine learning model.

Feature Vector: A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters: Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of the model. The depth of trees in a Decision Tree or the number of hidden layers in a Neural Network are examples of hyperparameters.

Induction: A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. we keep observing X, so we infer that Y must be True.

Instance: A data point, row, or sample in a dataset. Another term for observation.

Learning Rate: The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate we can confidently move in the direction of the negative gradient, since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
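The trade-off can be seen on a toy problem. The sketch below minimizes f(x) = x², whose gradient is 2x, with plain gradient descent; the function, starting point, step count and learning rates are all assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, start=10.0, steps=50):
    """Minimize f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x  # derivative of x**2
        x -= learning_rate * gradient
    return x


well_tuned = gradient_descent(0.1)    # converges close to the minimum at 0
too_small = gradient_descent(0.001)   # moves in the right direction, but slowly
too_large = gradient_descent(1.1)     # overshoots further on every step
print(well_tuned, too_small, too_large)
```

With the same number of steps, the small rate has barely left the starting point, while the large rate has diverged instead of settling at the bottom.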

Loss: A measure of the difference between the true value (from the data set) and the predicted value (from the ML model). The lower the loss, the better the model (unless the model has over-fitted to the training data). The loss is calculated on the training and validation sets, and it is interpreted as how well the model is doing on these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation set.
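One common way to sum per-example errors is the sum of squared errors, sketched below. Squaring is an assumption made here so that positive and negative errors do not cancel; the glossary entry itself does not prescribe a particular loss function, and the values are made up.

```python
def sum_squared_error(true_values, predictions):
    """Sum the squared difference between each true value and its prediction."""
    return sum((t - p) ** 2 for t, p in zip(true_values, predictions))


y_true = [3.0, 5.0, 2.0]
loss = sum_squared_error(y_true, [2.0, 5.0, 4.0])  # errors of 1, 0 and -2
perfect = sum_squared_error(y_true, y_true)
print(loss, perfect)
```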

Machine Learning: Contribute a definition.

Model: A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks: Contribute a definition.

Normalization: Contribute a definition.

Null Accuracy: Baseline accuracy that can be achieved by always predicting the most frequent class ("B has the highest frequency, so let's guess B every time").
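Null accuracy can be computed from the labels alone, as in this minimal sketch. The `null_accuracy` helper and the label list are hypothetical.

```python
from collections import Counter


def null_accuracy(labels):
    """Return the most frequent class and the accuracy of always guessing it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


labels = ["B", "A", "B", "C", "B", "A", "B", "B"]  # 5 of the 8 labels are "B"
baseline = null_accuracy(labels)
print(baseline)
```

A model is only adding value if its accuracy beats this baseline.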

Observation: A data point, row, or sample in a dataset. Another term for instance.

Overfitting: Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set, but poorly on your test set (or new real-world data).

Parameters: Be the first to contribute.

Precision: In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{TruePositives}{TruePositives + FalsePositives}$$

Recall: Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

$$R = \frac{TruePositives}{TruePositives + FalseNegatives}$$

Recall vs Precision: Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the percentage of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren't helpful, however, because we missed the remaining brain tumors. We were super precise when we tried, but we didn't try hard enough.
• Recall, or sensitivity, provides another lens through which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
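The brain scan numbers above can be checked with a short sketch. The function names are illustrative; the counts follow the example: 100 images, 10 with tumors, and a single positive guess that was correct.

```python
def precision(true_positives, false_positives):
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    return true_positives / (true_positives + false_negatives)


# One correct positive guess: 1 true positive, 0 false positives, 9 missed tumors.
cautious = (precision(1, 0), recall(1, 9))

# Gaming recall by flagging every image: all 10 tumors caught, 90 false alarms.
flag_everything = (precision(10, 90), recall(10, 0))
print(cautious, flag_everything)
```

Neither extreme is useful on its own, which is why the two metrics are usually read together.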

Regression: Predicting a continuous output (e.g. price, sales).

Regularization: Contribute a definition.

Reinforcement Learning: Training a model to maximize a reward via iterative trial and error.

Segmentation: Contribute a definition.

Specificity: In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{TrueNegatives}{TrueNegatives + FalsePositives}$$
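A short sketch of the specificity formula, using made-up counts for 90 negative observations, shows how predicting everything as negative games the metric.

```python
def specificity(true_negatives, false_positives):
    return true_negatives / (true_negatives + false_positives)


# A model that raises 9 false alarms among the 90 negatives.
honest = specificity(81, 9)

# A model that predicts everything as negative never raises a false alarm.
always_negative = specificity(90, 0)
print(honest, always_negative)
```

The second model scores perfectly on specificity even though it never detects a single positive observation.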

Supervised Learning Training a model using a labeled dataset

Test Set A set of observations used at the end of model training and validation to assess the predictive power of yourmodel How generalizable is your model to unseen data

Training Set A set of observations used to generate machine learning models

Transfer Learning Contribute a definition

Type 1 Error False positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and the company hires them, but they turn out to be a poor hire.

Type 2 Error False negatives. The candidate was great, but the company passed on them.

Underfitting Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning Training a model to find patterns in an unlabeled dataset (e.g. clustering)

Validation Set A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.
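The "pause training" rule can be expressed as a simple check over per-epoch errors. This is only a sketch: the function name and the error values are made up for illustration, and real training loops usually add a patience window before stopping.

```python
def first_overfitting_epoch(train_errors, val_errors):
    """Return the first epoch where training error falls while
    validation error rises, or None if that never happens."""
    for epoch in range(1, min(len(train_errors), len(val_errors))):
        train_improved = train_errors[epoch] < train_errors[epoch - 1]
        val_worsened = val_errors[epoch] > val_errors[epoch - 1]
        if train_improved and val_worsened:
            return epoch
    return None


# Hypothetical error curves (one value per epoch).
train_errors = [0.90, 0.60, 0.40, 0.30, 0.20]
val_errors = [0.95, 0.70, 0.55, 0.60, 0.70]

stop_epoch = first_overfitting_epoch(train_errors, val_errors)
print("consider pausing training at epoch", stop_epoch)
```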

Variance How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.
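One way to make this concrete is to compare the spread of predictions that several retrained models give for the same observation. The numbers below are invented for illustration; only the standard-library `statistics.pvariance` function is used.

```python
from statistics import pvariance

# Hypothetical predictions for the SAME observation from models trained
# on four different training sets (values invented for illustration).
consistent_model = [10.1, 9.9, 10.0, 10.2]
erratic_model = [4.0, 15.5, 9.0, 12.5]

low_variance = pvariance(consistent_model)
high_variance = pvariance(erratic_model)
print(f"consistent: {low_variance:.3f}, erratic: {high_variance:.3f}")
```

The erratic model's predictions swing widely with the training set, which is the behaviour the high-variance bullet describes.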

CHAPTER 13

Business

• Introduction
• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

CHAPTER 14

Ethics

• Introduction
• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

CHAPTER 15

Mastering The Data Science Interview

• Introduction
• The Coding Challenge
• The HR Screen
• The Technical Call
• The Take Home Project
• The Onsite
• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that Data Science will be the sexiest job of the 21st Century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is that most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can vary hugely. This is partly because, as an industry, there still isn't an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it's important to determine what flavor of data science appeals to you. Based on your response to that, what you study and what questions you'll be asked will vary. Despite the differences in the types, generally speaking they'll follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple FizzBuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.
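For reference, a FizzBuzz question usually asks you to print the numbers 1 to n, replacing multiples of 3 with "Fizz", multiples of 5 with "Buzz", and multiples of both with "FizzBuzz". The version below is one common way to write it, not a required solution; exact rules vary by company.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


for number in range(1, 16):
    print(fizzbuzz(number))
```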

More often than not, you'll be provided with written test cases that will tell you if you've passed or failed a question. This will typically consider both correctness and complexity (i.e. how long did it take to run your code). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.
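When no tests are supplied, a handful of plain assertions covering the typical case and the edge cases is usually enough. The sketch below uses a hypothetical `median` helper as the function under test; the names and cases are illustrative only.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("median of an empty sequence is undefined")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def run_tests():
    """Hand-written test cases: odd length, even length, single item, empty."""
    assert median([3, 1, 2]) == 2
    assert median([4, 1, 3, 2]) == 2.5
    assert median([7]) == 7
    try:
        median([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty input")
    return True


print("all tests passed:", run_tests())
```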

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the 'correct' solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally when, even after passing all the test cases, you didn't get to the next stage in the interview process.

Preparation

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first. When you hit a snag, move to a simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5–10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, doing so will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving dynamic programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
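As an illustration of the brute-force-then-optimize flow, here is a sketch using the classic "two sum" problem (find two positions whose values add up to a target). The problem choice and function names are assumptions for illustration; your interview question will differ.

```python
def two_sum_brute_force(numbers, target):
    """First pass: check every pair, O(n^2) time."""
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if numbers[i] + numbers[j] == target:
                return (i, j)
    return None


def two_sum_hash_map(numbers, target):
    """Optimized pass: remember values already seen, O(n) time."""
    seen = {}  # value -> index where it was seen
    for j, value in enumerate(numbers):
        i = seen.get(target - value)
        if i is not None:
            return (i, j)
        seen[value] = j
    return None


sample = [2, 7, 11, 15]
print(two_sum_brute_force(sample, 9), two_sum_hash_map(sample, 9))
```

Having the brute-force version first gives you a reference answer to compare the optimized version against while you talk through the trade-off.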

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking The Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied to what you'll be doing once you start working. They can occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g. "Here's a dataset, find some insights and pitch to business stakeholders") or focus on a more concrete deliverable (e.g. "Here's some data, build a classifier").

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis. Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution with customer revenue).
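One quick check for a power-law style easter egg is to measure how much of the total revenue comes from the top customers. The sketch below uses only the standard library and synthetic data; the function name, the 10% cutoff, and the Pareto shape parameter are all assumptions chosen for illustration.

```python
import random


def revenue_concentration(revenues, top_fraction=0.1):
    """Share of total revenue contributed by the top `top_fraction` of customers."""
    ordered = sorted(revenues, reverse=True)
    top_n = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:top_n]) / sum(ordered)


random.seed(0)  # fixed seed so the synthetic data is reproducible
# Synthetic, hypothetical datasets: one heavy-tailed, one roughly even.
heavy_tailed = [random.paretovariate(1.2) for _ in range(1000)]
roughly_even = [random.uniform(50, 150) for _ in range(1000)]

print(f"heavy-tailed top-10% share: {revenue_concentration(heavy_tailed):.0%}")
print(f"roughly even top-10% share: {revenue_concentration(roughly_even):.0%}")
```

A top-decile share far above 10% is a hint that the revenue column is heavy-tailed and deserves a log-scale plot or a closer look at the largest customers.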

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or find by looking at the answers without the questions on this GitHub repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite. The rationale is that, despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more Leetcode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which typically evaluates your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioural interview questions such as 'Tell me about yourself', 'Why this company?', and 'Tell me about a time you had to deal with a difficult colleague'.

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desk working on their assigned projects, so try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500 company, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially for those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point, it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the option to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, the signing/relocation bonus is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401k matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.
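The multiplier effect of base salary can be illustrated with some rough arithmetic. Every number below is hypothetical (a $100,000 base, a $5,000 difference, 3% annual raises, four years), and the sketch ignores taxes, benefits, and equity.

```python
def total_base_pay(starting_salary, annual_raise, years):
    """Sum of base pay over `years`, with a raise applied after each year."""
    total = 0.0
    salary = starting_salary
    for _ in range(years):
        total += salary
        salary *= 1 + annual_raise
    return total


# Hypothetical offers: a one-time signing bonus vs. the same amount added to base.
YEARS = 4
RAISE = 0.03
bonus_offer = total_base_pay(100_000, RAISE, YEARS) + 5_000
higher_base_offer = total_base_pay(105_000, RAISE, YEARS)

print(f"bonus offer:       ${bonus_offer:,.0f}")
print(f"higher base offer: ${higher_base_offer:,.0f}")
```

Under these assumptions the higher base pulls ahead because the extra amount is paid every year and each raise compounds on it, whereas the bonus is paid once.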

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully this article has given you some insight into what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.

CHAPTER 16

Learning How To Learn

• Introduction
• Upgrading Your Learning Toolbox
• Common Learning Traps
• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall This strategy consists of quizzing yourself on topics you've been learning, with the most common way of doing this being flashcards. One study showed a retention difference of 46% between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spacing is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
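The spacing above is easy to turn into a review calendar. The sketch below assumes the offsets are counted from the day you first studied the material (the text does not say whether they are cumulative), and the function name and start date are illustrative.

```python
from datetime import date, timedelta

# Review offsets in days, taken from the schedule described above.
REVIEW_OFFSETS_DAYS = (1, 3, 7, 21)


def review_dates(first_studied, offsets=REVIEW_OFFSETS_DAYS):
    """Return the dates on which a flashcard deck should be reviewed."""
    return [first_studied + timedelta(days=days) for days in offsets]


schedule = review_dates(date(2019, 3, 19))
for review_day in schedule:
    print(review_day.isoformat())
```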

Chunking This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, the computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type led to students not being able to tell the difference between problems later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example Chemistry and Biology, where you'll be able to note both similarities and differences.

Deliberate Practice This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that deliberate practice is easier in some subjects than in others, the gold standard of deliberate practice generally consists of the following:

• A feedback loop in the form of a competition or a test
• A teacher who can guide you
• A uniquely tailored practice regimen
• Work outside of your comfort zone
• A concrete goal or performance target
• Full concentration while practicing
• Building or modifying skills
• A known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher An experienced teacher can mean the difference between breaking through and breaking down. As such, it's important to look for a teacher who:

• Keeps you up to date on specific progress
• Provides practice exercises
• Directs your attention to the aspects of learning you should be focusing on
• Helps develop correct mental representations
• Provides feedback on what errors you're making
• Gives you an aspirational model for what good performance looks like

Dynamic Testing As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro technique This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes and focus on one single thing until it goes off. Once the timer is up, you take a 5- or 10-minute break before starting again.
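The cycle can be laid out as a simple schedule of focus and break blocks. This is a planning sketch only (it does not run a real timer); the function name and the tuple layout are assumptions, and the default lengths are the 25-minute and 5-minute values mentioned above.

```python
def pomodoro_schedule(sessions, work_minutes=25, break_minutes=5):
    """Return (kind, start_minute, end_minute) blocks for a study session."""
    schedule = []
    elapsed = 0
    for index in range(sessions):
        schedule.append(("focus", elapsed, elapsed + work_minutes))
        elapsed += work_minutes
        # No break is scheduled after the final focus block.
        if index < sessions - 1:
            schedule.append(("break", elapsed, elapsed + break_minutes))
            elapsed += break_minutes
    return schedule


for kind, start, end in pomodoro_schedule(sessions=3):
    print(f"{kind:5s} minutes {start:3d}-{end:3d}")
```

Passing `work_minutes=50, break_minutes=10` gives the longer variant of the same cycle.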

16.3 Common Learning Traps

Authority Trap The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.

Confirmation Bias Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning Kruger Effect Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them, and then when tested you fail to get any of the answers correct. Having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy, agitated, or upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

• https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7

• https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn

• https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

• https://www.coursera.org/learn/learning-how-to-learn

CHAPTER 17

Communication

• Introduction
• Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

CHAPTER 18

Product

• Introduction
• Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

CHAPTER 19

Stakeholder Management

• Introduction
• Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

• Agriculture
• Art
• Biology
• Chemistry/Materials Science
• Climate/Weather
• Complex Networks
• Computer Networks
• Data Challenges
• Earth Science
• Economics
• Education
• Energy
• Finance
• GIS
• Government
• Healthcare
• Image Processing
• Machine Learning
• Museums
• Music
• Natural Language
• Neuroscience
• Physics
• Psychology/Cognition
• Public Domains
• Search Engines
• Social Networks
• Social Sciences
• Software
• Sports
• Time Series
• Transportation

20.1 Agriculture

• US Department of Agriculture's PLANTS Database
• US Department of Agriculture's Nutrient Database

20.2 Art

• Google's Quick Draw Sketch Dataset

203 Biology

bull 1000 Genomes

bull American Gut (Microbiome Project)

bull Broad Bioimage Benchmark Collection (BBBC)

bull Broad Cancer Cell Line Encyclopedia (CCLE)

bull Cell Image Library

bull Complete Genomics Public Data

bull EBI ArrayExpress

bull EBI Protein Data Bank in Europe

bull Electron Microscopy Pilot Image Archive (EMPIAR)

bull ENCODE project

bull Ensembl Genomes

54 Chapter 20 Datasets

Data Science Primer Documentation

bull Gene Expression Omnibus (GEO)

bull Gene Ontology (GO)

bull Global Biotic Interactions (GloBI)

bull Harvard Medical School (HMS) LINCS Project

bull Human Genome Diversity Project

bull Human Microbiome Project (HMP)

bull ICOS PSP Benchmark

bull International HapMap Project

bull Journal of Cell Biology DataViewer

bull MIT Cancer Genomics Data

bull NCBI Proteins

bull NCBI Taxonomy

bull NCI Genomic Data Commons

bull NIH Microarray data or FTP (see FTP link on RAW)

bull OpenSNP genotypes data

bull Pathguid - Protein-Protein Interactions Catalog

bull Protein Data Bank

bull Psychiatric Genomics Consortium

bull PubChem Project

bull PubGene (now Coremine Medical)

bull Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

bull Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

bull Sequence Read Archive(SRA)

bull Stanford Microarray Data

bull Stowers Institute Original Data Repository

bull Systems Science of Biological Dynamics (SSBD) Database

bull The Cancer Genome Atlas (TCGA) available via Broad GDAC

bull The Catalogue of Life

bull The Personal Genome Project or PGP

bull UCSC Public Data

bull UniGene

bull Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

bull NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

bull Open Quantum Materials Database

bull Citrination Public Datasets

bull Khazana Project

20.5 Climate/Weather

bull Actuaries Climate Index

bull Australian Weather

bull Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system

bull Brazilian Weather - Historical data (In Portuguese)

bull Canadian Meteorological Centre

bull Climate Data from UEA (updated monthly)

bull European Climate Assessment & Dataset

bull Global Climate Data Since 1929

bull NASA Global Imagery Browse Services

bull NOAA Bering Sea Climate

bull NOAA Climate Datasets

bull NOAA Realtime Weather Models

bull NOAA SURFRAD Meteorology and Radiation Datasets

bull The World Bank Open Data Resources for Climate Change

bull UEA Climatic Research Unit

bull WorldClim - Global Climate Data

bull WU Historical Weather Worldwide

20.6 Complex Networks

bull AMiner Citation Network Dataset

bull CrossRef DOI URLs

bull DBLP Citation dataset

bull DIMACS Road Networks Collection

bull NBER Patent Citations

bull Network Repository with Interactive Exploratory Analysis Tools

bull NIST complex networks data collection

bull Protein-protein interaction network

bull PyPI and Maven Dependency Network

bull Scopus Citation Database

bull Small Network Data

bull Stanford GraphBase (Steven Skiena)

bull Stanford Large Network Dataset Collection

bull Stanford Longitudinal Network Data Sources

bull The Koblenz Network Collection

bull The Laboratory for Web Algorithmics (UNIMI)

bull The Nexus Network Repository

bull UCI Network Data Repository

bull UFL sparse matrix collection

bull WSU Graph Database

20.7 Computer Networks

bull 3.5B Web Pages from CommonCrawl 2012

bull 53.5B Web clicks of 100K users in Indiana Univ.

bull CAIDA Internet Datasets

bull ClueWeb09 - 1B web pages

bull ClueWeb12 - 733M web pages

bull CommonCrawl Web Data over 7 years

bull CRAWDAD Wireless datasets from Dartmouth Univ

bull Criteo click-through data

bull OONI Open Observatory of Network Interference - Internet censorship data

bull Open Mobile Data by MobiPerf

bull Rapid7 Sonar Internet Scans

bull UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

bull Bruteforce Database

bull Challenges in Machine Learning

bull CrowdANALYTIX dataX

bull D4D Challenge of Orange

bull DrivenData Competitions for Social Good

bull ICWSM Data Challenge (since 2009)

bull Kaggle Competition Data

bull KDD Cup by Tencent 2012

bull Localytics Data Visualization Challenge

bull Netflix Prize

bull Space Apps Challenge

bull Telecom Italia Big Data Challenge

bull TravisTorrent Dataset - MSR'2017 Mining Challenge

bull Yelp Dataset Challenge

20.9 Earth Science

bull AQUASTAT - Global water resources and uses

bull BODC - marine data of ~22K vars

bull Earth Models

bull EOSDIS - NASA's earth observing system data

bull Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3

bull Marinexplore - Open Oceanographic Data

bull Smithsonian Institution Global Volcano and Eruption Database

bull USGS Earthquake Archives

20.10 Economics

bull American Economic Association (AEA)

bull EconData from UMD

bull Economic Freedom of the World Data

bull Historical MacroEconomic Statistics

bull International Economics Database and various data tools

bull International Trade Statistics

bull Internet Product Code Database

bull Joint External Debt Data Hub

bull Jon Haveman International Trade Data Links

bull OpenCorporates Database of Companies in the World

bull Our World in Data

bull SciencesPo World Trade Gravity Datasets

bull The Atlas of Economic Complexity

bull The Center for International Data

bull The Observatory of Economic Complexity

bull UN Commodity Trade Statistics

bull UN Human Development Reports

20.11 Education

bull College Scorecard Data

bull Student Data from Free Code Camp

20.12 Energy

bull AMPds

bull BLUEd

bull COMBED

bull Dataport

bull DRED

bull ECO

bull EIA

bull HES - Household Electricity Study UK

bull HFED

bull iAWE

bull PLAID - the Plug Load Appliance Identification Dataset

bull REDD

bull Tracebase

bull UK-DALE - UK Domestic Appliance-Level Electricity

bull WHITED

20.13 Finance

bull CBOE Futures Exchange

bull Google Finance

bull Google Trends

bull NASDAQ

bull NYSE Market Data (see FTP link on RAW)

bull OANDA

bull OSU Financial data

bull Quandl

bull St. Louis Federal

bull Yahoo Finance

20.14 GIS

bull ArcGIS Open Data portal

bull Cambridge, MA, US, GIS data on GitHub

bull Factual Global Location Data

bull Geo Spatial Data from ASU

bull Geo Wiki Project - Citizen-driven Environmental Monitoring

bull GeoFabrik - OSM data extracted to a variety of formats and areas

bull GeoNames Worldwide

bull Global Administrative Areas Database (GADM)

bull Homeland Infrastructure Foundation-Level Data

bull Landsat 8 on AWS

bull List of all countries in all languages

bull National Weather Service GIS Data Portal

bull Natural Earth - vectors and rasters of the world

bull OpenAddresses

bull OpenStreetMap (OSM)

bull Pleiades - Gazetteer and graph of ancient places

bull Reverse Geocoder using OSM data & additional high-resolution data files

bull TIGER/Line - US boundaries and roads

bull TwoFishes - Foursquare's coarse geocoder

bull TZ Timezones shapefiles

bull UN Environmental Data

bull World boundaries from the US Department of State

bull World countries in multiple formats

20.15 Government

bull A list of cities and countries contributed by community

bull Open Data for Africa

bull OpenDataSoft's list of 1600 open data

20.16 Healthcare

bull EHDP Large Health Data Sets

bull Gapminder World demographic databases

bull Medicare Coverage Database (MCD), US

bull Medicare Data Engine of medicare.gov Data

bull Medicare Data File

bull MeSH, the vocabulary thesaurus used for indexing articles for PubMed

bull Number of Ebola Cases and Deaths in Affected Countries (2014)

bull Open-ODS (structure of the UK NHS)

bull OpenPaymentsData, Healthcare financial relationship data

bull The Cancer Genome Atlas project (TCGA) and BigQuery table

bull World Health Organization Global Health Observatory

20.17 Image Processing

bull 10k US Adult Faces Database

bull 2GB of Photos of Cats or Archive version

bull Adience Unfiltered faces for gender and age classification

bull Affective Image Classification

bull Animals with attributes

bull Caltech Pedestrian Detection Benchmark

bull Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)

bull Face Recognition Benchmark

bull GDXray X-ray images for X-ray testing and Computer Vision

bull ImageNet (in WordNet hierarchy)

bull Indoor Scene Recognition

bull International Affective Picture System, UFL

bull Massive Visual Memory Stimuli, MIT

bull MNIST database of handwritten digits, 70,000 examples (60,000 training, 10,000 test)

bull Several Shape-from-Silhouette Datasets

bull Stanford Dogs Dataset

bull SUN database, MIT

bull The Action Similarity Labeling (ASLAN) Challenge

bull The Oxford-IIIT Pet Dataset

bull Violent-Flows - Crowd Violence/Non-violence Database and benchmark

bull Visual genome

bull YouTube Faces Database

20.18 Machine Learning

bull Context-aware data sets from five domains

bull Delve Datasets for classification and regression (Univ. of Toronto)

bull Discogs Monthly Data

bull eBay Online Auctions (2012)

bull IMDb Database

bull Keel Repository for classification, regression and time series

bull Labeled Faces in the Wild (LFW)

bull Lending Club Loan Data

bull Machine Learning Data Set Repository

bull Million Song Dataset

bull More Song Datasets

bull MovieLens Data Sets

bull New Yorker caption contest ratings

bull RDataMining - "R and Data Mining" ebook data

bull Registered Meteorites on Earth

bull Restaurants Health Score Data in San Francisco

bull UCI Machine Learning Repository

bull Yahoo Ratings and Classification Data

bull YouTube-8M

20.19 Museums

bull Canada Science and Technology Museums Corporation's Open Data

bull Cooper-Hewitt's Collection Database

bull Minneapolis Institute of Arts metadata

bull Natural History Museum (London) Data Portal

bull Rijksmuseum Historical Art Collection

bull Tate Collection metadata

bull The Getty vocabularies

20.20 Music

bull Nottingham Folk Songs

bull Bach 10

20.21 Natural Language

bull Automatic Keyphrase Extraction

bull Blogger Corpus

bull CLiPS Stylometry Investigation Corpus

bull ClueWeb09 FACC

bull ClueWeb12 FACC

bull DBpedia - 4.58M things with 583M facts

bull Flickr Personal Taxonomies

bull Freebase.com of people, places, and things

bull Google Books Ngrams (2.2TB)

bull Google MC-AFP, generated based on the publicly available Gigaword dataset using Paragraph Vectors

bull Google Web 5gram (1TB, 2006)

bull Gutenberg eBooks List

bull Hansards text chunks of Canadian Parliament

bull Machine Comprehension Test (MCTest) of text from Microsoft Research

bull Machine Translation of European languages

bull Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

bull Multi-Domain Sentiment Dataset (version 2.0)

bull Open Multilingual Wordnet

bull Personae Corpus

bull SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

bull SMS Spam Collection in English

bull Universal Dependencies

bull USENET postings corpus of 2005~2011

bull Webhose - News/Blogs in multiple languages

bull Wikidata - Wikipedia databases

bull Wikipedia Links data - 40 Million Entities in Context

bull WordNet databases and tools

20.22 Neuroscience

bull Allen Institute Datasets

bull Brain Catalogue

bull Brainomics

bull CodeNeuro Datasets

bull Collaborative Research in Computational Neuroscience (CRCNS)

bull FCP-INDI

bull Human Connectome Project

bull NDAR

bull NeuroData

bull Neuroelectro

bull NIMH Data Archive

bull OASIS

bull OpenfMRI

bull Study Forrest

20.23 Physics

bull CERN Open Data Portal

bull Crystallography Open Database

bull NASA Exoplanet Archive

bull NSSDC (NASA) data of 550 spacecraft

bull Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

bull OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

bull Amazon

bull Archive-it from Internet Archive

bull Archive.org Datasets

bull CMU JASA data archive

bull CMU StatLab collections

bull Data.World

bull Data360

bull Datamob.org

bull Google

bull Infochimps

bull KDNuggets Data Collections

bull Microsoft Azure Data Market Free DataSets

bull Microsoft Data Science for Research

bull Numbrary

bull Open Library Data Dumps

bull Reddit Datasets

bull RevolutionAnalytics Collection

bull Sample R data sets

bull Stats4Stem R data sets

bull StatSci.org

bull The Washington Post List

bull UCLA SOCR data collection

bull UFO Reports

bull Wikileaks 9/11 pager intercepts

bull Yahoo Webscope

20.26 Search Engines

bull Academic Torrents of data sharing from UMB

bull Datahub.io

bull DataMarket (Qlik)

bull Harvard Dataverse Network of scientific data

bull ICPSR (UMICH)

bull Institute of Education Sciences

bull National Technical Reports Library

bull Open Data Certificates (beta)

bull OpenDataNetwork - A search engine of all Socrata powered data portals

bull Statista.com - Statistics and Studies

bull Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

bull 72 hours #gamergate Twitter Scrape

bull Ancestry.com Forum Dataset over 10 years

bull Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

bull CMU Enron Email of 150 users

bull EDRM Enron EMail of 151 users hosted on S3

bull Facebook Data Scrape (2005)

bull Facebook Social Networks from LAW (since 2007)

bull Foursquare from UMN/Sarwat (2013)

bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull YouTube Video Social Graph in 2007, 2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive

bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (ND-GAIN)

bull Open Crime and Policing Data in England, Wales and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1, from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database

bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks, port calls and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport, airline and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome-machine-learning.

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python

bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF)

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality, with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph

bull CUDA - This is a fast C++/CUDA implementation of convolutional neural networks [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework, with fewer than 1000 lines of core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state of the art machine learning easy to work with and integrate into existing applications

bull Distributed Machine learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that change for every training instance. Written in C++ with bindings in Python

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and allows data to be processed in batch, online and distributed modes

bull LightGBM - A framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization and storage

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ for segmenting/labeling sequential data & other Natural Language Processing tasks

bull CRFsuite for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source, C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression

General-Purpose Machine Learning

bull gago - Multi-population, flexible, parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Cortical.io as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Named Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System

bull Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK components in Java and is built on top of Apache UIMA

bull Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license

bull CogcompNLP developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deployment of Apache Spark MLLib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application

Speech Recognition

bull CMU Sphinx - Open Source Toolkit For Speech Recognition purely based on Java speech recognition library

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitterrsquos text processing library

bull NLP.js - NLP utilities in JavaScript and CoffeeScript

bull natural - General natural language facilities for Node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3D plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - Customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - Library written in vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnet.js - ConvNetJS is a JavaScript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in JavaScript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in JavaScript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means and agglomerative clustering

bull Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js

bull Kmeans.js - Simple JavaScript implementation of the k-means algorithm, for Node.js and the browser

bull LDA.js - LDA topic modeling for Node.js

bull Learning.js - JavaScript implementation of logistic regression/C4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration; just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for Node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for Node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has an advantage on large datasets and multi-threaded training

bull kalman - Kalman filter for JavaScript

bull shaman - Node.js library with support for both simple and multiple linear regression

bull ml.js - Machine learning and numerical analysis tools for Node.js and the browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

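bull simple-statistics - A JavaScript implementation of descriptive, regression, and inference statistics that works in browsers as well as in Node.js

bull regression-js - A JavaScript library containing a collection of least squares fitting methods for finding a trend in a set of data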

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance
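As a rough illustration of what a great-circle distance library computes, here is a minimal sketch of the haversine formula. It is written in Python rather than JavaScript, and it is not the API of GreatCircle or any other library listed here; the function name `great_circle_km` and the constant `EARTH_RADIUS_KM` are assumptions made for this example.

```python
import math

# Assumed constant: mean Earth radius in kilometres.
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Haversine distance between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)

    # Clamp to 1.0 so rounding error can never push asin out of its domain.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))
```

The formula treats the Earth as a perfect sphere, which is usually adequate for rough distances but not for survey-grade work.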

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models - Mixed-effects models

bull Simple MCMC - Basic MCMC sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba - Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata SAS and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version

bull graph - Graph package for Torch

bull randomkit - NumPy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7. FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - Framework for Torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for the nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn. RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns

bull optim - An optimization library for Torch. SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua, and then optimizing them with OpenGM

bull sphagetti - Spaghetti (sparse linear) module for torch7 by @MichaelMathieu

bull LuaSHKit - A Lua wrapper around the locality-sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository - linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - A technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning with PDFs, lectures, and code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB

Data Analysis / Data Visualization

bull matlab_gbl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework

bull DiffSharp - An automatic differentiation (AD) library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and everyday use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin

bull Sho - An interactive environment for data analysis and scientific computing that connects scripts (in IronPython) with compiled code (in .NET) to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples using a trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift

bull BPN-NeuralNetwork - This network can be used in product recommendation, user behavior analysis, data mining and data analysis

bull Multi-Perceptron-NeuralNetwork - Multi-perceptron neural network designed with unlimited hidden layers

bull KRHebbian-Algorithm - Hebbian learning algorithm for neural networks in machine learning

bull KRKmeans-Algorithm - Implements K-Means, the clustering and classification algorithm. It can be used in data mining and image compression

bull KRFuzzyCMeans-Algorithm - Implements Fuzzy C-Means, the fuzzy clustering/classification algorithm in machine learning. It can be used in data mining and image compression

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others

bull Quepy - A Python framework to transform natural language questions to queries in a database query language

bull TextBlob - A consistent API for common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments

bull python-ucto

bull python-frog

bull python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - Higher-level NLP built on spaCy

bull stanford-corenlp-python - Python wrapper for Stanford CoreNLP
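Several entries in this list deal with string similarity (Levenshtein and Hamming distance, edit distance, fuzzy matching). The following is a minimal pure-Python sketch of the two classic measures, meant only to show what these libraries compute. The function names `hamming` and `levenshtein` are chosen for this example and are not the API of any package listed above.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3 and `hamming("karolin", "kathrin")` is 3. The dedicated libraries are typically much faster because they are implemented in C or Cython.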

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code, and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning

bull machine learning - Automated build consisting of a web-interface and a set of programmatic-interfaces, with generated models stored into a NoSQL datastore

bull XGBoost Library

bull Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A library with various machine learning models implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 - A Machine Learning library based on Theano

bull keras - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks - Implementation of image to image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing GPU-meta-programming code generating array oriented optimizing math compiler inPython

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for my book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning

bull pgmpy - A Python library for working with Probabilistic Graphical Models

bull DIGITS - The Deep Learning GPU Training System (DIGITS) is a web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides better user experience

bull rgf_python Library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more. It's easy to use and easy to learn for beginners

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Python JIT (just in time) compiler to LLVM aimed at scientific Python by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - Binding to igraph library, a general purpose graph library

bull Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 for popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual intuitive and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has a Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in Python without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspel - raspell is an interface binding for ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames lists and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for ruby especially for plotting roc curves into svg files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community based data collection, packed in a gem. Get a list of pretty much anything (stop words, countries, non words) in txt, json, or hash

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU, under the MIT license

bull rustlearn - a machine learning framework featuring logistic regression, support vector machines, decision trees, and random forests

bull rusty-machine - a pure-rust machine learning library

bull leaf - open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz Regularization for semiparametric additive hazards regression

bull arules - arules Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf Big Random Forests Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta A wrapper algorithm for all-relevant feature selection

bull bst - bst Gradient Boosting

bull C50 - C50 C50 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training Unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn Classification regression feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist Rule- and Instance-Based Regression Modeling

bull e1071 TU Wien

bull earth - earth Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn Data sets, functions and examples from the book "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast Timeseries forecasting using ARIMA ETS STLM TBATS and neural network models

bull forecastHybrid - forecastHybrid Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc Flexible procedures for clustering

bull frbs - frbs Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS Boosting Methods for GAMLSS

bull gbm - gbm Generalized Boosted Regression Models

bull glmnet - glmnet Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deep learning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred Improved Predictors

bull kernlab - kernlab Kernel-based Machine Learning Lab

bull klaR - klaR Classification and visualization

bull lars - lars Least Angle Regression Lasso and Forward Stagewise

bull lasso2 - lasso2 L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree Mapping pruning and graphing tree models

bull mboost - mboost Model-Based Boosting

bull medley - medley Blending regression models using a greedy stepwise approach

bull mlr - mlr Machine Learning in R

bull mvpart - mvpart Multivariate partitioning

bull ncvreg - ncvreg Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree Oblique Trees for Classification Data

bull pamr - pamr Pam prediction analysis for microarrays

bull party - party A Laboratory for Recursive Partytioning

bull partykit - partykit A Toolkit for Recursive Partytioning

bull penalized - Penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM Feature Selection SVM using penalty functions

bull quantregForest - quantregForest Quantile Regression Forests

bull randomForest - randomForest Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle Graphical user interface for data mining in R

bull rda - rda Shrunken Centroids Regularized Discriminant Analysis

bull rdetools in Feature Spaces

bull REEMtree Data

bull relaxo - relaxo Relaxed Lasso

bull rgenoud - rgenoud R version of GENetic Optimization Using Derivatives

bull rgp - rgp R genetic programming framework

bull Rmalschains in R

bull rminer in classification and regression

bull ROCR - ROCR Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart Recursive Partitioning and Regression Trees

bull RPMM - RPMM Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka R/Weka interface

bull RXshrink - RXshrink Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath svmpath the SVM Path algorithm

bull tgp - tgp Bayesian treed Gaussian process models

bull tree - tree Classification and regression trees

bull varSelRF - varSelRF Variable selection using random forests

bull XGBoostR Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE! Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python, and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters, and performing inference

Data Analysis Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deploying Apache Spark MLlib machine learning models as realtime, batch, or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data A unix filter for algebird-powered aggregation

bull PredictionIO - PredictionIO a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe - Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark, and Parquet. Apache 2 licensed

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

21.25 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression, and Ridge Regression

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc.

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization Training Techniques

ndash Unsupervised Generative Models

ndash Image Segmentation Object Detection

ndash Image Video

ndash Natural Language Processing

ndash Speech Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

22.1 Machine Learning

Be the first to contribute!

22.2 Deep Learning

Forked from terryum's awesome deep learning papers.

22.2.1 Understanding

bull Distilling the knowledge in a neural network (2015) G Hinton et al [pdf]

bull Deep neural networks are easily fooled High confidence predictions for unrecognizable images (2015) A Nguyen et al [pdf]

bull How transferable are features in deep neural networks (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf An astounding baseline for recognition (2014) A Razavian et al [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014) M Oquab et al [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

22.2.2 Optimization Training Techniques

bull Batch normalization Accelerating deep network training by reducing internal covariate shift (2015) S Ioffe and C Szegedy [pdf]

bull Delving deep into rectifiers Surpassing human-level performance on imagenet classification (2015) K He et al [pdf]

bull Dropout A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]

22.2.3 Unsupervised Generative Models

bull Pixel recurrent neural networks (2016) A Oord et al [pdf]

bull Improved techniques for training GANs (2016) T Salimans et al [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015) A Radford et al [pdf]

bull DRAW A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

22.2.4 Image Segmentation Object Detection

bull You only look once Unified real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

22.2.5 Image Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

22.2.6 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

22.2.7 Speech Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

22.2.8 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

22.2.9 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22.2.10 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009) H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses and more, forked from josephmisiti's awesome machine learning.

bull Blogs

ndash Data Science

ndash Machine learning

ndash Math

bull Books

ndash Machine learning

ndash Deep learning

ndash Probability & Statistics

ndash Linear Algebra

bull Courses

bull Podcasts

bull Tutorials

23.1 Blogs

23.1.1 Data Science

bull https://jeremykun.com

bull http://iamtrask.github.io

bull http://blog.explainmydata.com

bull http://andrewgelman.com

bull http://simplystatistics.org

bull http://www.evanmiller.org

bull http://jakevdp.github.io

bull http://blog.yhat.com

bull http://wesmckinney.com

bull http://www.overkillanalytics.net

bull http://newton.cx/~peter

bull http://mbakker7.github.io/exploratory_computing_with_python

bull https://sebastianraschka.com/blog/index.html

bull http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

bull http://colah.github.io

bull http://www.thomasdimson.com

bull http://blog.smellthedata.com

bull https://sebastianraschka.com

bull http://dogdogfish.com

bull http://www.johnmyleswhite.com

bull http://drewconway.com/zia

bull http://bugra.github.io

bull http://opendata.cern.ch

bull https://alexanderetz.com

bull http://www.sumsar.net

bull https://www.countbayesie.com

bull http://blog.kaggle.com

bull http://www.danvk.org

bull http://hunch.net

bull http://www.randalolson.com/blog

bull https://www.johndcook.com/blog/r_language_for_programmers

bull http://www.dataschool.io

23.1.2 Machine learning

bull OpenAI

bull Distill

bull Andrej Karpathy Blog

bull Colah's Blog

bull WildML

bull FastML

bull TheMorningPaper

23.1.3 Math

bull http://www.sumsar.net

bull http://allendowney.blogspot.ca

bull https://healthyalgorithms.com

bull https://petewarden.com

bull http://mrtz.org/blog

23.2 Books

23.2.1 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory Inference and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and SVN Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch

bull Bayesian Modeling Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning Book+MatlabToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

23.2.2 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

bull NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

23.2.3 Probability & Statistics

bull Think Stats - Book + Python Code

bull From Algorithms to Z-Scores - Book

bull The Art of R Programming

bull Introduction to statistical thought

bull Basic Probability Theory

bull Introduction to probability - By Dartmouth College

bull Principle of Uncertainty

bull Probability & Statistics Cookbook

bull Advanced Data Analysis From An Elementary Point of View

bull Introduction to Probability - Book and course by MIT

bull The Elements of Statistical Learning: Data Mining, Inference, and Prediction - Book

bull An Introduction to Statistical Learning with Applications in R - Book

bull Learning Statistics Using R

bull Introduction to Probability and Statistics Using R - Book

bull Advanced R Programming - Book

bull Practical Regression and Anova using R - Book

bull R practicals - Book

bull The R Inferno - Book

23.2.4 Linear Algebra

bull Linear Algebra Done Wrong

bull Linear Algebra Theory and Applications

bull Convex Optimization

bull Applied Numerical Computing

bull Applied Numerical Linear Algebra

23.3 Courses

bull CS231n Convolutional Neural Networks for Visual Recognition Stanford University

bull CS224d Deep Learning for Natural Language Processing Stanford University

bull Oxford Deep NLP 2017 Deep Learning for Natural Language Processing University of Oxford

bull Artificial Intelligence (Columbia University) - free

bull Machine Learning (Columbia University) - free

bull Machine Learning (Stanford University) - free

bull Neural Networks for Machine Learning (University of Toronto) - free

bull Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach, Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering & Retrieval, Machine Learning: Recommender Systems & Dimensionality Reduction, Machine Learning Capstone: An Intelligent Application with Deep Learning; free

bull Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

bull Learning from Data (by Yaser S Abu-Mostafa Caltech) - Lecture videos available

23.4 Podcasts

• The O'Reilly Data Show

• Partially Derivative

• The Talking Machines

• The Data Skeptic

• Linear Digressions

• Data Stories

• Learning Machines 101

• Not So Standard Deviations

• TWIMLAI


23.5 Tutorials

Be the first to contribute


CHAPTER 24

Contribute

Become a contributor! Check out our GitHub for more information.


CHAPTER 3

Databases

• Introduction

• Relational

  – Types

  – Advantages

  – Disadvantages

• Non-relational/NoSQL

  – Types

  – Advantages

  – Disadvantages

• Terminology

3.1 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing, and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds of databases are a variety of other database sub-types, which the following sections will explore in detail.

Why It's Important

Often a data scientist will have to work with a database in order to retrieve the data they need to build their models or to do data analysis. Thus it's important that a data scientist familiarize themselves with the different types of databases and how to query them.


3.2 Relational

A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes), with rows that represent the data within the table. You can think of the column as the header indicating the name of the data being stored, and the row as the actual data points that are stored in the database. Often these rows will have a unique identifier associated with them, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.

To get data out of a relational database you would use SQL, with different relational databases using slightly different versions of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a Transaction. Transactions in relational databases represent a unit of work performed against the database.
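As a minimal sketch of primary keys, foreign keys, and a join, the example below uses Python's built-in sqlite3 module with an in-memory database. The customers and orders tables and their contents are made up for illustration; other relational databases would accept similar, though not necessarily identical, SQL.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical schema: each order row points at a customer row.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)
conn.commit()

# Join the two tables on the foreign key and total the orders per customer.
rows = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers AS c"
    " JOIN orders AS o ON o.customer_id = c.id"
    " GROUP BY c.id ORDER BY c.name"
).fetchall()
print(rows)
```

The primary key (customers.id) identifies each row, and the foreign key (orders.customer_id) is what the JOIN clause matches on.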

An important concept with transactions in relational databases is maintaining ACID, which stands for:

• Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails, then the whole transaction fails, leaving the database unchanged.

• Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.

• Isolation: Since transactions are often executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.

• Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
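Atomicity can be illustrated with a small sketch, again assuming Python's built-in sqlite3 module. The accounts table, the transfer helper, and the balances are hypothetical; the point is only that a failed query inside a transaction leaves the database as it was before the transaction started.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move `amount` between two accounts as one transaction (illustrative helper)."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates succeed
failed = transfer(conn, "alice", "bob", 500)  # second update violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob had already been applied when the debit from alice was rejected; the rollback undoes it, so neither account changes.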

3.2.1 Types

While there are many different kinds of relational databases, the most popular are:

• PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.

• Oracle: A multi-model database management system that is produced and marketed by Oracle.

• MySQL: An open-source relational database management system with a proprietary paid version available for additional functionality.

• Microsoft SQL Server: A relational database management system developed by Microsoft.

• MariaDB: A MySQL-compatible database engine forked from MySQL.

3.2.2 Advantages

• SQL standards are well defined and commonly accepted.

• Easy to categorize and store data that can later be queried.

• Simple to understand, since the table structure and relationships are intuitive to most users.

• Data integrity through strong data typing and validity checks to make sure that data falls within an acceptable range.

3.2.3 Disadvantages

• Poor performance with unstructured data types due to schema and type constraints.


• Can be slow and not scalable compared to NoSQL.

• Unable to map certain kinds of data, such as graphs.

3.3 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Dealing with scalability, multi-structured data, and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

3.3.1 Types

• Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. They can be used to store user session data, and examples of these types of databases include Redis and Amazon Dynamo.

• Document Store: Similar to a key-value store, however in this case the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.

• Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. These can be used in content management systems, and examples of these types of databases include Cassandra and Apache HBase.

• Graph Store: These are based on the Entity - Attribute - Value model. Entities will have associated attributes and subsequent values when data is inserted. Nodes will store data about each entity along with the relationships between nodes. Graph stores can be used in applications such as social networks, and examples of these types of databases include Neo4j and ArangoDB.
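The four data models above can be mimicked with plain Python dictionaries and sets. This is only a rough sketch of how the data is shaped, not how Redis, MongoDB, Cassandra, or Neo4j are implemented; every key, field, and value below is made up.

```python
# Key-value store: an opaque value looked up by a unique key.
session_store = {"session:42": "user=ada;expires=1700000000"}

# Document store: the value is a structured document whose fields can be queried.
posts = {
    "post:1": {"title": "Hello", "tags": ["intro"], "author": {"name": "Ada"}},
    "post:2": {"title": "NoSQL notes", "tags": ["databases", "nosql"], "author": {"name": "Grace"}},
}
nosql_titles = [doc["title"] for doc in posts.values() if "nosql" in doc["tags"]]

# Column store: values are grouped by column family, then column, then row key.
column_families = {
    "profile": {"name": {"u1": "Ada", "u2": "Grace"}},
    "activity": {"last_login": {"u1": "2019-03-01", "u2": "2019-03-05"}},
}
all_names = sorted(column_families["profile"]["name"].values())

# Graph store: nodes hold entity data, edges hold the relationships between nodes.
nodes = {"ada": {"kind": "person"}, "grace": {"kind": "person"}, "alan": {"kind": "person"}}
follows = {"ada": {"grace"}, "grace": {"alan"}, "alan": set()}
friends_of_friends = {fof for friend in follows["ada"] for fof in follows[friend]}

print(nosql_titles, all_names, friends_of_friends)
```

Reading a whole column (all_names) touches one nested dictionary, and walking relationships (friends_of_friends) follows edges directly, which hints at why each model suits the use cases listed above.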

3.3.2 Advantages

• High availability

• Schema-free or schema-on-read options

• Ability to rapidly prototype applications

• Elastic scalability

• Can store massive amounts of data

3.3.3 Disadvantages

• Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync.

• Less support and maturity in the NoSQL ecosystem.

3.4 Terminology

Query: A query can be thought of as a single action that is taken on a database.


Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as data stored. The two main types are vertical and horizontal scalability. Vertical scalability is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc. Horizontal scalability has to do with adding more machines and splitting the work amongst them.

Normalization: This is a technique of organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique of organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
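The trade-off between normalization and denormalization can be sketched with plain Python data structures standing in for tables. The order and customer records below are hypothetical.

```python
# Denormalized: the customer's city is repeated on every order row.
denormalized_orders = [
    {"order_id": 1, "customer": "Ada", "city": "London", "amount": 20.0},
    {"order_id": 2, "customer": "Ada", "city": "London", "amount": 5.0},
    {"order_id": 3, "customer": "Grace", "city": "New York", "amount": 12.5},
]

# Normalized: customer attributes live in one place, orders refer to them by key.
customers = {}
orders = []
for row in denormalized_orders:
    customers[row["customer"]] = {"city": row["city"]}
    orders.append(
        {"order_id": row["order_id"], "customer": row["customer"], "amount": row["amount"]}
    )

# Updating a city now touches one record instead of every matching order.
customers["Ada"]["city"] = "Cambridge"

# Reading the combined view again requires a join-like lookup per order.
rejoined = [{**order, "city": customers[order["customer"]]["city"]} for order in orders]
print(rejoined)
```

The normalized form stores each fact once, which protects integrity on updates; the denormalized form avoids the lookup at read time, which is the cost a JOIN would otherwise pay.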

References

• https://dzone.com/articles/the-types-of-modern-databases

• https://www.mongodb.com/nosql-explained

• https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect

• https://opensourceforu.com/2017/05/different-types-nosql-databases


CHAPTER 4

Data Engineering

• Introduction

• Subj_1

4.1 Introduction

xxx

4.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 5

Data Wrangling

• Introduction

• Subj_1

5.1 Introduction

xxx

5.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 6

Data Visualization

• Introduction

• Subj_1

6.1 Introduction

xxx

6.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 7

Deep Learning

• Introduction

• Subj_1

7.1 Introduction

xxx

7.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 8

Machine Learning

• Introduction

• Subj_1

8.1 Introduction

xxx

8.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 9

Python

• Introduction

• Subj_1

9.1 Introduction

xxx

9.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References


CHAPTER 11

SQL

• Introduction

• Subj_1

11.1 Introduction

xxx

11.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 12

Glossary

Definitions of common machine learning terms.

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

• Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in low average difference.

• High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold: The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise return False.

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories:

• True Positives: we correctly predicted they do have diabetes.

• True Negatives: we correctly predicted they don't have diabetes.

• False Positives: we incorrectly predicted they do have diabetes (Type I error).


• False Negatives: we incorrectly predicted they don't have diabetes (Type II error).
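The four confusion matrix categories, together with a classification threshold, can be counted with a few lines of Python. The labels and probabilities below are made up, and the confusion_matrix helper is an illustrative name rather than a library function.

```python
def confusion_matrix(actual, probabilities, threshold=0.5):
    """Count TP, TN, FP, FN for binary labels given predicted probabilities."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, prob in zip(actual, probabilities):
        predicted = prob > threshold  # apply the classification threshold
        if predicted and truth:
            counts["TP"] += 1
        elif not predicted and not truth:
            counts["TN"] += 1
        elif predicted and not truth:
            counts["FP"] += 1  # Type I error
        else:
            counts["FN"] += 1  # Type II error
    return counts


# Made-up labels (True = has diabetes) and model probabilities.
actual = [True, True, False, False, True, False]
probabilities = [0.9, 0.4, 0.6, 0.1, 0.7, 0.3]

matrix = confusion_matrix(actual, probabilities)
accuracy = (matrix["TP"] + matrix["TN"]) / len(actual)
print(matrix, accuracy)
```

Raising or lowering the threshold argument moves observations between the positive and negative predictions, which trades false positives against false negatives.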

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep learning grew out of a machine learning algorithm called the perceptron, or multi-layer perceptron, which is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning is decades old, but today we have more data and cheap computing power, which make this algorithm powerful enough to achieve state-of-the-art accuracy. This algorithm is also known as an artificial neural network. Deep learning is much more than a traditional artificial neural network, but it was highly influenced by machine learning's neural networks and perceptron networks.

Dimension: In machine learning and data science, dimension differs from its meaning in physics. Here, the dimension of data means how many features you have in your data set. E.g. in an object detection application, the flattened image size and color channels (e.g. 28 × 28 × 3) make up the features of the input set. In a house price prediction task where house size is the only feature, we would call it 1-dimensional data.

Epoch: One epoch is one complete pass of the algorithm over the entire data set. The number of epochs is the number of times the algorithm sees the entire data set.

Extrapolation: Making predictions outside the range of a dataset. E.g. My dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. "Color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection: Feature selection is the process of selecting relevant features from a data-set for creating a Machine Learning model.

Feature Vector: A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters: Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of a model. The depth of trees in a Decision Tree or the number of hidden layers in a Neural Network are examples of hyperparameters.

Induction: A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. We keep observing X, so we infer that Y must be True.

Instance: A data point, row, or sample in a dataset. Another term for observation.

Learning Rate: The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
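The effect of the learning rate can be seen on a toy problem. The sketch below assumes the function f(x) = x², whose gradient is 2x and whose minimum is at x = 0; the starting point, step count, and learning rates are arbitrary choices for illustration.

```python
def gradient_descent(learning_rate, steps=25, start=5.0):
    """Minimize f(x) = x**2 (gradient 2*x) and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x
        x -= learning_rate * gradient  # step against the gradient
    return x


small = gradient_descent(0.01)   # precise but slow: still far from 0 after 25 steps
medium = gradient_descent(0.3)   # reaches the neighbourhood of 0 quickly
too_big = gradient_descent(1.1)  # overshoots further on every step and diverges

print(small, medium, too_big)
```

Each step multiplies x by (1 - 2 × learning_rate), so a rate above 1 on this function makes the steps grow instead of shrink.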

Loss: Loss = true_value (from data-set) - predicted_value (from ML-model). The lower the loss, the better a model (unless the model has over-fitted to the training data). The loss is calculated on training and validation, and its interpretation is how well the model is doing for these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation sets.
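One possible way to turn per-example errors into a single loss is sketched below. Squaring each difference before summing is an assumption made here (a common choice, so that positive and negative errors do not cancel); the definition above does not prescribe it, and the values are made up.

```python
def sum_squared_error(true_values, predictions):
    """Sum of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(true_values, predictions))


true_values = [3.0, 5.0, 7.0]
good_predictions = [2.5, 5.0, 7.5]
bad_predictions = [1.0, 8.0, 4.0]

good_loss = sum_squared_error(true_values, good_predictions)  # 0.25 + 0 + 0.25
bad_loss = sum_squared_error(true_values, bad_predictions)    # 4 + 9 + 9

print(good_loss, bad_loss)
```

Computing the same quantity on the training set and on the validation set gives the two numbers that are compared when checking for overfitting.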

Machine Learning: Contribute a definition!

Model: A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks: Contribute a definition!

Normalization: Contribute a definition!


Null Accuracy: Baseline accuracy that can be achieved by always predicting the most frequent class ("B has the highest frequency, so let's guess B every time").
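Null accuracy is straightforward to compute from the labels alone. The following sketch uses the standard library's Counter; the null_accuracy helper name and the labels are made up for illustration.

```python
from collections import Counter


def null_accuracy(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    most_common_class, count = Counter(labels).most_common(1)[0]
    return most_common_class, count / len(labels)


labels = ["B", "A", "B", "B", "C", "B", "A", "B"]
baseline_class, baseline = null_accuracy(labels)
print(baseline_class, baseline)
```

A model whose accuracy is not clearly above this baseline has learned little beyond the class frequencies.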

Observation: A data point, row, or sample in a dataset. Another term for instance.

Overfitting: Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set but poorly on your test set (or new real-world data).

Parameters: Be the first to contribute!

Precision: In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall: Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the actual positive observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

$$R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

Recall vs Precision: Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the % of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren't helpful, however, if 10 of the images contain tumors, because we missed the other 9. We were super precise when we tried, but we didn't try hard enough.

• Recall, or Sensitivity, provides another lens through which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
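The brain scan example can be worked through in a few lines of Python. The counts follow the scenario above (100 scans, 10 with tumors, 1 correct positive guess); the second scenario, in which every scan is flagged, is added here to show how recall can be gamed.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were caught."""
    return tp / (tp + fn)


# 100 scans, 10 with tumors; the model flags 1 scan and it is correct.
tp, fp, fn, tn = 1, 0, 9, 90
p = precision(tp, fp)
r = recall(tp, fn)

# Flagging every scan instead catches all tumors but raises 90 false alarms.
p_all = precision(tp=10, fp=90)
r_all = recall(tp=10, fn=0)

print(p, r, p_all, r_all)
```

Neither metric is informative on its own here, which is why the two are usually reported together.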

Regression: Predicting a continuous output (e.g. price, sales).

Regularization: Contribute a definition!

Reinforcement Learning: Training a model to maximize a reward via iterative trial and error.

Segmentation: Contribute a definition!

Specificity: In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$$

Supervised Learning: Training a model using a labeled dataset.

Test Set: A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set: A set of observations used to generate machine learning models.

Transfer Learning: Contribute a definition!

Type 1 Error: False Positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and they hire him, but he is actually bad.


Type 2 Error: False Negatives. The candidate was great, but the company passed on him.

Underfitting: Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem: A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning: Training a model to find patterns in an unlabeled dataset (e.g. clustering).

Validation Set: A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.

Variance: How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.
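The bias metric and variance defined in this glossary can be computed for a single observation as sketched below. The predictions are hypothetical, imagined as coming from the same kind of model trained on different samples, and population variance is used as one reasonable measure of spread.

```python
from statistics import mean, pvariance

true_value = 10.0

# Hypothetical predictions for one observation from repeatedly retrained models.
consistent_but_off = [13.0, 13.1, 12.9, 13.0]   # high bias, low variance
scattered_but_centred = [6.0, 14.0, 8.0, 12.0]  # low bias, high variance


def bias(predictions, truth):
    """Average difference between the predictions and the correct value."""
    return mean(predictions) - truth


bias_a, var_a = bias(consistent_but_off, true_value), pvariance(consistent_but_off)
bias_b, var_b = bias(scattered_but_centred, true_value), pvariance(scattered_but_centred)

print(bias_a, var_a)
print(bias_b, var_b)
```

The second model shows the caveat from the bias metric entry: its average error is zero even though every individual prediction is wrong, so bias needs to be read alongside variance.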

References


CHAPTER 13

Business

• Introduction

• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 14

Ethics

• Introduction

• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

References


CHAPTER 15

Mastering The Data Science Interview

• Introduction

• The Coding Challenge

• The HR Screen

• The Technical Call

• The Take Home Project

• The Onsite

• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that Data Science will be the sexiest job of the 21st Century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from social sciences to traditional computer science backgrounds. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can have huge variations. This is partly because, as an industry, there still isn't an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.


So before starting to search for a role, it's important to determine what flavor of data science appeals to you. Based on your response to that, what you study and what questions you'll be asked will vary. Despite the differences between the types, generally speaking they'll follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple Fizzbuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.


More often than not, you'll be provided with written test cases that will tell you if you've passed or failed a question. This will typically consider both correctness as well as complexity (i.e. how long did it take to run your code). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.
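As a small illustration of writing your own test cases, here is one possible FizzBuzz solution with a handful of checks. The usual rules are assumed (multiples of 3 give "Fizz", multiples of 5 give "Buzz", multiples of both give "FizzBuzz"); a real challenge may word them differently.

```python
def fizzbuzz(n):
    """Return 'Fizz', 'Buzz', 'FizzBuzz', or the number as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


# Self-written test cases: ordinary values plus the boundaries between rules.
cases = {1: "1", 3: "Fizz", 5: "Buzz", 15: "FizzBuzz", 30: "FizzBuzz", 98: "98"}
for value, expected in cases.items():
    assert fizzbuzz(value) == expected, (value, fizzbuzz(value), expected)

print("all cases passed")
```

Picking cases at the edges of each rule (3, 5, 15) tends to catch more mistakes than testing many similar values.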

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the 'correct' solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally when, even after passing all the test cases, you don't get to the next stage in the interview process.

Preparation

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first. When you hit a snag, move to the simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5–10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.


4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google doc. Thus you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in, however this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving Dynamic Programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
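The brute-force-then-optimize approach might look like the following sketch. The problem itself (does any pair of numbers in a list add up to a target?) is a hypothetical example chosen for illustration, not one taken from a particular interview.

```python
def has_pair_with_sum_brute_force(numbers, target):
    """O(n^2): check every pair. Easy to get right, slow on large inputs."""
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if numbers[i] + numbers[j] == target:
                return True
    return False


def has_pair_with_sum(numbers, target):
    """O(n): remember the values seen so far and look up each complement."""
    seen = set()
    for number in numbers:
        if target - number in seen:
            return True
        seen.add(number)
    return False


# The brute-force version doubles as a reference when checking the faster one.
samples = [([1, 4, 6, 9], 10), ([1, 4, 6, 9], 3), ([5], 10), ([], 0), ([2, 2], 4)]
for numbers, target in samples:
    assert has_pair_with_sum(numbers, target) == has_pair_with_sum_brute_force(numbers, target)

print("optimized version agrees with brute force")
```

Keeping the first version around gives you something working to talk through, and a way to sanity-check the optimized version on small inputs.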

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking The Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops, since they tend to be more closely tied to what you'll be doing once you start working. They can occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g. Here's a dataset, find some insights and pitch to business stakeholders) or focus on a more concrete deliverable (e.g. Here's some data, build a classifier).

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis. Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution with customer revenue).
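A first pass at Exploratory Data Analysis can be done with the standard library alone. The sketch below uses made-up revenue figures and looks for the kind of peculiarity mentioned above, where a small share of customers accounts for most of the revenue; a real take-home would load the provided data and would likely use richer tooling.

```python
from statistics import mean, median

# Made-up revenue per customer; a real take-home would load the provided file.
revenue = [12, 15, 9, 11, 14, 10, 13, 8, 950, 16]

summary = {
    "count": len(revenue),
    "mean": mean(revenue),
    "median": median(revenue),
    "max": max(revenue),
}

# Share of total revenue contributed by the top 10% of customers.
top_n = max(1, len(revenue) // 10)
top_share = sum(sorted(revenue, reverse=True)[:top_n]) / sum(revenue)

print(summary)
print(round(top_share, 3))
```

A mean far above the median, together with a large top-customer share, is a hint that the distribution is heavily skewed and worth calling out in your findings.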

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation:

1. Practice take-home challenges, which you can either purchase from datamasked or find by looking at the answers without the questions on this GitHub repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips:

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite. The rationale is that, despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more Leetcode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which typically evaluates your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow the field down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation:

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioural interview questions such as 'Tell me about yourself', 'Why this company?', and 'Tell me about a time you had to deal with a difficult colleague'.


4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips:

1. Don't be too serious. Most of these interviewers would rather be back at their desks working on their assigned projects, so try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500 company, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point, it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the option to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, the signing/relocation bonus is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401k matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.
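To make the multiplier argument concrete, here is a small back-of-the-envelope sketch. Every number in it (the salaries, the 3% annual raise, the 4% 401k match, the four-year horizon) is a hypothetical assumption chosen for illustration, not a figure from any particular company, and total_compensation is a made-up helper. It compares a one-time signing bonus against a smaller increase in base salary that compounds through raises and benefit matching.

```python
def total_compensation(base, signing_bonus=0.0, years=4,
                       annual_raise=0.03, match_rate=0.04):
    """Cumulative pay over `years`, assuming a fixed percentage raise each
    year and an employer 401k match paid as a percentage of base salary."""
    total = signing_bonus
    salary = base
    for _ in range(years):
        total += salary * (1 + match_rate)  # salary plus employer match
        salary *= 1 + annual_raise          # raise applies to next year's base
    return total


# Hypothetical offers: a $10,000 signing bonus vs. a $5,000 higher base.
with_bonus = total_compensation(100_000, signing_bonus=10_000)
with_base = total_compensation(105_000)
print(f"signing bonus: {with_bonus:,.0f}")
print(f"higher base:   {with_base:,.0f}")
```

Under these assumptions the signing bonus is worth more in the first year, while the higher base is worth more over the full four-year horizon. Changing the raise, match, or horizon changes the outcome, which is why it pays to run the numbers for your own situation.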

Preparation:

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips:

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.


Remember, interviewing is a skill that can be learned, just like anything else. Hopefully this article has given you some insight into what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.


CHAPTER 16

Learning How To Learn

• Introduction
• Upgrading Your Learning Toolbox
• Common Learning Traps
• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall: This strategy consists of quizzing yourself on topics you've been learning, with the most common way of doing this being flashcards. One study showed a retention difference of 46% between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spacing is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
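The review schedule above can be sketched in a few lines of Python. This is a minimal illustration, assuming the 1, 3, 7, and 21 day offsets are all counted from the day the material was first studied (the text does not say whether they are cumulative); review_dates and REVIEW_OFFSETS are names invented for this example.

```python
from datetime import date, timedelta

# Review offsets in days, taken from the schedule described above.
REVIEW_OFFSETS = (1, 3, 7, 21)


def review_dates(first_studied, offsets=REVIEW_OFFSETS):
    """Return the dates on which a flashcard should be reviewed."""
    return [first_studied + timedelta(days=d) for d in offsets]


for due in review_dates(date(2019, 3, 19)):
    print(due.isoformat())
```

A flashcard tool would typically store the next due date per card and adjust the offsets based on whether you answered correctly; the fixed offsets here are only the simplest version of the idea.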

Chunking: This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving: To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type led to students not being able to tell the difference between problems later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example Chemistry and Biology, where you'll be able to note both similarities and differences.

Deliberate Practice: This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that some subjects lend themselves to deliberate practice more easily than others, the gold standard of deliberate practice generally consists of the following:

• A feedback loop in the form of a competition or a test
• A teacher that can guide you
• A uniquely tailored practice regimen
• Work outside of your comfort zone
• A concrete goal or performance
• Full concentration while practicing
• Practice that builds or modifies skills
• A known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher: An experienced teacher can mean the difference between breaking through and breaking down. As such, it's important to look for a teacher who:

• Keeps you up to date on your specific progress
• Provides practice exercises
• Directs your attention to the aspects of learning you should be focusing on
• Helps develop correct mental representations
• Provides feedback on what errors you're making
• Gives you an aspirational model for what good performance looks like

Dynamic Testing: As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro Technique: This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes, during which you focus on one single thing. Once the timer is up, you take a 5 or 10-minute break before starting again.
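For readers who like to plan sessions ahead of time, the loop described above can be laid out as a simple schedule. The sketch below is only an illustration: pomodoro_schedule is a hypothetical helper, the default of four sessions is an assumption, and the 25/5 minute split is one of the two options mentioned in the text.

```python
from datetime import datetime, timedelta


def pomodoro_schedule(start, sessions=4, focus_minutes=25, break_minutes=5):
    """Return (label, start, end) tuples alternating focus and break blocks."""
    blocks = []
    cursor = start
    for i in range(sessions):
        end = cursor + timedelta(minutes=focus_minutes)
        blocks.append((f"focus {i + 1}", cursor, end))
        cursor = end
        if i < sessions - 1:  # no break is scheduled after the final session
            end = cursor + timedelta(minutes=break_minutes)
            blocks.append(("break", cursor, end))
            cursor = end
    return blocks


for label, begin, end in pomodoro_schedule(datetime(2019, 3, 19, 9, 0)):
    print(f"{begin:%H:%M}-{end:%H:%M} {label}")
```

Passing focus_minutes=50 and break_minutes=10 gives the longer variant.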

16.3 Common Learning Traps

Authority Trap: The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.


Confirmation Bias: Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect: Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung: Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion: Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them, and then, when tested, fail to get any of the answers correct. Having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking: Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy/agitated/upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

• https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7
• https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn
• https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

• https://www.coursera.org/learn/learning-how-to-learn


CHAPTER 17

Communication

• Introduction
• Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 18

Product

• Introduction
• Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References


CHAPTER 19

Stakeholder Management

• Introduction
• Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

• Agriculture
• Art
• Biology
• Chemistry/Materials Science
• Climate/Weather
• Complex Networks
• Computer Networks
• Data Challenges
• Earth Science
• Economics
• Education
• Energy
• Finance
• GIS
• Government
• Healthcare
• Image Processing
• Machine Learning
• Museums


• Music
• Natural Language
• Neuroscience
• Physics
• Psychology/Cognition
• Public Domains
• Search Engines
• Social Networks
• Social Sciences
• Software
• Sports
• Time Series
• Transportation

20.1 Agriculture

• US Department of Agriculture's PLANTS Database
• US Department of Agriculture's Nutrient Database

20.2 Art

• Google's Quick Draw Sketch Dataset

20.3 Biology

• 1000 Genomes
• American Gut (Microbiome Project)
• Broad Bioimage Benchmark Collection (BBBC)
• Broad Cancer Cell Line Encyclopedia (CCLE)
• Cell Image Library
• Complete Genomics Public Data
• EBI ArrayExpress
• EBI Protein Data Bank in Europe
• Electron Microscopy Pilot Image Archive (EMPIAR)
• ENCODE project
• Ensembl Genomes


• Gene Expression Omnibus (GEO)
• Gene Ontology (GO)
• Global Biotic Interactions (GloBI)
• Harvard Medical School (HMS) LINCS Project
• Human Genome Diversity Project
• Human Microbiome Project (HMP)
• ICOS PSP Benchmark
• International HapMap Project
• Journal of Cell Biology DataViewer
• MIT Cancer Genomics Data
• NCBI Proteins
• NCBI Taxonomy
• NCI Genomic Data Commons
• NIH Microarray data or FTP (see FTP link on RAW)
• OpenSNP genotypes data
• Pathguide - Protein-Protein Interactions Catalog
• Protein Data Bank
• Psychiatric Genomics Consortium
• PubChem Project
• PubGene (now Coremine Medical)
• Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)
• Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
• Sequence Read Archive (SRA)
• Stanford Microarray Data
• Stowers Institute Original Data Repository
• Systems Science of Biological Dynamics (SSBD) Database
• The Cancer Genome Atlas (TCGA), available via Broad GDAC
• The Catalogue of Life
• The Personal Genome Project or PGP
• UCSC Public Data
• UniGene
• Universal Protein Resource (UniProt)


20.4 Chemistry/Materials Science

• NIST Computational Chemistry Comparison and Benchmark Database - SRD 101
• Open Quantum Materials Database
• Citrination Public Datasets
• Khazana Project

20.5 Climate/Weather

• Actuaries Climate Index
• Australian Weather
• Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system
• Brazilian Weather - Historical data (in Portuguese)
• Canadian Meteorological Centre
• Climate Data from UEA (updated monthly)
• European Climate Assessment & Dataset
• Global Climate Data Since 1929
• NASA Global Imagery Browse Services
• NOAA Bering Sea Climate
• NOAA Climate Datasets
• NOAA Realtime Weather Models
• NOAA SURFRAD Meteorology and Radiation Datasets
• The World Bank Open Data Resources for Climate Change
• UEA Climatic Research Unit
• WorldClim - Global Climate Data
• WU Historical Weather Worldwide

20.6 Complex Networks

• AMiner Citation Network Dataset
• CrossRef DOI URLs
• DBLP Citation dataset
• DIMACS Road Networks Collection
• NBER Patent Citations
• Network Repository with Interactive Exploratory Analysis Tools
• NIST complex networks data collection
• Protein-protein interaction network


• PyPI and Maven Dependency Network
• Scopus Citation Database
• Small Network Data
• Stanford GraphBase (Steven Skiena)
• Stanford Large Network Dataset Collection
• Stanford Longitudinal Network Data Sources
• The Koblenz Network Collection
• The Laboratory for Web Algorithmics (UNIMI)
• The Nexus Network Repository
• UCI Network Data Repository
• UFL sparse matrix collection
• WSU Graph Database

20.7 Computer Networks

• 3.5B Web Pages from CommonCrawl 2012
• 53.5B Web clicks of 100K users in Indiana Univ.
• CAIDA Internet Datasets
• ClueWeb09 - 1B web pages
• ClueWeb12 - 733M web pages
• CommonCrawl Web Data over 7 years
• CRAWDAD Wireless datasets from Dartmouth Univ.
• Criteo click-through data
• OONI Open Observatory of Network Interference - Internet censorship data
• Open Mobile Data by MobiPerf
• Rapid7 Sonar Internet Scans
• UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

• Bruteforce Database
• Challenges in Machine Learning
• CrowdANALYTIX dataX
• D4D Challenge of Orange
• DrivenData Competitions for Social Good
• ICWSM Data Challenge (since 2009)
• Kaggle Competition Data


• KDD Cup by Tencent 2012
• Localytics Data Visualization Challenge
• Netflix Prize
• Space Apps Challenge
• Telecom Italia Big Data Challenge
• TravisTorrent Dataset - MSR'2017 Mining Challenge
• Yelp Dataset Challenge

20.9 Earth Science

• AQUASTAT - Global water resources and uses
• BODC - marine data of ~22K vars
• Earth Models
• EOSDIS - NASA's earth observing system data
• Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3
• Marinexplore - Open Oceanographic Data
• Smithsonian Institution Global Volcano and Eruption Database
• USGS Earthquake Archives

20.10 Economics

• American Economic Association (AEA)
• EconData from UMD
• Economic Freedom of the World Data
• Historical MacroEconomic Statistics
• International Economics Database and various data tools
• International Trade Statistics
• Internet Product Code Database
• Joint External Debt Data Hub
• Jon Haveman International Trade Data Links
• OpenCorporates Database of Companies in the World
• Our World in Data
• SciencesPo World Trade Gravity Datasets
• The Atlas of Economic Complexity
• The Center for International Data
• The Observatory of Economic Complexity
• UN Commodity Trade Statistics


• UN Human Development Reports

20.11 Education

• College Scorecard Data
• Student Data from Free Code Camp

20.12 Energy

• AMPds
• BLUEd
• COMBED
• Dataport
• DRED
• ECO
• EIA
• HES - Household Electricity Study, UK
• HFED
• iAWE
• PLAID - the Plug Load Appliance Identification Dataset
• REDD
• Tracebase
• UK-DALE - UK Domestic Appliance-Level Electricity
• WHITED

20.13 Finance

• CBOE Futures Exchange
• Google Finance
• Google Trends
• NASDAQ
• NYSE Market Data (see FTP link on RAW)
• OANDA
• OSU Financial data
• Quandl
• St. Louis Federal
• Yahoo Finance


20.14 GIS

• ArcGIS Open Data portal
• Cambridge, MA, US, GIS data on GitHub
• Factual Global Location Data
• Geo Spatial Data from ASU
• Geo Wiki Project - Citizen-driven Environmental Monitoring
• GeoFabrik - OSM data extracted to a variety of formats and areas
• GeoNames Worldwide
• Global Administrative Areas Database (GADM)
• Homeland Infrastructure Foundation-Level Data
• Landsat 8 on AWS
• List of all countries in all languages
• National Weather Service GIS Data Portal
• Natural Earth - vectors and rasters of the world
• OpenAddresses
• OpenStreetMap (OSM)
• Pleiades - Gazetteer and graph of ancient places
• Reverse Geocoder using OSM data & additional high-resolution data files
• TIGER/Line - US boundaries and roads
• TwoFishes - Foursquare's coarse geocoder
• TZ Timezones shapefiles
• UN Environmental Data
• World boundaries from the US Department of State
• World countries in multiple formats

20.15 Government

• A list of cities and countries contributed by community
• Open Data for Africa
• OpenDataSoft's list of 1,600 open data

20.16 Healthcare

• EHDP Large Health Data Sets
• Gapminder World demographic databases
• Medicare Coverage Database (MCD), US


• Medicare Data Engine of medicare.gov Data
• Medicare Data File
• MeSH, the vocabulary thesaurus used for indexing articles for PubMed
• Number of Ebola Cases and Deaths in Affected Countries (2014)
• Open-ODS (structure of the UK NHS)
• OpenPaymentsData, Healthcare financial relationship data
• The Cancer Genome Atlas project (TCGA) and BigQuery table
• World Health Organization Global Health Observatory

20.17 Image Processing

• 10k US Adult Faces Database
• 2GB of Photos of Cats or Archive version
• Adience Unfiltered faces for gender and age classification
• Affective Image Classification
• Animals with attributes
• Caltech Pedestrian Detection Benchmark
• Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)
• Face Recognition Benchmark
• GDXray: X-ray images for X-ray testing and Computer Vision
• ImageNet (in WordNet hierarchy)
• Indoor Scene Recognition
• International Affective Picture System, UFL
• Massive Visual Memory Stimuli, MIT
• MNIST database of handwritten digits, near 1 million examples
• Several Shape-from-Silhouette Datasets
• Stanford Dogs Dataset
• SUN database, MIT
• The Action Similarity Labeling (ASLAN) Challenge
• The Oxford-IIIT Pet Dataset
• Violent-Flows - Crowd Violence / Non-violence Database and benchmark
• Visual genome
• YouTube Faces Database


20.18 Machine Learning

• Context-aware data sets from five domains
• Delve Datasets for classification and regression (Univ. of Toronto)
• Discogs Monthly Data
• eBay Online Auctions (2012)
• IMDb Database
• Keel Repository for classification, regression, and time series
• Labeled Faces in the Wild (LFW)
• Lending Club Loan Data
• Machine Learning Data Set Repository
• Million Song Dataset
• More Song Datasets
• MovieLens Data Sets
• New Yorker caption contest ratings
• RDataMining - "R and Data Mining" ebook data
• Registered Meteorites on Earth
• Restaurants Health Score Data in San Francisco
• UCI Machine Learning Repository
• Yahoo Ratings and Classification Data
• Youtube 8m

20.19 Museums

• Canada Science and Technology Museums Corporation's Open Data
• Cooper-Hewitt's Collection Database
• Minneapolis Institute of Arts metadata
• Natural History Museum (London) Data Portal
• Rijksmuseum Historical Art Collection
• Tate Collection metadata
• The Getty vocabularies

20.20 Music

• Nottingham Folk Songs
• Bach 10


20.21 Natural Language

• Automatic Keyphrase Extraction
• Blogger Corpus
• CLiPS Stylometry Investigation Corpus
• ClueWeb09 FACC
• ClueWeb12 FACC
• DBpedia - 4.58M things with 583M facts
• Flickr Personal Taxonomies
• Freebase.com of people, places, and things
• Google Books Ngrams (2.2TB)
• Google MC-AFP, generated based on the publicly available Gigaword dataset using Paragraph Vectors
• Google Web 5gram (1TB, 2006)
• Gutenberg eBooks List
• Hansards text chunks of Canadian Parliament
• Machine Comprehension Test (MCTest) of text from Microsoft Research
• Machine Translation of European languages
• Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
• Multi-Domain Sentiment Dataset (version 2.0)
• Open Multilingual Wordnet
• Personae Corpus
• SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
• SMS Spam Collection in English
• Universal Dependencies
• USENET postings corpus of 2005~2011
• Webhose - News/Blogs in multiple languages
• Wikidata - Wikipedia databases
• Wikipedia Links data - 40 Million Entities in Context
• WordNet databases and tools

20.22 Neuroscience

• Allen Institute Datasets
• Brain Catalogue
• Brainomics
• CodeNeuro Datasets
• Collaborative Research in Computational Neuroscience (CRCNS)


• FCP-INDI
• Human Connectome Project
• NDAR
• NeuroData
• Neuroelectro
• NIMH Data Archive
• OASIS
• OpenfMRI
• Study Forrest

20.23 Physics

• CERN Open Data Portal
• Crystallography Open Database
• NASA Exoplanet Archive
• NSSDC (NASA) data of 550 space spacecraft
• Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

• OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

• Amazon
• Archive-it from Internet Archive
• Archive.org Datasets
• CMU JASA data archive
• CMU StatLab collections
• Data.World
• Data360
• Datamob.org
• Google
• Infochimps
• KDNuggets Data Collections
• Microsoft Azure Data Market Free DataSets
• Microsoft Data Science for Research


• Numbray
• Open Library Data Dumps
• Reddit Datasets
• RevolutionAnalytics Collection
• Sample R data sets
• Stats4Stem R data sets
• StatSci.org
• The Washington Post List
• UCLA SOCR data collection
• UFO Reports
• Wikileaks 9/11 pager intercepts
• Yahoo Webscope

20.26 Search Engines

• Academic Torrents of data sharing from UMB
• Datahub.io
• DataMarket (Qlik)
• Harvard Dataverse Network of scientific data
• ICPSR (UMICH)
• Institute of Education Sciences
• National Technical Reports Library
• Open Data Certificates (beta)
• OpenDataNetwork - A search engine of all Socrata powered data portals
• Statista.com - Statistics and Studies
• Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

• 72 hours #gamergate Twitter Scrape
• Ancestry.com Forum Dataset over 10 years
• Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape
• CMU Enron Email of 150 users
• EDRM Enron EMail of 151 users, hosted on S3
• Facebook Data Scrape (2005)
• Facebook Social Networks from LAW (since 2007)
• Foursquare from UMN/Sarwat (2013)


bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull Youtube Video Social Graph in 2007, 2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive

bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (ND-GAIN)

bull Open Crime and Policing Data in England, Wales and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1 from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database

bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks, port calls and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport, airline and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome-machine-learning list.

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python

bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF)

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph

bull CUDA - This is a fast C++/CUDA implementation of convolutional [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework with less than 1000 lines of core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state-of-the-art machine learning easy to work with and integrate into existing applications

bull Distributed Machine Learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that change for every training instance. Written in C++ with bindings in Python

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and allows processing data in batch, online and distributed modes

bull LightGBM - A framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization and storage

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC - A parallel implementation of Connectionist Temporal Classification (CTC) on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ for segmenting/labeling sequential data & other Natural Language Processing tasks

bull CRFsuite for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user-defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer - A stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression

General-Purpose Machine Learning

bull gago - Multi-population flexible parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with a pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Cortical.io as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Named Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System

bull Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system written in Java

bull Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK - A framework for developing statistical natural language processing (NLP) components in Java, built on top of Apache UIMA

bull Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license

bull CogcompNLP developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - Statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application

Speech Recognition

bull CMU Sphinx - Open source toolkit for speech recognition; a purely Java-based speech recognition library

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in JavaScript and CoffeeScript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3d plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - Customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - Library written in vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnet.js - ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means fuzzy c-means and agglomerative clustering

bull Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js

bull Kmeans.js - Simple Javascript implementation of the k-means algorithm, for node.js and the browser

bull LDA.js - LDA topic modeling for node.js

bull Learning.js - Javascript implementation of logistic regression/c4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration - just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has advantages on large datasets and multi-threaded training

bull kalman - Kalman filter for Javascript

bull shaman - node.js library with support for both simple and multiple linear regression

bull ml.js - Machine learning and numerical analysis tools for Node.js and the Browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

bull simple-statistics as well as in node.js

bull regression-js - A javascript library containing a collection of least squares fitting methods for finding a trend in a set of data

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression, so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models - Mixed-effects models

bull Simple MCMC - Basic MCMC sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba - Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata SAS and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version

bull graph - Graph package for Torch

bull randomkit - Numpy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - Framework for Torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua and then optimizing them with OpenGM

bull spaghetti module for torch7 by MichaelMathieu

bull LuaSHKit - A Lua wrapper around the locality-sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms

bull stitch - Allows us to use hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - A technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning with PDFs, lectures, and code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB.

Data Analysis / Data Visualization

bull matlab_gbl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross-platform wrapper of OpenCV which can be compiled in Mono to run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT, and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing, and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models, and general algorithms such as RANSAC, Cross-validation, and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.

bull DiffSharp - An automatic differentiation library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation, or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering, and everyday use. Supports .NET 4.0, .NET 3.5, and Mono on Windows, Linux, and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1, and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.

bull Sho - An environment built to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples using a trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.

bull MAChineLearning - An Objective-C multilayer perceptron library with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift.

bull BPN-NeuralNetwork - This network can be used in product recommendation, user behavior analysis, data mining, and data analysis

bull Multi-Perceptron-NeuralNetwork - A multi-perceptron neural network designed with unlimited hidden layers

bull KRHebbian-Algorithm - A Hebbian learning algorithm for neural networks in machine learning

bull KRKmeans-Algorithm - Implements K-Means, the clustering and classification algorithm. It can be used in data mining and image compression.

bull KRFuzzyCMeans-Algorithm - Implements Fuzzy C-Means, the fuzzy clustering and classification algorithm in machine learning. It can be used in data mining and image compression.

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction, and much more in one library.

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.

bull Quepy - A Python framework to transform natural language questions to queries in a database query language

bull TextBlob - A consistent API for common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Fields

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, and GIZA++ alignments.

bull python-ucto

bull python-frog

bull python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to a C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - higher-level NLP built on spaCy

bull stanford-corenlp-python - Python wrapper for Stanford CoreNLP
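Several entries in the list above (Distance, editdistance, jellyfish) deal with string distances. As a rough illustration of what those two metrics measure, here is a minimal pure-Python sketch. It does not use any of the listed libraries, and the function names `hamming` and `levenshtein` are illustrative assumptions, not any package's API.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b (dynamic programming,
    keeping only the previous row of the table)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))
    print(levenshtein("kitten", "sitting"))
```

The listed packages typically provide optimized (often C-backed) implementations, so a sketch like this is mainly useful for understanding the metrics rather than for production use.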

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning.

bull machine learning - Automated build consisting of a web-interface and a set of programmatic interfaces; generated models are stored into a NoSQL datastore

bull XGBoost - Python bindings for the eXtreme Gradient Boosting library

bull Bayesian Methods for Hackers - Book/iPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch, or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented, and tested library.

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A library with various machine learning models implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 - A Machine Learning library based on Theano

bull keras - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible, and fun neural networks. This is the successor of PyBrain.

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks - Implementation of image to image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Library for approximate inference in Bayesian hidden Markov models (HMMs) and hidden semi-Markov models (HSMMs), focusing on the Bayesian nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing, GPU-meta-programming, code-generating, array-oriented math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit.

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for the book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning.

bull pgmpy - A Python library for working with Probabilistic Graphical Models

bull DIGITS - A web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides a better user experience.

bull rgf_python - Python bindings for the Regularized Greedy Forest library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree, and more. It's easy to use and easy to learn for beginners.

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics science and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Python JIT (just-in-time) compiler to LLVM, aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion, based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser.

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing submitting debugging and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual intuitive and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has a Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in Python without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspel - raspell is an interface binding for ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists, and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof - Community based data collection, packed in a gem. Get a list of pretty much anything (stop words, countries, non words) in txt, json, or hash. Demo/Search for a list.

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license

bull rustlearn - A machine learning framework featuring logistic regression, support vector machines, decision trees, and random forests

bull rusty-machine - A pure-Rust machine learning library

bull leaf - Open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz: Regularization for semiparametric additive hazards regression

bull arules - arules: Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso: Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets

bull bigRR - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm: Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta: A wrapper algorithm for all-relevant feature selection

bull bst - bst: Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training: unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn: Classification, regression, feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist: Rule- and Instance-Based Regression Modeling

bull e1071 - e1071: Misc Functions of the Department of Statistics (e1071), TU Wien

bull earth - earth: Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree: Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc: Flexible procedures for clustering

bull frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS

bull gbm - gbm: Generalized Boosted Regression Models

bull glmnet - glmnet: Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso: Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg: Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deep learning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda: Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred: Improved Predictors

bull kernlab - kernlab: Kernel-based Machine Learning Lab

bull klaR - klaR: Classification and visualization

bull lars - lars: Least Angle Regression, Lasso and Forward Stagewise

bull lasso2 - lasso2: L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg: Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree: Mapping, pruning, and graphing tree models

bull mboost - mboost: Model-Based Boosting

bull medley - medley: Blending regression models, using a greedy stepwise approach

bull mlr - mlr: Machine Learning in R

bull mvpart - mvpart: Multivariate partitioning

bull ncvreg - ncvreg: Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree: Oblique Trees for Classification Data

bull pamr - pamr: Pam: prediction analysis for microarrays

bull party - party: A Laboratory for Recursive Partytioning

bull partykit - partykit: A Toolkit for Recursive Partytioning

bull penalized - penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA: Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM: Feature Selection SVM using penalty functions

bull quantregForest - quantregForest: Quantile Regression Forests

bull randomForest - randomForest: Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle: Graphical user interface for data mining in R

bull rda - rda: Shrunken Centroids Regularized Discriminant Analysis

bull rdetools - rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces

bull REEMtree - REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data

bull relaxo - relaxo: Relaxed Lasso

bull rgenoud - rgenoud: R version of GENetic Optimization Using Derivatives

bull rgp - rgp: R genetic programming framework

bull Rmalschains - Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R

bull rminer - rminer: Simpler use of data mining methods (e.g. NN and SVM) in classification and regression

bull ROCR - ROCR: Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets: Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart: Recursive Partitioning and Regression Trees

bull RPMM - RPMM: Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka: R/Weka interface

bull RXshrink - RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda: Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA: Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath: svmpath: the SVM Path algorithm

bull tgp - tgp: Bayesian treed Gaussian process models

bull tree - tree: Classification and regression trees

bull varSelRF - varSelRF: Variable selection using random forests

bull XGBoost.R - R binding for the eXtreme Gradient Boosting library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R.

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis / Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis / Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - Free. Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses.

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python, and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference

Data Analysis / Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data. A unix filter for algebird-powered aggregation

bull PredictionIO - a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe - Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

2125 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - an Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression and Ridge Regression

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization Training Techniques

ndash Unsupervised Generative Models

ndash Image Segmentation Object Detection

ndash Image Video

ndash Natural Language Processing

ndash Speech Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

221 Machine Learning

Be the first to contribute

222 Deep Learning

Forked from terryum's awesome deep learning papers

2221 Understanding

bull Distilling the knowledge in a neural network (2015) G Hinton et al [pdf]

bull Deep neural networks are easily fooled High confidence predictions for unrecognizable images (2015) A Nguyen et al [pdf]

bull How transferable are features in deep neural networks (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf An astounding baseline for recognition (2014) A Razavian et al [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014) M Oquab et al [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

2222 Optimization Training Techniques

bull Batch normalization Accelerating deep network training by reducing internal covariate shift (2015) S Ioffe and C Szegedy [pdf]

bull Delving deep into rectifiers Surpassing human-level performance on imagenet classification (2015) K He et al [pdf]

bull Dropout A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]

2223 Unsupervised Generative Models

bull Pixel recurrent neural networks (2016) A Oord et al [pdf]

bull Improved techniques for training GANs (2016) T Salimans et al [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015) A Radford et al [pdf]

bull DRAW A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

2224 Image Segmentation Object Detection

bull You only look once Unified real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

2225 Image Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

2226 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

2227 Speech Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

2228 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

2229 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22210 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009) H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses and more, forked from josephmisiti's awesome machine learning

bull Blogs

ndash Data Science

ndash Machine learning

ndash Math

bull Books

ndash Machine learning

ndash Deep learning

ndash Probability & Statistics

ndash Linear Algebra

bull Courses

bull Podcasts

bull Tutorials

231 Blogs

2311 Data Science

bull httpsjeremykuncom

bull httpiamtraskgithubio

bull httpblogexplainmydatacom

bull httpandrewgelmancom

bull httpsimplystatisticsorg

bull httpwwwevanmillerorg

bull httpjakevdpgithubio

bull httpblogyhatcom

bull httpwesmckinneycom

bull httpwwwoverkillanalyticsnet

bull httpnewtoncx~peter

bull httpmbakker7githubioexploratory_computing_with_python

bull httpssebastianraschkacomblogindexhtml

bull httpcamdavidsonpilongithubioProbabilistic-Programming-and-Bayesian-Methods-for-Hackers

bull httpcolahgithubio

bull httpwwwthomasdimsoncom

bull httpblogsmellthedatacom

bull httpssebastianraschkacom

bull httpdogdogfishcom

bull httpwwwjohnmyleswhitecom

bull httpdrewconwaycomzia

bull httpbugragithubio

bull httpopendatacernch

bull httpsalexanderetzcom

bull httpwwwsumsarnet

bull httpswwwcountbayesiecom

bull httpblogkagglecom

bull httpwwwdanvkorg

bull httphunchnet

bull httpwwwrandalolsoncomblog

bull httpswwwjohndcookcomblogr_language_for_programmers

bull httpwwwdataschoolio

2312 Machine learning

bull OpenAI

bull Distill

bull Andrej Karpathy Blog

bull Colah's Blog

bull WildML

bull FastML

bull TheMorningPaper

2313 Math

bull httpwwwsumsarnet

bull httpallendowneyblogspotca

bull httpshealthyalgorithmscom

bull httpspetewardencom

bull httpmrtzorgblog

232 Books

2321 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory Inference and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and SVN Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch

bull Bayesian Modeling Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning Book+MatlabToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

2322 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

bull NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

2323 Probability & Statistics

bull Think Stats - Book + Python Code

bull From Algorithms to Z-Scores - Book

bull The Art of R Programming

bull Introduction to statistical thought

bull Basic Probability Theory

bull Introduction to probability - By Dartmouth College

bull Principle of Uncertainty

bull Probability & Statistics Cookbook

bull Advanced Data Analysis From An Elementary Point of View

bull Introduction to Probability - Book and course by MIT

bull The Elements of Statistical Learning Data Mining Inference and Prediction - Book

bull An Introduction to Statistical Learning with Applications in R - Book

bull Learning Statistics Using R

bull Introduction to Probability and Statistics Using R - Book

bull Advanced R Programming - Book

bull Practical Regression and Anova using R - Book

bull R practicals - Book

bull The R Inferno - Book

2324 Linear Algebra

bull Linear Algebra Done Wrong

bull Linear Algebra Theory and Applications

bull Convex Optimization

bull Applied Numerical Computing

bull Applied Numerical Linear Algebra

233 Courses

bull CS231n Convolutional Neural Networks for Visual Recognition Stanford University

bull CS224d Deep Learning for Natural Language Processing Stanford University

bull Oxford Deep NLP 2017 Deep Learning for Natural Language Processing University of Oxford

bull Artificial Intelligence (Columbia University) - free

bull Machine Learning (Columbia University) - free

bull Machine Learning (Stanford University) - free

bull Neural Networks for Machine Learning (University of Toronto) - free

bull Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach, Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering & Retrieval, Machine Learning: Recommender Systems & Dimensionality Reduction, Machine Learning Capstone: An Intelligent Application with Deep Learning; free

bull Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

bull Learning from Data (by Yaser S Abu-Mostafa Caltech) - Lecture videos available

234 Podcasts

bull The O'Reilly Data Show

bull Partially Derivative

bull The Talking Machines

bull The Data Skeptic

bull Linear Digressions

bull Data Stories

bull Learning Machines 101

bull Not So Standard Deviations

bull TWIMLAI

235 Tutorials

Be the first to contribute

CHAPTER 24

Contribute

Become a contributor! Check out our GitHub for more information.

32 Relational

A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes) and rows that represent the data within the table. You can think of a column as the header indicating the name of the data being stored, and a row as the actual data points that are stored in the database. Often these rows have a unique identifier associated with them, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.

To get data out of a relational database you use SQL, with different relational databases using slightly different dialects of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a transaction. A transaction in a relational database represents a unit of work performed against the database.
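
The ideas above can be sketched with Python's built-in sqlite3 module and an in-memory database. This is a minimal illustration rather than a recommended schema: the table and column names (authors, books, author_id) are hypothetical and do not come from the primer.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Each table has a primary key; books.author_id is a foreign key to authors.id.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE books ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES authors(id))"
)

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO books (title, author_id) VALUES ('Notes', 1)")
conn.commit()

# The foreign key is what lets us join the two tables back together.
rows = conn.execute(
    "SELECT books.title, authors.name"
    " FROM books JOIN authors ON books.author_id = authors.id"
).fetchall()
print(rows)
```

The join returns one row per book, paired with the name of the author its foreign key points to.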

An important concept with transactions in relational databases is maintaining ACID, which stands for:

bull Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails then the whole transaction fails, leaving the database unchanged.

bull Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.

bull Isolation: Since transactions are often executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially.

bull Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
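
Atomicity in particular is easy to see in a small experiment. The sketch below assumes Python's built-in sqlite3 module; the accounts table and its CHECK constraint are hypothetical, chosen so that the second statement in the transaction fails and the first is rolled back along with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception escapes it.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
        # This would make alice's balance negative, violating the CHECK constraint.
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # the whole transfer was rolled back, including bob's credit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

After the failed transfer both balances are what they were before the transaction started, which is the "all or nothing" behaviour that atomicity describes.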

321 Types

While there are many different kinds of relational databases, the most popular are:

bull PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.

bull Oracle: A multi-model database management system that is produced and marketed by Oracle.

bull MySQL: An open-source relational database management system with a proprietary paid version available for additional functionality.

bull Microsoft SQL Server: A relational database management system developed by Microsoft.

bull MariaDB: A MySQL-compatible database engine forked from MySQL.

322 Advantages

bull SQL standards are well defined and commonly accepted

bull Easy to categorize and store data that can later be queried

bull Simple to understand since the table structure and relationships are intuitive to most users

bull Data integrity through strong data typing and validity checks to make sure that data falls within an acceptable range

323 Disadvantages

bull Poor performance with unstructured data types due to schema and type constraints

bull Can be slow and not scalable compared to NoSQL

bull Unable to map certain kinds of data such as graphs

33 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Scalability, multi-structured data and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

331 Types

bull Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to store user session data; examples include Redis and Amazon Dynamo.

bull Document Store: Similar to a key-value store, except that the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.

bull Column Store: In this implementation data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. Column stores can be used in content management systems; examples include Cassandra and Apache HBase.

bull Graph Store: These are based on the Entity - Attribute - Value model. Entities have associated attributes and subsequent values when data is inserted. Nodes store data about each entity along with the relationships between nodes. Graph stores can be used in applications such as social networks; examples include Neo4j and ArangoDB.
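
Three of the store types above (key-value, document and graph) can be loosely imitated with plain Python data structures. This is only a conceptual sketch under that assumption: it does not use the API of Redis, MongoDB, Neo4j or any other real database, and all keys and field names are made up for illustration.

```python
# Key-value store: an opaque value looked up by its key.
# (Python's dict is itself backed by a hash table.)
session_store = {}
session_store["session:42"] = "user=ada;expires=3600"

# Document store: the value is a structured document whose fields can be queried.
posts = {
    "post:1": {"title": "Hello", "tags": ["intro"], "author": {"name": "Ada"}},
    "post:2": {"title": "Graphs", "tags": ["nosql", "graph"], "author": {"name": "Bob"}},
}
nosql_posts = [key for key, doc in posts.items() if "nosql" in doc["tags"]]

# Graph store: nodes plus explicit relationships (edges) between them.
follows = {"ada": {"bob"}, "bob": {"carol"}, "carol": set()}
friends_of_friends = {fof for friend in follows["ada"] for fof in follows[friend]}

print(session_store["session:42"], nosql_posts, friends_of_friends)
```

The point of the comparison is the access pattern: lookup by key only, lookup by fields inside the value, or traversal along relationships.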

332 Advantages

bull High availability

bull Schema free or schema-on-read options

bull Ability to rapidly prototype applications

bull Elastic scalability

bull Can store massive amounts of data

333 Disadvantages

bull Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync

bull Less support and maturity in the NoSQL ecosystem

34 Terminology

Query: A query can be thought of as a single action that is taken on a database.

Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as data stored. The two main types are vertical scalability, which is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc., and horizontal scalability, which has to do with adding more machines and splitting the work amongst them.

Normalization: This is a technique of organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique of organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
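
The difference between normalization and denormalization can be illustrated without a database at all. The sketch below uses plain Python lists and dicts as stand-ins for tables; the orders and customers data is hypothetical.

```python
# Denormalized: the customer's city is repeated on every order row.
orders_denormalized = [
    {"order_id": 1, "customer": "ada", "city": "London", "total": 30},
    {"order_id": 2, "customer": "ada", "city": "London", "total": 12},
    {"order_id": 3, "customer": "bob", "city": "Paris", "total": 7},
]

# Normalized: split into two "tables" so that each fact is stored only once.
customers = {row["customer"]: {"city": row["city"]} for row in orders_denormalized}
orders = [
    {"order_id": row["order_id"], "customer": row["customer"], "total": row["total"]}
    for row in orders_denormalized
]

# Reading the combined view back requires a join on the customer key.
rejoined = [{**order, "city": customers[order["customer"]]["city"]} for order in orders]

print(customers)
print(rejoined == orders_denormalized)
```

In the normalized form a customer's city is updated in one place, at the cost of a join on every read; the denormalized form avoids the join but stores the city once per order.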

References

bull httpsdzonecomarticlesthe-types-of-modern-databases

bull httpswwwmongodbcomnosql-explained

bull httpswwwchannelfuturescomcloud-2the-limitations-of-nosql-database-storage-why-nosqls-not-perfect

bull httpsopensourceforucom201705different-types-nosql-databases

CHAPTER 4

Data Engineering

bull Introduction

bull Subj_1

41 Introduction

xxx

42 Subj_1

Code

Some code

print("hello")

References

CHAPTER 5

Data Wrangling

bull Introduction

bull Subj_1

51 Introduction

xxx

52 Subj_1

Code

Some code

print("hello")

References

CHAPTER 6

Data Visualization

bull Introduction

bull Subj_1

61 Introduction

xxx

62 Subj_1

Code

Some code

print("hello")

References

CHAPTER 7

Deep Learning

bull Introduction

bull Subj_1

71 Introduction

xxx

72 Subj_1

Code

Some code

print("hello")

References

CHAPTER 8

Machine Learning

bull Introduction

bull Subj_1

81 Introduction

xxx

82 Subj_1

Code

Some code

print("hello")

References

CHAPTER 9

Python

bull Introduction

bull Subj_1

91 Introduction

xxx

92 Subj_1

Code

Some code

print("hello")

References

CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References

CHAPTER 11

SQL

bull Introduction

bull Subj_1

111 Introduction

xxx

112 Subj_1

Code

Some code

print("hello")

References

CHAPTER 12

Glossary

Definitions of common machine learning terms

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

bull Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in low average difference.

bull High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.
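
The first bullet can be checked with a few numbers. In this small sketch the predictions and actual values are made up; every prediction is off by one, yet the errors cancel and the bias is zero. The mean absolute error is included only as a contrast and is not part of the definition above.

```python
predictions = [3.0, 5.0, 7.0, 9.0]  # hypothetical model outputs
actuals = [4.0, 4.0, 8.0, 8.0]      # hypothetical correct values

errors = [p - a for p, a in zip(predictions, actuals)]

# Bias: the average signed difference, so over- and under-predictions cancel.
bias = sum(errors) / len(errors)

# Mean absolute error: the average size of the miss, ignoring direction.
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(bias, mean_abs_error)
```

A bias of zero here does not mean the predictions are right, only that they are not systematically too high or too low.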

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green or red).

Classification Threshold: The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise return False.

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories.

bull True Positives: we correctly predicted they do have diabetes

bull True Negatives: we correctly predicted they don't have diabetes

bull False Positives: we incorrectly predicted they do have diabetes (Type I error)

bull False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
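
As a rough sketch, the four categories can be counted by hand once a classification threshold is applied. The probabilities and labels below are invented for illustration, and the 0.5 threshold mirrors the diabetes example used in this glossary.

```python
probabilities = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]          # hypothetical model outputs
has_diabetes = [True, True, True, False, False, False]  # hypothetical ground truth
threshold = 0.5

# Apply the classification threshold to turn probabilities into predictions.
predicted = [p > threshold for p in probabilities]
pairs = list(zip(predicted, has_diabetes))

true_positives = sum(1 for pred, actual in pairs if pred and actual)
true_negatives = sum(1 for pred, actual in pairs if not pred and not actual)
false_positives = sum(1 for pred, actual in pairs if pred and not actual)  # Type I error
false_negatives = sum(1 for pred, actual in pairs if not pred and actual)  # Type II error

# Accuracy is the share of predictions that fall in the two "correct" cells.
accuracy = (true_positives + true_negatives) / len(pairs)

print(true_positives, true_negatives, false_positives, false_negatives, accuracy)
```

Moving the threshold up or down trades false positives against false negatives without changing the underlying model.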

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep learning is derived from a machine learning algorithm called the perceptron, or multi-layer perceptron, which is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning is decades old, but today we have more and more data and cheap computing power, which make this algorithm powerful enough to achieve state-of-the-art accuracy. In the modern world this algorithm is known as the artificial neural network. Deep learning is much more than a traditional artificial neural network, but it was highly influenced by machine learning's neural networks and perceptron networks.

Dimension: Dimension in machine learning and data science differs from dimension in physics. Here, the dimension of data means how many features you have in your data set. For example, in an object detection application, the flattened image size and color channels (e.g. 28*28*3) are the features of the input set. In a house price prediction, if house size is the only feature in the data set, we call it 1-dimensional data.

Epoch One epoch is one complete pass of the algorithm over the entire dataset. The number of epochs is the number of times the algorithm sees the entire dataset during training.

Extrapolation Making predictions outside the range of a dataset. E.g. my dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. "Color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection The process of selecting relevant features from a dataset for creating a machine learning model.

Feature Vector A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters Higher-level properties of a model, such as how fast it can learn (learning rate) or how complex it is. The depth of trees in a decision tree and the number of hidden layers in a neural network are examples of hyperparameters.

Induction A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. we keep observing X, so we infer that Y must be true.

Instance A data point, row, or sample in a dataset. Another term for observation.

Learning Rate The size of the update steps to take during optimization loops like gradient descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
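The trade-off can be seen on a toy problem. The sketch below assumes a one-dimensional function f(x) = x², whose gradient is 2x; the function name `gradient_descent` and the three learning rates are illustrative choices, not values recommended by this primer.

```python
def gradient_descent(learning_rate, steps=20, start=10.0):
    """Minimize f(x) = x**2 (gradient 2 * x) and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # step against the gradient
    return x


low = gradient_descent(0.01)   # small steps: still far from the minimum at 0
good = gradient_descent(0.1)   # moderate steps: ends much closer to 0
high = gradient_descent(1.1)   # steps too large: overshoots and moves away from 0

print(low, good, high)
```

With the same number of steps, the very low rate has barely moved, while the very high rate jumps past the minimum on every update and ends up further away than it started.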

Loss A measure of the error between the true value (from the dataset) and the predicted value (from the ML model); in its simplest form, loss = true value - predicted value. The lower the loss, the better the model (unless the model has overfitted to the training data). The loss is calculated on the training and validation sets, and it is interpreted as how well the model is doing on those two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation set.
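As a concrete instance of summing per-example errors, here is a small sketch that uses the sum of squared errors. Squaring each error is an assumption added here (it stops positive and negative errors from cancelling); the primer itself does not prescribe a specific loss function, and the name `sum_squared_error` is hypothetical.

```python
def sum_squared_error(true_values, predicted_values):
    """Sum of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(true_values, predicted_values))


true_values = [3.0, 5.0, 2.0]
predicted_values = [2.5, 5.0, 4.0]
print(sum_squared_error(true_values, predicted_values))  # 4.25
```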

Machine Learning Contribute a definition

Model A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks Contribute a definition

Normalization Contribute a definition

Null Accuracy Baseline accuracy that can be achieved by always predicting the most frequent class ("B has the highest frequency, so let's guess B every time").
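A quick way to compute this baseline is to count the most frequent label. This is a minimal sketch; the function name `null_accuracy` and the example labels are assumptions for illustration.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


labels = ["B", "A", "B", "C", "B", "A", "B", "B"]
print(null_accuracy(labels))  # 0.625, since "B" appears 5 times out of 8
```

Any trained classifier should beat this number before its accuracy is considered meaningful.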

Observation A data point, row, or sample in a dataset. Another term for instance.

Overfitting Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set but poorly on your test set (or new real-world data).

Parameters Be the first to contribute

Precision In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

$$R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

Recall vs Precision Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the percentage of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren't helpful, however, because we missed every other brain tumor in the set. We were super precise when we tried, but we didn't try hard enough.

• Recall, or sensitivity, provides another lens with which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
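The brain scan example can be checked with a few lines of arithmetic. In this sketch the helper names `precision` and `recall` are assumptions; the counts come from the scenario above (100 scans, 10 tumors, one scan flagged and it was correct).

```python
def precision(true_positives, false_positives):
    """Share of positive predictions that were correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of actual positives that the model caught."""
    return true_positives / (true_positives + false_negatives)


# One tumor flagged correctly, no false alarms, nine tumors missed.
tp, fp, fn = 1, 0, 9
print(precision(tp, fp))  # 1.0
print(recall(tp, fn))     # 0.1
```

The two numbers pull in opposite directions here, which is why they are usually reported together.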

Regression Predicting a continuous output (e.g. price, sales).

Regularization Contribute a definition

Reinforcement Learning Training a model to maximize a reward via iterative trial and error

Segmentation Contribute a definition

Specificity In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$$
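To see how predicting everything as negative games this metric, consider a hypothetical set of 100 scans with 10 tumors. The counts and the helper name `specificity` below are assumptions for illustration.

```python
def specificity(true_negatives, false_positives):
    """Share of actual negatives that the model labeled as negative."""
    return true_negatives / (true_negatives + false_positives)


# Predict "No" for all 100 scans: 90 healthy scans are right, 10 tumors are missed.
tn, fp, tp, fn = 90, 0, 0, 10
spec = specificity(tn, fp)
rec = tp / (tp + fn)
print(spec)  # 1.0
print(rec)   # 0.0
```

Perfect specificity comes at the cost of zero recall, so neither metric should be read on its own.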

Supervised Learning Training a model using a labeled dataset

Test Set A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set A set of observations used to generate machine learning models
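One simple way to carve a dataset into training, validation, and test sets is to shuffle the rows and slice them. This is only a sketch: the 60/20/20 proportions, the fixed seed, and the function name `train_validation_test_split` are assumptions, and real projects often use a library helper instead.

```python
import random


def train_validation_test_split(rows, validation_fraction=0.2,
                                test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and slice it into three disjoint sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


train, validation, test = train_validation_test_split(range(10))
print(len(train), len(validation), len(test))  # 6 2 2
```

The test slice is set aside first so that it is never touched while the model is being trained or tuned.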

Transfer Learning Contribute a definition

Type 1 Error False positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and the company hires them, but they are actually bad.

Type 2 Error False negatives. The candidate was great, but the company passed on them.

Underfitting Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning Training a model to find patterns in an unlabeled dataset (e.g. clustering).

Validation Set A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.
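The "pause training" rule is often automated as early stopping. The sketch below assumes we already have a list of per-epoch validation errors; the function name `best_stopping_epoch` and the `patience` parameter are illustrative, not part of any specific library.

```python
def best_stopping_epoch(validation_errors, patience=2):
    """Return the epoch with the lowest validation error seen before
    the error has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error stopped improving: likely overfitting
    return best_epoch


# Hypothetical validation errors that fall and then rise again.
validation_errors = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8]
print(best_stopping_epoch(validation_errors))  # 2
```

A small patience value tolerates a noisy epoch or two before concluding that the model has started to overfit.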

Variance How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.

References

CHAPTER 13

Business

• Introduction

• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 14

Ethics

• Introduction

• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

References

CHAPTER 15

Mastering The Data Science Interview

• Introduction

• The Coding Challenge

• The HR Screen

• The Technical Call

• The Take Home Project

• The Onsite

• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that Data Science will be the sexiest job of the 21st Century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is that most of these jobs are for those who already have experience. Entry-level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from the social sciences to traditional computer science. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can vary hugely. This is partly because, as an industry, there still isn't an agreed-upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it's important to determine what flavor of data science appeals to you. Based on your answer, what you study and what questions you'll be asked will vary. Despite the differences between the types, they'll generally follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple FizzBuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.
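For readers who have not seen it, FizzBuzz sits at the easy end of that range. The sketch below assumes the common formulation of the problem (multiples of 3 become "Fizz", multiples of 5 become "Buzz", multiples of both become "FizzBuzz"); an actual challenge may word it differently.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


for number in range(1, 16):
    print(fizzbuzz(number))
```

Checking divisibility by 15 first avoids the classic mistake of returning "Fizz" for numbers that are multiples of both 3 and 5.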

More often than not, you'll be provided with written test cases that will tell you if you've passed or failed a question. This will typically consider both correctness and complexity (i.e. how long did it take to run your code). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.
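Writing your own tests can be as light as a handful of assertions covering the normal case and the edge cases. In this sketch both the function under test (`moving_average`) and the test helper (`run_tests`) are hypothetical stand-ins for whatever the challenge asks you to build.

```python
def moving_average(values, window):
    """Return the simple moving average over each full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def run_tests():
    """Check the normal case plus a few edge cases; return True if all pass."""
    assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]
    assert moving_average([5], 1) == [5.0]
    assert moving_average([1, 2], 3) == []  # window longer than the data
    try:
        moving_average([1, 2], 0)
    except ValueError:
        pass  # invalid window is rejected as intended
    else:
        raise AssertionError("expected ValueError for a zero window")
    return True


print(run_tests())
```

Edge cases such as empty results and invalid arguments are often exactly what hidden test cases probe.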

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the 'correct' solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally when, even after passing all the test cases, you didn't get to the next stage in the interview process.

Preparation

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first. When you hit a snag, move to a simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5 to 10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad-mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving dynamic programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
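The brute-force-then-optimize flow might look like the following. The problem used here (does any pair of numbers sum to a target?) is a hypothetical example chosen for brevity, not one taken from a particular interview, and both function names are made up for this sketch.

```python
def has_pair_brute_force(numbers, target):
    """First pass: check every pair. Simple to reason about, O(n^2) time."""
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if numbers[i] + numbers[j] == target:
                return True
    return False


def has_pair_optimized(numbers, target):
    """Second pass: remember values seen so far. O(n) time, O(n) extra space."""
    seen = set()
    for value in numbers:
        if target - value in seen:
            return True
        seen.add(value)
    return False


numbers = [2, 7, 11, 15]
print(has_pair_brute_force(numbers, 9), has_pair_optimized(numbers, 9))
```

Keeping the brute-force version around also gives you a reference to test the optimized version against.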

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking The Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops, since they tend to be more closely tied to what you'll be doing once you start working. They can occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g. here's a dataset, find some insights and pitch them to business stakeholders) or focus on a more concrete deliverable (e.g. here's some data, build a classifier).

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis (EDA). Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution in customer revenue).
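A first EDA pass for a heavy-tailed column can be done with the standard library alone. The sketch below is an assumption-laden illustration: the revenue figures are invented, `revenue_summary` is a hypothetical helper, and comparing mean, median, and the top customers' share is just one quick signal of a skewed distribution, not a formal power law test.

```python
import statistics


def revenue_summary(revenues, top_fraction=0.1):
    """Summarize a revenue column and the share held by the top customers."""
    ordered = sorted(revenues, reverse=True)
    top_n = max(1, int(len(ordered) * top_fraction))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "top_share": sum(ordered[:top_n]) / sum(ordered),
    }


# Hypothetical revenue per customer; one customer dominates the total.
revenues = [5, 8, 10, 12, 15, 20, 25, 40, 90, 775]
summary = revenue_summary(revenues)
print(summary)
```

A mean far above the median, together with a large top share, is a cue to look more closely at the distribution (for example on a log scale) before modeling.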

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation

1. Practice take-home challenges, which you can either purchase from datamasked or find by looking at the answers without the questions on this Github repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite. The rationale is that, despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more Leetcode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which typically evaluates your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions, such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioral interview questions such as 'Tell me about yourself', 'Why this company?', and 'Tell me about a time you had to deal with a difficult colleague'.

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips

1. Don't be too serious. Most of these interviewers would rather be back at their desk working on their assigned projects, so try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially for those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point, it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the option to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, the signing/relocation bonus is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401k matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.
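The multiplier effect of base salary is easy to check with compound arithmetic. All figures in this sketch are hypothetical (two starting salaries of 100,000 and 110,000, and a flat 3% annual raise), and `salary_after_raises` is an illustrative helper.

```python
def salary_after_raises(base_salary, annual_raise, years):
    """Base salary after compounding a fixed percentage raise each year."""
    return base_salary * (1 + annual_raise) ** years


lower = salary_after_raises(100_000, 0.03, 5)
higher = salary_after_raises(110_000, 0.03, 5)
gap = higher - lower
print(round(gap))  # the initial 10,000 gap has grown to about 11,593
```

Because raises are usually a percentage of base, the negotiated difference grows every year, whereas a one-time signing bonus does not.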

Preparation

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully this article has given you some insight into what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.

CHAPTER 16

Learning How To Learn

• Introduction

• Upgrading Your Learning Toolbox

• Common Learning Traps

• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall This strategy consists of quizzing yourself on topics you've been learning, with flashcards being the most common way of doing this. One study showed a retention difference of 46% between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spacing is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
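The 1, 3, 7, and 21 day spacing can be turned into concrete review dates. This sketch assumes each interval is counted from the day you first studied the material (the text does not say whether the gaps are cumulative), and the name `review_dates` is hypothetical.

```python
from datetime import date, timedelta

# Intervals from the text, assumed to be measured from the first study day.
REVIEW_OFFSETS_DAYS = (1, 3, 7, 21)


def review_dates(first_study_day):
    """Return the dates on which to quiz yourself again."""
    return [first_study_day + timedelta(days=d) for d in REVIEW_OFFSETS_DAYS]


schedule = review_dates(date(2019, 3, 19))
for day in schedule:
    print(day.isoformat())
```

Putting these dates straight into a calendar removes the need to remember when the next review is due.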

Chunking This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, the computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type led to students not being able to tell the difference between problems later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined the technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example Chemistry and Biology, where you'll be able to note both similarities and differences.

Deliberate Practice This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that deliberate practice is easier in some subjects than in others, the gold standard of deliberate practice generally consists of the following:

• A feedback loop in the form of a competition or a test

• A teacher who can guide you

• A uniquely tailored practice regimen

• Work outside of your comfort zone

• A concrete goal or performance target

• Full concentration while practicing

• Practice that builds or modifies skills

• A known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher An experienced teacher can mean the difference between breaking through and breaking down. As such, it's important to look for a teacher who:

• Keeps you up to date on your specific progress

• Provides practice exercises

• Directs your attention to the aspects of learning you should be focusing on

• Helps develop correct mental representations

• Provides feedback on what errors you're making

• Gives you an aspirational model for what good performance looks like

Dynamic Testing As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro Technique This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes, during which you focus on one single thing. Once the timer is up, you take a 5 or 10-minute break before starting again.
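The focus and break cycle can be laid out ahead of time as a simple schedule. This is an illustrative sketch only: `pomodoro_schedule` is a hypothetical helper, and minutes are counted from the start of the sitting rather than tied to a real clock.

```python
def pomodoro_schedule(sessions, focus_minutes=25, break_minutes=5):
    """Return (label, start_minute, end_minute) blocks for one sitting."""
    blocks, clock = [], 0
    for i in range(sessions):
        blocks.append(("focus", clock, clock + focus_minutes))
        clock += focus_minutes
        if i < sessions - 1:  # no break is scheduled after the final session
            blocks.append(("break", clock, clock + break_minutes))
            clock += break_minutes
    return blocks


for block in pomodoro_schedule(2):
    print(block)
```

Passing `focus_minutes=50, break_minutes=10` gives the longer variant mentioned above.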

16.3 Common Learning Traps

Authority Trap The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key to forming the right mental models.

Confirmation Bias Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to break through a tough conceptual learning challenge.

Fluency Illusion Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them, and then fail to get any of the answers correct when tested. Having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile the set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy, agitated, or upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to set aside what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.
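Step 4 mentions flashcards with spaced repetition but does not name a scheduling scheme. As one possible illustration, the sketch below uses a simple Leitner-style box system, which is an assumption rather than something the text prescribes. The names `Card`, `review`, and `due_cards` are hypothetical, as are the interval values.

```python
from dataclasses import dataclass

# Days to wait before the next review, per box.
# These interval values are illustrative assumptions, not taken from the text.
INTERVALS = [1, 2, 4, 8, 16]


@dataclass
class Card:
    front: str
    back: str
    box: int = 0      # index into INTERVALS; higher means better known
    due_day: int = 0  # day number on which the card is next due


def review(card, correct, today):
    """Promote the card one box on a correct answer, reset it to box 0 otherwise."""
    if correct:
        card.box = min(card.box + 1, len(INTERVALS) - 1)
    else:
        card.box = 0
    card.due_day = today + INTERVALS[card.box]
    return card


def due_cards(cards, today):
    """Return the cards that should be reviewed on the given day."""
    return [card for card in cards if card.due_day <= today]


if __name__ == "__main__":
    deck = [Card("Einstellung", "A mindset that blocks new solutions")]
    review(deck[0], correct=True, today=0)
    print(deck[0].box, deck[0].due_day)  # 1 2
```

Cards answered correctly move to boxes with longer intervals, so well-known material comes up less often while missed cards return the next day.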

Additional Resources

bull https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7

bull https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn

bull https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

bull https://www.coursera.org/learn/learning-how-to-learn

CHAPTER 17

Communication

bull Introduction

bull Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 18

Product

bull Introduction

bull Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References

CHAPTER 19

Stakeholder Management

bull Introduction

bull Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

bull Agriculture

bull Art

bull Biology

bull Chemistry/Materials Science

bull Climate/Weather

bull Complex Networks

bull Computer Networks

bull Data Challenges

bull Earth Science

bull Economics

bull Education

bull Energy

bull Finance

bull GIS

bull Government

bull Healthcare

bull Image Processing

bull Machine Learning

bull Museums

bull Music

bull Natural Language

bull Neuroscience

bull Physics

bull Psychology/Cognition

bull Public Domains

bull Search Engines

bull Social Networks

bull Social Sciences

bull Software

bull Sports

bull Time Series

bull Transportation

20.1 Agriculture

bull US Department of Agriculture's PLANTS Database

bull US Department of Agriculture's Nutrient Database

20.2 Art

bull Google's Quick Draw Sketch Dataset

20.3 Biology

bull 1000 Genomes

bull American Gut (Microbiome Project)

bull Broad Bioimage Benchmark Collection (BBBC)

bull Broad Cancer Cell Line Encyclopedia (CCLE)

bull Cell Image Library

bull Complete Genomics Public Data

bull EBI ArrayExpress

bull EBI Protein Data Bank in Europe

bull Electron Microscopy Pilot Image Archive (EMPIAR)

bull ENCODE project

bull Ensembl Genomes

bull Gene Expression Omnibus (GEO)

bull Gene Ontology (GO)

bull Global Biotic Interactions (GloBI)

bull Harvard Medical School (HMS) LINCS Project

bull Human Genome Diversity Project

bull Human Microbiome Project (HMP)

bull ICOS PSP Benchmark

bull International HapMap Project

bull Journal of Cell Biology DataViewer

bull MIT Cancer Genomics Data

bull NCBI Proteins

bull NCBI Taxonomy

bull NCI Genomic Data Commons

bull NIH Microarray data or FTP (see FTP link on RAW)

bull OpenSNP genotypes data

bull Pathguide - Protein-Protein Interactions Catalog

bull Protein Data Bank

bull Psychiatric Genomics Consortium

bull PubChem Project

bull PubGene (now Coremine Medical)

bull Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

bull Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

bull Sequence Read Archive (SRA)

bull Stanford Microarray Data

bull Stowers Institute Original Data Repository

bull Systems Science of Biological Dynamics (SSBD) Database

bull The Cancer Genome Atlas (TCGA) available via Broad GDAC

bull The Catalogue of Life

bull The Personal Genome Project or PGP

bull UCSC Public Data

bull UniGene

bull Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

bull NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

bull Open Quantum Materials Database

bull Citrination Public Datasets

bull Khazana Project

20.5 Climate/Weather

bull Actuaries Climate Index

bull Australian Weather

bull Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system

bull Brazilian Weather - Historical data (In Portuguese)

bull Canadian Meteorological Centre

bull Climate Data from UEA (updated monthly)

bull European Climate Assessment & Dataset

bull Global Climate Data Since 1929

bull NASA Global Imagery Browse Services

bull NOAA Bering Sea Climate

bull NOAA Climate Datasets

bull NOAA Realtime Weather Models

bull NOAA SURFRAD Meteorology and Radiation Datasets

bull The World Bank Open Data Resources for Climate Change

bull UEA Climatic Research Unit

bull WorldClim - Global Climate Data

bull WU Historical Weather Worldwide

20.6 Complex Networks

bull AMiner Citation Network Dataset

bull CrossRef DOI URLs

bull DBLP Citation dataset

bull DIMACS Road Networks Collection

bull NBER Patent Citations

bull Network Repository with Interactive Exploratory Analysis Tools

bull NIST complex networks data collection

bull Protein-protein interaction network

bull PyPI and Maven Dependency Network

bull Scopus Citation Database

bull Small Network Data

bull Stanford GraphBase (Steven Skiena)

bull Stanford Large Network Dataset Collection

bull Stanford Longitudinal Network Data Sources

bull The Koblenz Network Collection

bull The Laboratory for Web Algorithmics (UNIMI)

bull The Nexus Network Repository

bull UCI Network Data Repository

bull UFL sparse matrix collection

bull WSU Graph Database

20.7 Computer Networks

bull 3.5B Web Pages from CommonCrawl 2012

bull 53.5B Web clicks of 100K users in Indiana Univ

bull CAIDA Internet Datasets

bull ClueWeb09 - 1B web pages

bull ClueWeb12 - 733M web pages

bull CommonCrawl Web Data over 7 years

bull CRAWDAD Wireless datasets from Dartmouth Univ

bull Criteo click-through data

bull OONI Open Observatory of Network Interference - Internet censorship data

bull Open Mobile Data by MobiPerf

bull Rapid7 Sonar Internet Scans

bull UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

bull Bruteforce Database

bull Challenges in Machine Learning

bull CrowdANALYTIX dataX

bull D4D Challenge of Orange

bull DrivenData Competitions for Social Good

bull ICWSM Data Challenge (since 2009)

bull Kaggle Competition Data

bull KDD Cup by Tencent 2012

bull Localytics Data Visualization Challenge

bull Netflix Prize

bull Space Apps Challenge

bull Telecom Italia Big Data Challenge

bull TravisTorrent Dataset - MSR'2017 Mining Challenge

bull Yelp Dataset Challenge

20.9 Earth Science

bull AQUASTAT - Global water resources and uses

bull BODC - marine data of ~22K vars

bull Earth Models

bull EOSDIS - NASA's earth observing system data

bull Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3

bull Marinexplore - Open Oceanographic Data

bull Smithsonian Institution Global Volcano and Eruption Database

bull USGS Earthquake Archives

20.10 Economics

bull American Economic Association (AEA)

bull EconData from UMD

bull Economic Freedom of the World Data

bull Historical MacroEconomic Statistics

bull International Economics Database and various data tools

bull International Trade Statistics

bull Internet Product Code Database

bull Joint External Debt Data Hub

bull Jon Haveman International Trade Data Links

bull OpenCorporates Database of Companies in the World

bull Our World in Data

bull SciencesPo World Trade Gravity Datasets

bull The Atlas of Economic Complexity

bull The Center for International Data

bull The Observatory of Economic Complexity

bull UN Commodity Trade Statistics

bull UN Human Development Reports

20.11 Education

bull College Scorecard Data

bull Student Data from Free Code Camp

20.12 Energy

bull AMPds

bull BLUEd

bull COMBED

bull Dataport

bull DRED

bull ECO

bull EIA

bull HES - Household Electricity Study UK

bull HFED

bull iAWE

bull PLAID - the Plug Load Appliance Identification Dataset

bull REDD

bull Tracebase

bull UK-DALE - UK Domestic Appliance-Level Electricity

bull WHITED

20.13 Finance

bull CBOE Futures Exchange

bull Google Finance

bull Google Trends

bull NASDAQ

bull NYSE Market Data (see FTP link on RAW)

bull OANDA

bull OSU Financial data

bull Quandl

bull St Louis Federal

bull Yahoo Finance

20.14 GIS

bull ArcGIS Open Data portal

bull Cambridge, MA, US, GIS data on GitHub

bull Factual Global Location Data

bull Geo Spatial Data from ASU

bull Geo Wiki Project - Citizen-driven Environmental Monitoring

bull GeoFabrik - OSM data extracted to a variety of formats and areas

bull GeoNames Worldwide

bull Global Administrative Areas Database (GADM)

bull Homeland Infrastructure Foundation-Level Data

bull Landsat 8 on AWS

bull List of all countries in all languages

bull National Weather Service GIS Data Portal

bull Natural Earth - vectors and rasters of the world

bull OpenAddresses

bull OpenStreetMap (OSM)

bull Pleiades - Gazetteer and graph of ancient places

bull Reverse Geocoder using OSM data & additional high-resolution data files

bull TIGER/Line - US boundaries and roads

bull TwoFishes - Foursquare's coarse geocoder

bull TZ Timezones shapefiles

bull UN Environmental Data

bull World boundaries from the US Department of State

bull World countries in multiple formats

20.15 Government

bull A list of cities and countries contributed by community

bull Open Data for Africa

bull OpenDataSoft's list of 1,600 open data

20.16 Healthcare

bull EHDP Large Health Data Sets

bull Gapminder World demographic databases

bull Medicare Coverage Database (MCD), US

bull Medicare Data Engine of medicare.gov Data

bull Medicare Data File

bull MeSH, the vocabulary thesaurus used for indexing articles for PubMed

bull Number of Ebola Cases and Deaths in Affected Countries (2014)

bull Open-ODS (structure of the UK NHS)

bull OpenPaymentsData, Healthcare financial relationship data

bull The Cancer Genome Atlas project (TCGA) and BigQuery table

bull World Health Organization Global Health Observatory

20.17 Image Processing

bull 10k US Adult Faces Database

bull 2GB of Photos of Cats or Archive version

bull Adience Unfiltered faces for gender and age classification

bull Affective Image Classification

bull Animals with attributes

bull Caltech Pedestrian Detection Benchmark

bull Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)

bull Face Recognition Benchmark

bull GDXray X-ray images for X-ray testing and Computer Vision

bull ImageNet (in WordNet hierarchy)

bull Indoor Scene Recognition

bull International Affective Picture System UFL

bull Massive Visual Memory Stimuli MIT

bull MNIST database of handwritten digits, 70,000 examples

bull Several Shape-from-Silhouette Datasets

bull Stanford Dogs Dataset

bull SUN database MIT

bull The Action Similarity Labeling (ASLAN) Challenge

bull The Oxford-IIIT Pet Dataset

bull Violent-Flows - Crowd Violence / Non-violence Database and benchmark

bull Visual genome

bull YouTube Faces Database

20.18 Machine Learning

bull Context-aware data sets from five domains

bull Delve Datasets for classification and regression (Univ of Toronto)

bull Discogs Monthly Data

bull eBay Online Auctions (2012)

bull IMDb Database

bull Keel Repository for classification, regression, and time series

bull Labeled Faces in the Wild (LFW)

bull Lending Club Loan Data

bull Machine Learning Data Set Repository

bull Million Song Dataset

bull More Song Datasets

bull MovieLens Data Sets

bull New Yorker caption contest ratings

bull RDataMining - "R and Data Mining" ebook data

bull Registered Meteorites on Earth

bull Restaurants Health Score Data in San Francisco

bull UCI Machine Learning Repository

bull Yahoo Ratings and Classification Data

bull YouTube-8M

20.19 Museums

bull Canada Science and Technology Museums Corporation's Open Data

bull Cooper-Hewitt's Collection Database

bull Minneapolis Institute of Arts metadata

bull Natural History Museum (London) Data Portal

bull Rijksmuseum Historical Art Collection

bull Tate Collection metadata

bull The Getty vocabularies

20.20 Music

bull Nottingham Folk Songs

bull Bach 10

20.21 Natural Language

bull Automatic Keyphrase Extraction

bull Blogger Corpus

bull CLiPS Stylometry Investigation Corpus

bull ClueWeb09 FACC

bull ClueWeb12 FACC

bull DBpedia - 4.58M things with 583M facts

bull Flickr Personal Taxonomies

bull Freebase.com of people, places, and things

bull Google Books Ngrams (2.2TB)

bull Google MC-AFP, generated from the publicly available Gigaword dataset using Paragraph Vectors

bull Google Web 5gram (1TB, 2006)

bull Gutenberg eBooks List

bull Hansards text chunks of Canadian Parliament

bull Machine Comprehension Test (MCTest) of text from Microsoft Research

bull Machine Translation of European languages

bull Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

bull Multi-Domain Sentiment Dataset (version 2.0)

bull Open Multilingual Wordnet

bull Personae Corpus

bull SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

bull SMS Spam Collection in English

bull Universal Dependencies

bull USENET postings corpus of 2005~2011

bull Webhose - News/Blogs in multiple languages

bull Wikidata - Wikipedia databases

bull Wikipedia Links data - 40 Million Entities in Context

bull WordNet databases and tools

20.22 Neuroscience

bull Allen Institute Datasets

bull Brain Catalogue

bull Brainomics

bull CodeNeuro Datasets

bull Collaborative Research in Computational Neuroscience (CRCNS)

bull FCP-INDI

bull Human Connectome Project

bull NDAR

bull NeuroData

bull Neuroelectro

bull NIMH Data Archive

bull OASIS

bull OpenfMRI

bull Study Forrest

20.23 Physics

bull CERN Open Data Portal

bull Crystallography Open Database

bull NASA Exoplanet Archive

bull NSSDC (NASA) data of 550 spacecraft

bull Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

bull OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

bull Amazon

bull Archive-it from Internet Archive

bull Archive.org Datasets

bull CMU JASA data archive

bull CMU StatLab collections

bull Data.World

bull Data360

bull Datamob.org

bull Google

bull Infochimps

bull KDNuggets Data Collections

bull Microsoft Azure Data Market Free DataSets

bull Microsoft Data Science for Research

bull Numbrary

bull Open Library Data Dumps

bull Reddit Datasets

bull RevolutionAnalytics Collection

bull Sample R data sets

bull Stats4Stem R data sets

bull StatSci.org

bull The Washington Post List

bull UCLA SOCR data collection

bull UFO Reports

bull Wikileaks 9/11 pager intercepts

bull Yahoo Webscope

20.26 Search Engines

bull Academic Torrents of data sharing from UMB

bull Datahub.io

bull DataMarket (Qlik)

bull Harvard Dataverse Network of scientific data

bull ICPSR (UMICH)

bull Institute of Education Sciences

bull National Technical Reports Library

bull Open Data Certificates (beta)

bull OpenDataNetwork - A search engine of all Socrata powered data portals

bull Statista.com - Statistics and Studies

bull Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

bull 72 hours #gamergate Twitter Scrape

bull Ancestry.com Forum Dataset over 10 years

bull Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

bull CMU Enron Email of 150 users

bull EDRM Enron EMail of 151 users hosted on S3

bull Facebook Data Scrape (2005)

bull Facebook Social Networks from LAW (since 2007)

bull Foursquare from UMN/Sarwat (2013)

bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull YouTube Video Social Graph in 2007, 2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive

bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (ND-GAIN)

bull Open Crime and Policing Data in England, Wales, and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1 from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database

bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams, and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks, port calls, and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport, airline, and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome machine learning.

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python

bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF)

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java, and MATLAB interfaces and supports Windows, Linux, Android, and Mac OS

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality, with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph

bull CUDA - This is a fast C++/CUDA implementation of convolutional [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework with less than 1000 lines of core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state-of-the-art machine learning easy to work with and integrate into existing applications

bull Distributed Machine learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that change for every training instance. Written in C++ with bindings in Python

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. Library provides algorithmic building blocks for all stages of data analytics and allows data to be processed in batch, online, and distributed modes

bull LightGBM - A framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization, and storage

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ for segmenting/labeling sequential data & other Natural Language Processing tasks

bull CRFsuite for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia (https://github.com/LanguageMachines/libfolia) - C++ library for the FoLiA format

bull MeTA (https://github.com/meta-toolkit/meta) - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user-defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl - Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression, and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based, R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer - Stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression

General-Purpose Machine Learning

bull gago - Multi-population, flexible, parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering, and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Cortical.io as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System

bull Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text, starting with seed sets, using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools to social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK components in Java and is built on top of Apache UIMA

bull Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license

bull CogcompNLP developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLLib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - Statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems; it offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application

Speech Recognition

bull CMU Sphinx - Open Source Toolkit For Speech Recognition purely based on Java speech recognition library

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in javascript and coffeescript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3d plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - Customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - Library written on Vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnet.js - ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means and agglomerative clustering

bull Node-fann bindings for Node.js

bull Kmeans.js - Simple Javascript implementation of the k-means algorithm, for node.js and the browser

bull LDA.js - LDA topic modeling for node.js

bull Learning.js - Javascript implementation of logistic regression/c4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration; just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has an advantage on large datasets and multi-threaded training

bull kalman - Kalman filter for Javascript

bull shaman - node.js library with support for both simple and multiple linear regression

bull ml.js - Machine learning and numerical analysis tools for Node.js and the Browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

bull simple-statistics as well as in node.js

bull regression-js - A javascript library containing a collection of least squares fitting methods for finding a trend in a set of data

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression, so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models mixed-effects models

bull Simple MCMC - Basic MCMC sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata, SAS, and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version

bull graph - Graph package for Torch

bull randomkit - Numpy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - Framework for torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua, and then optimizing them with OpenGM

bull sphagetti module for torch7 by MichaelMathieu

bull LuaSHKit - A Lua wrapper around the Locality sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - A technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object orientated environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning with PDFs, lectures and code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB

Data Analysis / Data Visualization

bull matlab_gbl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET and also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework

bull DiffSharp for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization

bull Vulpes - Deep belief and deep learning implementation written in F# and leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and every day use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin

bull Sho to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples by trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift

bull BPN-NeuralNetwork - This network can be used in product recommendation, user behavior analysis, data mining and data analysis

bull Multi-Perceptron-NeuralNetwork and designed unlimited-hidden-layers

bull KRHebbian-Algorithm in neural network of Machine Learning

bull KRKmeans-Algorithm - It implements K-Means, the clustering and classification algorithm. It could be used in data mining and image compression

bull KRFuzzyCMeans-Algorithm the fuzzy clustering classification algorithm on Machine Learning. It could be used in data mining and image compression

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written on Python and runs on Mac, Windows, and Ubuntu Linux

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing, machine learning, among others

bull Quepy - A python framework to transform natural language questions to queries in a database query language

bull TextBlob tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Field

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl (https://github.com/proycon/pynlpl) - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments

bull python-ucto

bull python-frog

bull python-zpar (https://github.com/EducationalTestingService/python-zpar) - Python bindings for ZPar, a statistical part-of-speech-tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - higher-level NLP built on spaCy

bull stanford-corenlp-python (https://github.com/dasmith/stanford-corenlp-python) - Python wrapper for Stanford CoreNLP

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code, and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning

bull machine learning (https://github.com/jeff1evesque/machine-learning) - automated build consisting of a web-interface (https://github.com/jeff1evesque/machine-learning#web-interface) and set of programmatic-interface are stored into a NoSQL datastore

bull XGBoost Library

bull Bayesian Methods for Hackers - Book/iPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLLib machine learning models as realtime, batch or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 (https://github.com/lisa-lab/pylearn2) - A Machine Learning library based on Theano

bull keras (https://github.com/fchollet/keras) - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast flexible and fun neural networks This is the successor of PyBrain

bull Crab - A exible fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks (https://github.com/williamFalcon/pix2pix-keras) - Implementation of image-to-image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under sampling and over sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability and speed in mind

bull breze - Theano based library for deep and recurrent neural networks

bull pyhsmm - Focuses on the Bayesian nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - A package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing GPU-meta-programming code generating array oriented optimizing math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon (https://github.com/NervanaSystems/neon) - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for my book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning

bull pgmpy - A Python library for working with Probabilistic Graphical Models

bull DIGITS - A web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides better user experience

bull rgf_python Library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more; it's easy to use and easy to learn for beginners

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics, science and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Compiler to LLVM aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py (https://github.com/mikedewar/d3py) - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing, submitting, debugging and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual, intuitive and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs and clusters, has Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in Python without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspel - raspell is an interface binding for ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof (https://github.com/kevincobain2000/listof) - Community based data collection, packed in gem. Get list of pretty much anything (stop words, countries, non words) in txt, json or hash. Demo/Search for a list

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition and ReLU, under the MIT license

bull rustlearn - A machine learning framework featuring logistic regression, support vector machines, decision trees and random forests

bull rusty-machine - A pure-Rust machine learning library

bull leaf (https://github.com/autumnai/leaf) - Open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz: Regularization for semiparametric additive hazards regression

bull arules - arules: Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso: Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets

bull bigRR (http://cran.r-project.org/web/packages/bigRR/index.html) - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm: Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta: A wrapper algorithm for all-relevant feature selection

bull bst - bst: Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training: Unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn: Classification, regression, feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist: Rule- and Instance-Based Regression Modeling

bull e1071 TU Wien

bull earth - earth: Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning: Data Mining, Inference and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree: Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS and neural network models from the "forecast" package

bull fpc - fpc: Flexible procedures for clustering

bull frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS

bull gbm - gbm: Generalized Boosted Regression Models

bull glmnet - glmnet: Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso: Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg: Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel and distributed machine learning algorithms at scale: Deeplearning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda: Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred: Improved Predictors

bull kernlab - kernlab: Kernel-based Machine Learning Lab

bull klaR - klaR: Classification and visualization

bull lars - lars: Least Angle Regression, Lasso and Forward Stagewise

bull lasso2 - lasso2: L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg: Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree: Mapping, pruning and graphing tree models

bull mboost - mboost: Model-Based Boosting

bull medley - medley: Blending regression models, using a greedy stepwise approach

bull mlr - mlr: Machine Learning in R

bull mvpart - mvpart: Multivariate partitioning

bull ncvreg - ncvreg: Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree: Oblique Trees for Classification Data

bull pamr - pamr: Pam: prediction analysis for microarrays

bull party - party: A Laboratory for Recursive Partytioning

bull partykit - partykit: A Toolkit for Recursive Partytioning

bull penalized - Penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA: Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM: Feature Selection SVM using penalty functions

bull quantregForest - quantregForest: Quantile Regression Forests

bull randomForest - randomForest: Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle: Graphical user interface for data mining in R

bull rda - rda: Shrunken Centroids Regularized Discriminant Analysis

bull rdetools in Feature Spaces

bull REEMtree Data

bull relaxo - relaxo: Relaxed Lasso

bull rgenoud - rgenoud: R version of GENetic Optimization Using Derivatives

bull rgp - rgp: R genetic programming framework

bull Rmalschains in R

bull rminer in classification and regression

bull ROCR - ROCR: Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets: Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart: Recursive Partitioning and Regression Trees

bull RPMM - RPMM: Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka: R/Weka interface

bull RXshrink - RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda: Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA: Stepwise Diagonal Discriminant Analysis

bull SuperLearner (https://github.com/ecpolley/SuperLearner) and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath: the SVM Path algorithm

bull tgp - tgp: Bayesian treed Gaussian process models

bull tree - tree: Classification and regression trees

bull varSelRF - varSelRF: Variable selection using random forests

bull XGBoostR Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis / Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis / Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE! Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference

Data Analysis / Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data. A unix filter for algebird-powered aggregation

bull PredictionIO - PredictionIO, a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe - Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

21.25 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression and Ridge Regression

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc.

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization / Training Techniques

ndash Unsupervised / Generative Models

ndash Image: Segmentation / Object Detection

ndash Image / Video

ndash Natural Language Processing

ndash Speech / Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

22.1 Machine Learning

Be the first to contribute

22.2 Deep Learning

Forked from terryum's awesome deep learning papers

22.2.1 Understanding

bull Distilling the knowledge in a neural network (2015) G Hinton et al [pdf]

bull Deep neural networks are easily fooled: High confidence predictions for unrecognizable images (2015) A Nguyen et al [pdf]

bull How transferable are features in deep neural networks (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf: An astounding baseline for recognition (2014) A Razavian et al [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014) M Oquab et al [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

22.2.2 Optimization / Training Techniques

bull Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015) S Ioffe and C Szegedy [pdf]

bull Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015) K He et al [pdf]

bull Dropout A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]

22.2.3 Unsupervised / Generative Models

bull Pixel recurrent neural networks (2016) A Oord et al [pdf]

bull Improved techniques for training GANs (2016) T Salimans et al [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015) A Radford et al [pdf]

bull DRAW A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

22.2.4 Image: Segmentation / Object Detection

bull You only look once Unified real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

22.2.5 Image / Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

22.2.6 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

22.2.7 Speech / Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

22.2.8 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

22.2.9 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization: Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22.2.10 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009)H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses, and more, forked from josephmisiti's awesome machine learning.

• Blogs

– Data Science

– Machine learning

– Math

• Books

– Machine learning

– Deep learning

– Probability & Statistics

– Linear Algebra

• Courses

• Podcasts

• Tutorials

23.1 Blogs

23.1.1 Data Science

• https://jeremykun.com

• http://iamtrask.github.io

• http://blog.explainmydata.com

• http://andrewgelman.com

• http://simplystatistics.org

• http://www.evanmiller.org

• http://jakevdp.github.io

• http://blog.yhat.com

• http://wesmckinney.com

• http://www.overkillanalytics.net

• http://newton.cx/~peter

• http://mbakker7.github.io/exploratory_computing_with_python

• https://sebastianraschka.com/blog/index.html

• http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

• http://colah.github.io

• http://www.thomasdimson.com

• http://blog.smellthedata.com

• https://sebastianraschka.com

• http://dogdogfish.com

• http://www.johnmyleswhite.com

• http://drewconway.com/zia

• http://bugra.github.io

• http://opendata.cern.ch

• https://alexanderetz.com

• http://www.sumsar.net

• https://www.countbayesie.com

• http://blog.kaggle.com

• http://www.danvk.org

• http://hunch.net

• http://www.randalolson.com/blog

• https://www.johndcook.com/blog/r_language_for_programmers

• http://www.dataschool.io

23.1.2 Machine learning

• OpenAI

• Distill

• Andrej Karpathy Blog

• Colah's Blog

• WildML

• FastML

• TheMorningPaper

23.1.3 Math

• http://www.sumsar.net

• http://allendowney.blogspot.ca

• https://healthyalgorithms.com

• https://petewarden.com

• http://mrtz.org/blog

23.2 Books

23.2.1 Machine learning

• Real World Machine Learning [Free Chapters]

• An Introduction To Statistical Learning - Book + R Code

• Elements of Statistical Learning - Book

• Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

• Think Bayes - Book + Python Code

• Information Theory, Inference, and Learning Algorithms

• Gaussian Processes for Machine Learning

• Data Intensive Text Processing w/ MapReduce

• Reinforcement Learning - An Introduction

• Mining Massive Datasets

• A First Encounter with Machine Learning

• Pattern Recognition and Machine Learning

• Machine Learning & Bayesian Reasoning

• Introduction to Machine Learning - Alex Smola and S.V.N. Vishwanathan

• A Probabilistic Theory of Pattern Recognition

• Introduction to Information Retrieval

• Forecasting: principles and practice

• Practical Artificial Intelligence Programming in Java

• Introduction to Machine Learning - Amnon Shashua

• Reinforcement Learning

• Machine Learning

• A Quest for AI

• Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M. Lynch

• Bayesian Modeling, Inference and Prediction

• A Course in Machine Learning

• Machine Learning, Neural and Statistical Classification

• Bayesian Reasoning and Machine Learning - Book + Matlab ToolBox

• R Programming for Data Science

• Data Mining - Practical Machine Learning Tools and Techniques - Book

23.2.2 Deep learning

• Deep Learning - An MIT Press book

• Coursera Course Book on NLP

• NLTK

• NLP w/ Python

• Foundations of Statistical Natural Language Processing

• An Introduction to Information Retrieval

• A Brief Introduction to Neural Networks

• Neural Networks and Deep Learning

23.2.3 Probability & Statistics

• Think Stats - Book + Python Code

• From Algorithms to Z-Scores - Book

• The Art of R Programming

• Introduction to statistical thought

• Basic Probability Theory

• Introduction to probability - By Dartmouth College

• Principle of Uncertainty

• Probability & Statistics Cookbook

• Advanced Data Analysis From An Elementary Point of View

• Introduction to Probability - Book and course by MIT

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction - Book

• An Introduction to Statistical Learning with Applications in R - Book

• Learning Statistics Using R

• Introduction to Probability and Statistics Using R - Book

• Advanced R Programming - Book

• Practical Regression and Anova using R - Book

• R practicals - Book

• The R Inferno - Book

23.2.4 Linear Algebra

• Linear Algebra Done Wrong

• Linear Algebra, Theory, and Applications

• Convex Optimization

• Applied Numerical Computing

• Applied Numerical Linear Algebra

23.3 Courses

• CS231n, Convolutional Neural Networks for Visual Recognition, Stanford University

• CS224d, Deep Learning for Natural Language Processing, Stanford University

• Oxford Deep NLP 2017, Deep Learning for Natural Language Processing, University of Oxford

• Artificial Intelligence (Columbia University) - free

• Machine Learning (Columbia University) - free

• Machine Learning (Stanford University) - free

• Neural Networks for Machine Learning (University of Toronto) - free

• Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach, Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering & Retrieval, Machine Learning: Recommender Systems & Dimensionality Reduction, Machine Learning Capstone: An Intelligent Application with Deep Learning; free

• Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

• Learning from Data (by Yaser S. Abu-Mostafa, Caltech) - Lecture videos available

23.4 Podcasts

• The O'Reilly Data Show

• Partially Derivative

• The Talking Machines

• The Data Skeptic

• Linear Digressions

• Data Stories

• Learning Machines 101

• Not So Standard Deviations

• TWIMLAI

23.5 Tutorials

Be the first to contribute!

CHAPTER 24

Contribute

Become a contributor! Check out our GitHub for more information.

• Can be slow and not scalable compared to NoSQL

• Unable to map certain kinds of data, such as graphs

3.3 Non-relational/NoSQL

Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Additionally, scalability, multi-structured data, and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

3.3.1 Types

• Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to store user session data; examples include Redis and Amazon Dynamo.

• Document Store: Similar to a key-value store, except that the value contains structured or semi-structured data and is referred to as a document. A use case for a document store could be a blogging platform. Examples of document stores include MongoDB and Apache CouchDB.

• Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. These can be used in content management systems; examples include Cassandra and Apache HBase.

• Graph Store: These are based on the Entity - Attribute - Value model. Entities will have associated attributes and subsequent values when data is inserted. Nodes will store data about each entity, along with the relationships between nodes. Graph stores can be used in applications such as social networks; examples include Neo4j and ArangoDB.
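The difference between the first two store types can be sketched with plain Python dictionaries, which are themselves hash tables. This is only an illustration, not how Redis or MongoDB are implemented; the names `session_store`, `posts`, and `find_by_tag` are made up for the example.

```python
# Key-value store: an opaque value looked up by a unique key.
session_store = {}
session_store["session:42"] = "user=alice;cart=3"

# Document store: the value is itself structured, so its fields can be queried.
posts = {
    "post:1": {"title": "Hello NoSQL", "tags": ["databases", "intro"], "author": "alice"},
    "post:2": {"title": "Graph stores", "tags": ["databases", "graphs"], "author": "bob"},
}


def find_by_tag(documents, tag):
    """Return the ids of documents whose 'tags' field contains the given tag."""
    return [doc_id for doc_id, doc in documents.items() if tag in doc["tags"]]


graph_posts = find_by_tag(posts, "graphs")
print(session_store["session:42"])
print(graph_posts)
```

With a pure key-value store, the only supported lookup is by key. The document store can additionally filter on fields inside the value, which is why it suits content such as blog posts.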

3.3.2 Advantages

• High availability

• Schema free or schema-on-read options

• Ability to rapidly prototype applications

• Elastic scalability

• Can store massive amounts of data

3.3.3 Disadvantages

• Since most NoSQL databases use eventual consistency instead of ACID, there may be a risk that data may be out of sync

• Less support and maturity in the NoSQL ecosystem

3.4 Terminology

Query: A query can be thought of as a single action that is taken on a database.

Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how databases handle an increase in transactions as well as data stored. There are two main types. Vertical scalability is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc. Horizontal scalability has to do with adding more machines and splitting the work amongst them.

Normalization: This is a technique of organizing tables within a relational database. It involves splitting up data into separate tables to reduce redundancy and improve data integrity.

Denormalization: This is a technique of organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries.
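To make normalization, JOINs, and transactions concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their columns are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: customer details live in one table, orders reference them by id.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'Ada')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 1, 9.5), (2, 1, 20.0)])
conn.commit()

# Reading the data back needs a JOIN. A denormalized table would instead
# repeat the customer name on every order row.
rows = cur.execute(
    "SELECT c.name, o.total FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()

# Transaction: both inserts succeed together or neither is kept (atomicity).
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders VALUES (3, 1, 5.0)")
        conn.execute("INSERT INTO orders VALUES (3, 1, 7.0)")  # duplicate primary key
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rows, order_count)
```

The second insert violates the primary key, so the whole transaction is rolled back and the first insert of order 3 is discarded along with it.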

References

• https://dzone.com/articles/the-types-of-modern-databases

• https://www.mongodb.com/nosql-explained

• https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect

• https://opensourceforu.com/2017/05/different-types-nosql-databases

CHAPTER 4

Data Engineering

• Introduction

• Subj_1

4.1 Introduction

xxx

4.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 5

Data Wrangling

• Introduction

• Subj_1

5.1 Introduction

xxx

5.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 6

Data Visualization

• Introduction

• Subj_1

6.1 Introduction

xxx

6.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 7

Deep Learning

• Introduction

• Subj_1

7.1 Introduction

xxx

7.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 8

Machine Learning

• Introduction

• Subj_1

8.1 Introduction

xxx

8.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 9

Python

• Introduction

• Subj_1

9.1 Introduction

xxx

9.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 10

Statistics

Basic concepts in statistics for machine learning

References

CHAPTER 11

SQL

• Introduction

• Subj_1

11.1 Introduction

xxx

11.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 12

Glossary

Definitions of common machine learning terms.

Accuracy: Percentage of correct predictions made by the model.

Algorithm: A method, function, or series of instructions used to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Attribute: A quality describing an observation (e.g. color, size, weight). In Excel terms, these are column headers.

Bias metric: What is the average difference between your predictions and the correct value for that observation?

• Low bias could mean every prediction is correct. It could also mean half of your predictions are above their actual values and half are below, in equal proportion, resulting in low average difference.

• High bias (with low variance) suggests your model may be underfitting and you're using the wrong architecture for the job.

Bias term: Allows models to represent patterns that do not pass through the origin. For example, if all my features were 0, would my output also be zero? Is it possible there is some base value upon which my features have an effect? Bias terms typically accompany weights and are attached to neurons or filters.

Categorical Variables: Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn't matter).

Classification: Predicting a categorical output (e.g. yes or no; blue, green, or red).

Classification Threshold: The lowest probability value at which we're comfortable asserting a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise return False.

Clustering: Unsupervised grouping of data into buckets.

Confusion Matrix: Table that describes the performance of a classification model by grouping predictions into 4 categories.

• True Positives: we correctly predicted they do have diabetes

• True Negatives: we correctly predicted they don't have diabetes

• False Positives: we incorrectly predicted they do have diabetes (Type I error)

• False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
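As a rough sketch of how a classification threshold turns predicted probabilities into the four confusion matrix counts, consider the following. The function name `confusion_counts` and the sample labels and probabilities are invented for illustration.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, TN, FP, FN for binary labels given predicted probabilities."""
    tp = tn = fp = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = prob > threshold  # positive only above the threshold
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1  # Type I error
        elif not predicted and actual:
            fn += 1  # Type II error
        else:
            tn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


y_true = [1, 0, 1, 0, 1, 0]              # 1 = has diabetes
y_prob = [0.9, 0.2, 0.4, 0.7, 0.8, 0.1]  # model's predicted probabilities

counts = confusion_counts(y_true, y_prob)
lenient = confusion_counts(y_true, y_prob, threshold=0.3)
print(counts)
print(lenient)
```

Lowering the threshold from 0.5 to 0.3 turns the one false negative in this sample into a true positive, which shows how the threshold trades one kind of error against the other.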

Continuous Variables: Variables with a range of possible values defined by a number scale (e.g. sales, lifespan).

Deduction: A top-down approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.

Deep Learning: Deep learning is derived from a machine learning algorithm called the perceptron, or multi-layer perceptron, which is gaining more and more attention because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Like many other AI algorithms, deep learning is decades old, but today we have far more data and cheap computing power, which makes it powerful enough to achieve state-of-the-art accuracy. This family of algorithms is also known as artificial neural networks. Deep learning is much more than a traditional artificial neural network, but it was heavily influenced by machine learning's neural networks and perceptron networks.

Dimension: Dimension in machine learning and data science differs from its meaning in physics. Here, the dimension of data means how many features you have in your data set. E.g. in an object detection application, the flattened image size and color channels (e.g. 28 × 28 × 3) are the features of the input set. In house price prediction, (maybe) house size is the only feature in the data set, so we call it 1-dimensional data.

Epoch: An epoch describes the number of times the algorithm sees the entire data set.

Extrapolation: Making predictions outside the range of a dataset. E.g. My dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature: With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute. “Color is blue” is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection: Feature selection is the process of selecting relevant features from a data set for creating a machine learning model.

Feature Vector: A list of features describing an observation with multiple attributes. In Excel we call this a row.

Hyperparameters: Hyperparameters are higher-level properties of a model, such as how fast it can learn (learning rate) or the complexity of a model. The depth of trees in a Decision Tree or the number of hidden layers in a Neural Network are examples of hyperparameters.

Induction: A bottom-up approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. We keep observing X, so we infer that Y must be True.

Instance: A data point, row, or sample in a dataset. Another term for observation.

Learning Rate: The size of the update steps to take during optimization loops like gradient_descent. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
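The effect of the learning rate can be seen in a small sketch of gradient descent on the toy function f(x) = x². The function, the starting point, and the three learning rates are all chosen for illustration only.

```python
def gradient_descent(learning_rate, steps=20, start=10.0):
    """Minimize f(x) = x**2, whose gradient is 2*x, and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # step against the gradient
    return x


slow = gradient_descent(0.01)     # small steps: still far from the minimum at 0
good = gradient_descent(0.1)      # moderate steps: ends close to 0
diverged = gradient_descent(1.1)  # too large: each step overshoots and |x| grows

print(slow, good, diverged)
```

After the same number of steps, the very low rate has barely moved, the moderate rate is near the minimum, and the high rate has overshot so badly that it ends further from the minimum than where it started.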

Loss: Loss = true_value (from data set) - predicted_value (from ML model). The lower the loss, the better a model (unless the model has over-fitted to the training data). The loss is calculated on training and validation, and its interpretation is how well the model is doing for these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation sets.

Machine Learning: Contribute a definition!

Model: A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Neural Networks: Contribute a definition!

Normalization: Contribute a definition!

Null Accuracy: Baseline accuracy that can be achieved by always predicting the most frequent class (“B has the highest frequency, so let's guess B every time”).
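A minimal sketch of computing null accuracy from a list of class labels follows. The helper name `null_accuracy` and the sample labels are hypothetical.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


labels = ["B", "A", "B", "B", "C", "B", "A", "B"]
baseline = null_accuracy(labels)
print(baseline)
```

Here "B" appears in 5 of 8 observations, so any useful classifier should beat an accuracy of 5/8 on this data.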

Observation: A data point, row, or sample in a dataset. Another term for instance.

Overfitting: Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set, but poorly on your test set (or new real-world data).

Parameters: Be the first to contribute!

Precision: In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. “Yes”). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

$$P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall: Also called sensitivity. In the context of binary classification (Yes/No), recall measures how “sensitive” the classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many did we “catch”? We could game this metric by always classifying observations as positive.

$$R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

Recall vs Precision: Say we are analyzing brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and our model starts guessing.

• Precision is the % of True guesses that were actually correct. If we guess 1 image is True out of 100 images and that image is actually True, then our precision is 100%. Our results aren't helpful, however, because we missed 9 other brain tumors. We were super precise when we tried, but we didn't try hard enough.

• Recall, or Sensitivity, provides another lens with which to view how good our model is. Again, let's say there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
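The brain scan example can be checked with a few lines of arithmetic. This sketch assumes the scenario described above: 100 images, 10 with tumors, and a single True guess that was correct, giving 1 true positive, 0 false positives, 9 false negatives, and 90 true negatives.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model caught."""
    return tp / (tp + fn)


tp, fp, fn, tn = 1, 0, 9, 90

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.0%} recall={r:.0%}")
```

The single confident guess yields perfect precision, while recall shows that only one in ten tumors was found.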

Regression: Predicting a continuous output (e.g. price, sales).

Regularization: Contribute a definition!

Reinforcement Learning: Training a model to maximize a reward via iterative trial and error.

Segmentation: Contribute a definition!

Specificity: In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. “No”). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric if we predict everything as negative.

$$S = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$$

Supervised Learning: Training a model using a labeled dataset.

Test Set: A set of observations used at the end of model training and validation to assess the predictive power of your model. How generalizable is your model to unseen data?

Training Set: A set of observations used to generate machine learning models.

Transfer Learning: Contribute a definition!

Type 1 Error: False Positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and they hire him, but he is actually bad.

Type 2 Error: False Negatives. The candidate was great, but the company passed on him.

Underfitting: Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Universal Approximation Theorem: A neural network with one hidden layer can approximate any continuous function, but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work well for inputs in the same range, but you can't expect it to generalize to other inputs without retraining the model or adding more hidden neurons.

Unsupervised Learning: Training a model to find patterns in an unlabeled dataset (e.g. clustering).

Validation Set: A set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.
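One common way to act on rising validation error is early stopping with a patience counter. The sketch below is an assumption layered on top of the definition above, not something this primer prescribes; the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss has
    failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation error keeps rising: stop training
    return best_epoch


# Validation loss per epoch: improves, then starts rising.
val_losses = [0.90, 0.70, 0.60, 0.65, 0.72, 0.50]
stop_at = early_stopping_epoch(val_losses)
print(stop_at)
```

Training stops after two epochs without improvement, so the final value in the list is never reached. That is the trade-off of a small patience value.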

Variance: How tightly packed are your predictions for a particular observation relative to each other?

• Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.

• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.

References

CHAPTER 13

Business

• Introduction

• Subj_1

13.1 Introduction

xxx

13.2 Subj_1

Code

Some code

print("hello")

References

CHAPTER 14

Ethics

• Introduction

• Subj_1

14.1 Introduction

xxx

14.2 Subj_1

xxx

References

CHAPTER 15

Mastering The Data Science Interview

• Introduction

• The Coding Challenge

• The HR Screen

• The Technical Call

• The Take Home Project

• The Onsite

• The Offer And Negotiation

15.1 Introduction

In 2012, Harvard Business Review announced that Data Science will be the sexiest job of the 21st Century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, the reality is most of these jobs are for those who already have experience. Entry level data science jobs, on the other hand, are extremely competitive due to the supply/demand dynamics. Data scientists come from all kinds of backgrounds, ranging from social sciences to traditional computer science backgrounds. Many people also see data science as a chance to rebrand themselves, which results in a huge influx of people looking to land their first role.

To make matters more complicated, unlike software development positions, which have more standardized interview processes, data science interviews can have huge variations. This is partly because, as an industry, there still isn't an agreed upon definition of a data scientist. Airbnb recognized this and decided to split their data scientists into three paths: Algorithms, Inference, and Analytics.

So before starting to search for a role, it's important to determine what flavor of data science appeals to you. Based on your response to that, what you study and what questions you'll be asked will vary. Despite the differences between the types, they'll generally follow a similar interview loop, although the particular questions asked may vary. In this article, we'll explore what to expect at each step of the interview process, along with some tips and ways to prepare. If you're looking for a list of data science questions that may come up in an interview, you should consider reading this and this.

15.2 The Coding Challenge

Coding challenges can range from a simple Fizzbuzz question to more complicated problems like building a time series forecasting model using messy data. These challenges will be timed (ranging anywhere from 30 minutes to one week) based on how complicated the questions are. Challenges can be hosted on sites such as HackerRank, CoderByte, and even internal company solutions.

More often than not, you'll be provided with written test cases that will tell you if you've passed or failed a question. This will typically consider both correctness as well as complexity (i.e. how long did it take to run your code). If you're not provided with tests, it's a good idea to write your own. With data science coding challenges, you may even encounter multiple-choice questions on statistics, so make sure you ask your recruiter what exactly you'll be tested on.

When you're doing a coding challenge, it's important to keep in mind that companies aren't always looking for the ‘correct’ solution. They may also be looking for code readability, good design, or even a specific optimal solution. So don't take it personally when, even after passing all the test cases, you don't get to the next stage in the interview process.
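As an illustration of the simple end of that range, here is one possible FizzBuzz solution together with self-written test cases of the kind suggested above. The exact problem statement varies between platforms, so the rules encoded here (multiples of 3 and 5) are the conventional ones and should be treated as an assumption.

```python
def fizzbuzz(n):
    """Return 'Fizz' for multiples of 3, 'Buzz' for multiples of 5,
    'FizzBuzz' for multiples of both, otherwise the number as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


# Self-written test cases covering each branch.
cases = {1: "1", 3: "Fizz", 5: "Buzz", 15: "FizzBuzz", 30: "FizzBuzz"}
results = {n: fizzbuzz(n) for n in cases}
print(results == cases)
```

Checking the combined case first keeps the branches independent and easy to read, which matters when reviewers are looking at readability as well as correctness.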

Preparation:

1. Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.

2. Review Brilliant for math and statistics questions.

3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips:

1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

2. Start with the hardest problem first. When you hit a snag, move to the simpler problem before returning to the harder one.

3. Focus on passing all the test cases first, then worry about improving complexity and readability.

4. If you're done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

5. It's okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5–10 hours to complete. Unless you're desperate, you can always walk away and spend your time preparing for the next interview.

15.3 The HR Screen

HR screens will consist of behavioral questions asking you to explain certain parts of your resume, why you wanted to apply to this company, and examples of when you may have had to deal with a particular situation in the workplace. Occasionally you may be asked a couple of simple technical questions, perhaps a SQL or a basic computer science theory question. Afterward, you'll be given a few minutes to ask questions of your own.

Keep in mind the person you're speaking to is unlikely to be technical, so they may not have a deep understanding of the role or the technical side of the organization. With that in mind, try to keep your questions focused on the company, the person's experience there, and logistical questions like how the interview loop typically runs. If you have specific questions they can't answer, you can always ask the recruiter to forward your questions to someone who can answer them.

Remember, interviews are a two-way street, so it would be in your best interest to identify any red flags before committing more time to interviewing with this particular company.

Preparation:

1. Read the role and company description.

2. Look up who your interviewer is going to be and try to find areas of rapport. Perhaps you both worked in a particular city or volunteer at similar nonprofits.

3. Read over your resume before getting on the call.

Tips:

1. Come prepared with questions.

2. Keep your resume in clear view.

3. Find a quiet space to take the interview. If that's not possible, reschedule the interview.

4. Focus on building rapport in the first few minutes of the call. If the recruiter wants to spend the first few minutes talking about last night's basketball game, let them.

5. Don't bad mouth your current or past companies. Even if the place you worked at was terrible, it will rarely benefit you.

15.4 The Technical Call

At this stage of the interview process, you'll have an opportunity to be interviewed by a technical member of the team. Calls such as these are typically conducted using platforms such as Coderpad, which includes a code editor along with a way to run your code. Occasionally you may be asked to write code in a Google doc. Thus, you should be comfortable coding without any syntax highlighting or code completion. Language-wise, Python and SQL are typically the two that you'll be asked to write in; however, this can differ based on the role and company.

Questions at this stage can range in complexity from a simple SQL question solved with a window function to problems involving Dynamic Programming. Regardless of the difficulty, you should always ask clarifying questions before starting to code. Once you have a good understanding of the problem and expectations, start with a brute-force solution so that you have at least something to work with. However, make sure you tell your interviewer that you're solving it first in a non-optimal way before thinking about optimization. After you have something working, start to optimize your solution and make your code more readable. Throughout the process, it's helpful to verbalize your approach, since interviewers may occasionally help guide you in the right direction.
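The brute-force-then-optimize approach can be sketched on a made-up question: does any pair of numbers in a list sum to a target? The problem and both function names are hypothetical and not taken from any particular interview.

```python
def has_pair_brute_force(numbers, target):
    """First pass, O(n^2): check every pair."""
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if numbers[i] + numbers[j] == target:
                return True
    return False


def has_pair_optimized(numbers, target):
    """Second pass, O(n): remember the values seen so far in a set."""
    seen = set()
    for value in numbers:
        if target - value in seen:
            return True
        seen.add(value)
    return False


numbers = [3, 8, 11, 4, 7]
print(has_pair_brute_force(numbers, 15), has_pair_optimized(numbers, 15))
```

Keeping the brute-force version around is useful during the call, because it can serve as a reference to check the optimized version against on small inputs.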

If you have a few minutes at the end of the interview, take advantage of the fact that you're speaking to a technical member of the team. Ask them about coding standards and processes, how the team handles work, and what their day-to-day looks like.

Preparation:

1. If the data science position you're interviewing for is part of the engineering organization, make sure to read Cracking the Coding Interview and Elements of Programming Interviews, since you may have a software engineer conducting the technical screen.

2. Flashcards are typically the best way to review machine learning theory, which may come up at this stage. You can either make your own or purchase this set for $12. The Machine Learning Cheatsheet is also a good resource to review.

3. Look at Glassdoor to get some insight into the type of questions that may come up.

4. Research who is going to interview you. A machine learning engineer with a PhD will interview you differently than a data analyst.

Tips:

1. It's okay to ask for help if you're stuck.

2. Practice mock technical calls with a friend, or use a platform like interviewing.io.

3. Don't be afraid to ask for a minute or two to think about a problem before you start solving it. Once you do start, it's important to walk your interviewer through your approach.

15.5 The Take Home Project

Take-homes have been rising in popularity within data science interview loops since they tend to be more closely tied to what you'll be doing once you start working. They can either occur after the first HR screen, prior to a technical screen, or serve as a deliverable for your onsite. Companies may test you on your ability to work with ambiguity (e.g. Here's a dataset, find some insights and pitch to business stakeholders) or focus on a more concrete deliverable (e.g. Here's some data, build a classifier).

When possible, try to ask clarifying questions to make sure you know what they're testing you on and who your audience will be. If the audience for your take-home is business stakeholders, it's not a good idea to fill your slides with technical jargon. Instead, focus on actionable insights and recommendations, and leave the technical jargon for the appendix.

While all take-homes may differ in their objectives, the common denominator is that you'll be receiving data from the company. So regardless of what they've asked you to do, the first step will always be Exploratory Data Analysis. Luckily, there are some automated EDA solutions such as SpeedML. Primarily, what you want to do here is investigate peculiarities in the data. More often than not, the company will have synthetically generated the data, leaving specific easter eggs for you to find (e.g. a power law distribution with customer revenue).
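One quick check for this kind of peculiarity is to compare the mean with the median and to see how much of the total comes from the largest few records. The sketch below uses only the Python standard library rather than SpeedML; the function names and the revenue figures are made up for illustration. A heavy right tail is consistent with a power law but does not prove one.

```python
from __future__ import annotations

import statistics


def top_share(values: list[float], fraction: float = 0.1) -> float:
    """Share of the total contributed by the largest `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


def skew_report(values: list[float]) -> dict[str, float]:
    """Summary numbers that hint at a heavy right tail."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "top_10pct_share": top_share(values, 0.1),
    }


# Hypothetical revenue per customer: nine small accounts and one very large one.
revenue = [1.0] * 9 + [91.0]
report = skew_report(revenue)
print(report)
```

A mean far above the median, together with a large top-decile share, is a signal to plot the distribution on a log scale before modelling anything.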

Once you finish your take-home, try to get some feedback from friends or mentors. Often, if you've been working on a take-home for long enough, you may start to miss the forest for the trees, so it's always good to get feedback from someone who doesn't have the context you do.

Preparation:

1. Practice take-home challenges, which you can either purchase from datamasked or find by looking at the answers without the questions on this GitHub repo.

2. Brush up on libraries and tools that may help with your work, for example SpeedML or Tableau for rapid data visualization.

Tips:

1. Some companies deliberately provide a take-home that requires you to email them to get additional information, so don't be afraid to get in touch.

2. A good take-home can often offset any poor performance at an onsite. The rationale is that despite not knowing how to solve a particular interview problem, you've demonstrated competency in solving problems that they may encounter on a daily basis. So if given the choice between doing more Leetcode problems or polishing your onsite presentation, it's worthwhile to focus on the latter.

3. Make sure to save every onsite challenge you do. You never know when you may need to reuse a component in future challenges.

4. It's okay to make assumptions as long as you state them. Information asymmetry is a given in these situations, and it's better to make an assumption than to continuously bombard your recruiter with questions.

15.6 The Onsite

An onsite will consist of a series of interviews throughout the day, including a lunch interview, which typically evaluates your 'culture fit'.

It's important to remember that any company that has gotten you to this stage wants to see you succeed. They've already spent a significant amount of money and time interviewing candidates to narrow it down to the onsite candidates, so have some confidence in your abilities.

Make sure to ask your recruiter for a list of people who will be interviewing you so that you have a chance to do some research beforehand. If you're interviewing with a director, you should focus on preparing for higher-level questions such as company strategy and culture. On the other hand, if you're interviewing with a software engineer, it's likely that they'll ask you to whiteboard a programming question. As mentioned before, the person's background will influence the type of questions they'll ask.

Preparation:

1. Read as much as you can about the company. The company website, CrunchBase, Wikipedia, recent news articles, Blind, and Glassdoor all serve as great resources for information gathering.

2. Do some mock interviews with a friend who can give you feedback on any verbal tics you may exhibit or holes in your answers. This is especially helpful if you have a take-home presentation that you'll be giving at the onsite.

3. Have stories prepared for common behavioural interview questions such as 'Tell me about yourself', 'Why this company?', and 'Tell me about a time you had to deal with a difficult colleague'.

4. If you have any software engineers on your onsite day, there's a good chance you'll need to brush up on your data structures and algorithms.

Tips:

1. Don't be too serious. Most of these interviewers would rather be back at their desk working on their assigned projects, so try your best to make it a pleasant experience for your interviewer.

2. Make sure to dress the part. If you're interviewing at an East Coast Fortune 500, it's likely you'll need to dress much more conservatively than if you were interviewing with a startup on the West Coast.

3. Take advantage of bathroom and water breaks to recompose yourself.

4. Ask questions you're actually interested in. You're interviewing the company just as much as they are interviewing you.

5. Send a short thank-you note to your recruiter and hiring manager after the onsite.

15.7 The Offer And Negotiation

Negotiating may seem uncomfortable for many people, especially for those without previous industry experience. However, the reality is that negotiating has almost no downside (as long as you're polite about it) and lots of upside.

Typically, companies will inform you over the phone that they're planning on giving you an offer. At this point it may be tempting to commit and accept the offer on the spot. Instead, you should convey your excitement about the offer and ask that they give you some time to discuss it with your significant other or a friend. You can also be up front and tell them you're still in the interview loop with a couple of other companies and that you'll get back to them shortly. Sometimes these offers come with deadlines; however, these are often quite arbitrary and can be pushed by a simple request on your part.

Your ability to negotiate ultimately rests on a variety of factors, but the biggest one is optionality. If you have two great offers in hand, it's much easier to negotiate because you have the optionality to walk away.

When you're negotiating, there are various levers you can pull. The three main ones are your base salary, stock options, and signing/relocation bonus. Every company has a different policy, which means some levers may be easier to pull than others. Generally speaking, signing/relocation is the easiest to negotiate, followed by stock options and then base salary. So if you're in a weaker position, ask for a higher signing/relocation bonus. However, if you're in a strong position, it may be in your best interest to increase your base salary. The reason is that not only will it act as a higher multiplier when you get raises, but it will also have an effect on company benefits such as 401k matching and employee stock purchase plans. That said, each situation is different, so make sure to reprioritize what you negotiate as necessary.
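To make the multiplier effect concrete, the sketch below compares a one-time signing bonus with an equivalent bump in base salary. Every number in it is a hypothetical assumption chosen for illustration (the salaries, a 3% annual raise, and an employer 401k match of 4% applied to the full salary); real compensation plans vary and taxes are ignored.

```python
def total_pay(base: float, signing_bonus: float, raise_rate: float,
              match_rate: float, years: int) -> float:
    """Cumulative pay over `years`: one-time bonus plus salary and employer match."""
    total = signing_bonus
    salary = base
    for _ in range(years):
        total += salary * (1 + match_rate)  # salary plus the assumed employer match
        salary *= 1 + raise_rate            # raises compound on the current base
    return total


# Hypothetical offers: a 10,000 signing bonus versus 5,000 more in base salary.
offer_bonus = total_pay(100_000, 10_000, raise_rate=0.03, match_rate=0.04, years=4)
offer_base = total_pay(105_000, 0, raise_rate=0.03, match_rate=0.04, years=4)
print(round(offer_bonus), round(offer_base))
```

Under these assumptions the higher base overtakes the bonus within a few years, because every raise and every matched contribution is computed on the larger number.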

Preparation:

1. One of the best resources on negotiation is an article written by Haseeb Qureshi that details how he went from boot camp grad to receiving offers from Google, Airbnb, and many others.

Tips:

1. If you aren't good at speaking on the fly, it may be advantageous to let calls from recruiters go to voicemail so you can compose yourself before you call them back. It's highly unlikely that you'll be getting a rejection call, since those are typically done over email. This means that before you call them back, you should mentally rehearse what you'll say when they inform you that they want to give you an offer.

2. Show genuine excitement for the company. Recruiters can sense when a candidate is only in it for the money, and they may be less likely to help you out in the negotiating process.

3. Always leave things off on a good note. Even if you don't accept an offer from a company, it's important to be polite and candid with your recruiters. The tech industry can be a surprisingly small place, and your reputation matters.

4. Don't reject other companies or stop interviewing until you have an actual offer in hand. Verbal offers have a history of being retracted, so don't celebrate until you have something in writing.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully this article has given you some insight on what to expect in a data science interview loop.

The process also isn't perfect, and there will be times that you fail to impress an interviewer because you don't possess some obscure piece of knowledge. However, with persistence and adequate preparation, you'll be able to land a data science job in no time.


CHAPTER 16

Learning How To Learn

• Introduction

• Upgrading Your Learning Toolbox

• Common Learning Traps

• Steps For Effective Learning

16.1 Introduction

More than any other skill, the skill of learning and retaining information is of utmost importance for a data scientist. Depending on your role, you may be expected to be a domain expert on a variety of topics, so learning quickly and efficiently is in your best interest. Data science is also a constantly evolving field, with new frameworks and techniques being developed. A data scientist who is able to stay on top of trends and synthesize new information will be an invaluable asset to any organization.

16.2 Upgrading Your Learning Toolbox

Below are a handful of strategies that, when applied, can have a great impact on your ability to learn and retain information. They're best used in conjunction with one another.

Recall: This strategy consists of quizzing yourself on topics you've been learning, with the most common way of doing this being flashcards. One study showed a retention difference of 46 between students who used active recall as a learning strategy and the control group, which used a passive reading technique. As you practice recall, it's important to employ spaced repetition. The recommended spaced repetition is 1 day, 3 days, 7 days, and 21 days, to offset the forgetting curve.
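The review schedule above can be turned into calendar dates with a few lines of Python. This is a minimal sketch that assumes each interval is counted from the day the material was first studied; the text does not say whether the intervals are cumulative, and the function name is a hypothetical choice.

```python
from __future__ import annotations

from datetime import date, timedelta

# Review offsets in days, taken from the schedule described above.
REVIEW_OFFSETS = (1, 3, 7, 21)


def review_dates(first_studied: date, offsets: tuple[int, ...] = REVIEW_OFFSETS) -> list[date]:
    """Return the dates on which a flashcard should be reviewed."""
    return [first_studied + timedelta(days=d) for d in offsets]


# Assumption: every offset is measured from the first study date.
for due in review_dates(date(2019, 3, 19)):
    print(due.isoformat())
```

Flashcard tools usually adjust these intervals based on how well you answer, so treat the fixed offsets as a starting point.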

Chunking: This strategy refers to tying individual pieces of information into larger units. For example, you can remember the Buddhist Precepts for lay people by tying them into the acronym KILSS (No Killing, No Intoxication, No Lying, No Stealing, No Sexual Misconduct). You can also chunk information into images. For example, computer memory hierarchy consists of five levels, which can be visualized as a five next to a pyramid.

Interleaving: To interleave is to mix your learning with other subjects and approaches. One study showed that blocked practice on one problem type led to students not being able to tell the difference between problems later on, and that interleaving led to better results. In my case, whenever I have a subject I'd like to learn more about, I collect a diverse set of resources. When I wanted to learn about Machine Learning, I combined the technical learning through textbooks and online courses with science fiction TV shows, podcasts, and movies. This strategy can also be used for fields that have some relation to one another, for example Chemistry and Biology, where you'll be able to note both similarities and differences.

Deliberate Practice: This type of practice refers to focused concentration with the goal of furthering one's abilities. Keeping in mind that some subjects lend themselves to deliberate practice more easily than others, the gold standard of deliberate practice generally consists of the following:

• A feedback loop in the form of a competition or a test

• A teacher that can guide you

• Uniquely tailored practice regimen

• Outside of your comfort zone

• Has a concrete goal or performance

• When practicing, you need to concentrate fully

• Builds or modifies skills

• Has a known training method or mental model from experts in the field that you can aspire towards

Your Ideal Teacher: An experienced teacher can mean the difference between breaking through or breaking down. As such, it's important to consider the following when looking for a teacher:

• They are someone who can keep you up to date on specific progress

• Provides practice exercises

• Directs attention to what aspects of learning you should be paying attention to

• Helps develop correct mental representations

• Provides feedback on what errors you're making

• Gives you an aspirational model for what good performance looks like

Dynamic Testing: As you learn, it's important to have some sort of feedback loop to see how much you're actually learning and where your holes are. The easiest way to do this is with flashcards, but other ways could include writing a blog post or explaining a concept to a friend who can point out inconsistencies. In doing so, you'll quickly notice the gaps in your knowledge and the concepts you need to revisit in your future learning sessions.

Pomodoro technique: This is a technique I've been using consistently for the past few years, and I've been very happy with it so far. The concept is pretty simple: you start a timer for either 25 or 50 minutes and focus on one single thing until it rings. Once the timer is up, you take a 5- or 10-minute break before starting again.
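If you'd like to plan a study session in advance, the alternation of focus blocks and breaks is easy to lay out in code. The sketch below is one possible way to do it; the function name and the start time are hypothetical, and the defaults use the 25-minute focus and 5-minute break mentioned above.

```python
from datetime import datetime, timedelta


def pomodoro_schedule(start: datetime, rounds: int,
                      focus_minutes: int = 25, break_minutes: int = 5):
    """Return (label, start, end) tuples alternating focus blocks and breaks."""
    schedule = []
    current = start
    for _ in range(rounds):
        focus_end = current + timedelta(minutes=focus_minutes)
        schedule.append(("focus", current, focus_end))
        break_end = focus_end + timedelta(minutes=break_minutes)
        schedule.append(("break", focus_end, break_end))
        current = break_end
    return schedule


# Two rounds starting at 09:00 on a hypothetical study day.
for label, begin, end in pomodoro_schedule(datetime(2019, 3, 19, 9, 0), rounds=2):
    print(label, begin.strftime("%H:%M"), end.strftime("%H:%M"))
```

Passing `focus_minutes=50, break_minutes=10` gives the longer variant.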

16.3 Common Learning Traps

Authority Trap: The authority trap refers to the tendency to believe someone is correct simply because of their status or prestige. As an effective learner, it's important to keep an open mind and be able to entertain multiple contradicting ideas. This is especially the case as you start getting deeper into a field of study where certain issues are still being debated. Keeping an open mind and referencing multiple resources is key in forming the right mental models.

Confirmation Bias: Only seeking out knowledge that confirms existing beliefs and disregarding counter-evidence. If you find yourself agreeing with everything you're learning, find something that you disagree with to mix things up.

Dunning-Kruger Effect: Not realizing you're incompetent. To combat this, find ways to test your abilities through dynamic testing or in public forums.

Einstellung: Our mindset prevents us from seeing new solutions or grasping new knowledge. To counteract this, learn to step back from the problem and take a break. Sometimes all it takes is a relaxing hot shower or a long walk to be able to break through a tough conceptual learning challenge.

Fluency Illusion: Learning something complex and thinking you understand it. For example, you may read about computing integrals, thinking you understand them, and then, when tested, fail to get any of the answers correct. Thus, having some sort of feedback loop to check your understanding is the quickest way to defuse this. Another way would be to explain what you learned to someone smarter than you.

Multitasking: Switching constantly between tasks and thinking modes. When you're learning, it's important to focus completely on the learning material at hand. By letting your attention be hijacked by notifications or other less important tasks, you miss out on the ability to get into a flow state.

16.4 Steps For Effective Learning

Before iterating through these steps, it's important to compile a set of tasks and resources that you'll be referencing during your time spent learning. You don't want to waste time filtering through content, so make sure you have your learning material ready.

1. First, check your mood. Are you sleepy, agitated, or upset? If so, change your mood before starting to learn. This could be as simple as taking a walk, drinking a cup of water, or taking a nap.

2. Before engaging with the material, try your best to forget what you know about the subject, opening yourself to new experiences and interpretations.

3. While learning, maintain an active mode of thinking and avoid lapsing into passive learning. This consists of questioning, prodding, and trying to connect the material back to previous experiences.

4. Once you've covered a certain threshold of material, you can then employ dynamic testing with the new concepts, using flashcards with spaced repetition to convert the learning into long-term memory.

5. Once you've got a good grasp of the material, proceed to teach the concepts to someone else. This could be a friend or even a stranger on the internet.

6. To really nail down what you learned, reproduce or incorporate the learning somehow into your life. This could entail writing, turning your learning into a mindmap, or integrating it with other projects you're working on.

Additional Resources

• https://www.forbes.com/sites/stevenkotler/2013/06/02/learning-to-learn-faster-the-one-superpower-everyone-needs/#53a1a7f62dd7

• https://static.coggle.it/diagram/WMbg3JvOtwABM9gV/t/learning-how-to-learn

• https://coggle.it/diagram/V83cTEMcVU1E-DGf/t/learning-to-learn-cease-to-grow-%E2%80%9D/c52462e2ac70ccff427070e1cf650eacdbe1cf06cd0eb9f0013dc53a2079cc8a

• https://www.coursera.org/learn/learning-how-to-learn


CHAPTER 17

Communication

• Introduction

• Subj_1

17.1 Introduction

xxx

17.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 18

Product

• Introduction

• Subj_1

18.1 Introduction

xxx

18.2 Subj_1

xxx

References


CHAPTER 19

Stakeholder Management

• Introduction

• Subj_1

19.1 Introduction

xxx

19.2 Subj_1

Code

Some code

print("hello")

References


CHAPTER 20

Datasets

Public datasets in vision, NLP, and more, forked from caesar0301's awesome datasets wiki.

• Agriculture

• Art

• Biology

• Chemistry/Materials Science

• Climate/Weather

• Complex Networks

• Computer Networks

• Data Challenges

• Earth Science

• Economics

• Education

• Energy

• Finance

• GIS

• Government

• Healthcare

• Image Processing

• Machine Learning

• Museums

• Music

• Natural Language

• Neuroscience

• Physics

• Psychology/Cognition

• Public Domains

• Search Engines

• Social Networks

• Social Sciences

• Software

• Sports

• Time Series

• Transportation

20.1 Agriculture

• US Department of Agriculture's PLANTS Database

• US Department of Agriculture's Nutrient Database

20.2 Art

• Google's Quick Draw Sketch Dataset

20.3 Biology

• 1000 Genomes

• American Gut (Microbiome Project)

• Broad Bioimage Benchmark Collection (BBBC)

• Broad Cancer Cell Line Encyclopedia (CCLE)

• Cell Image Library

• Complete Genomics Public Data

• EBI ArrayExpress

• EBI Protein Data Bank in Europe

• Electron Microscopy Pilot Image Archive (EMPIAR)

• ENCODE project

• Ensembl Genomes

• Gene Expression Omnibus (GEO)

• Gene Ontology (GO)

• Global Biotic Interactions (GloBI)

• Harvard Medical School (HMS) LINCS Project

• Human Genome Diversity Project

• Human Microbiome Project (HMP)

• ICOS PSP Benchmark

• International HapMap Project

• Journal of Cell Biology DataViewer

• MIT Cancer Genomics Data

• NCBI Proteins

• NCBI Taxonomy

• NCI Genomic Data Commons

• NIH Microarray data or FTP (see FTP link on RAW)

• OpenSNP genotypes data

• Pathguide - Protein-Protein Interactions Catalog

• Protein Data Bank

• Psychiatric Genomics Consortium

• PubChem Project

• PubGene (now Coremine Medical)

• Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

• Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

• Sequence Read Archive (SRA)

• Stanford Microarray Data

• Stowers Institute Original Data Repository

• Systems Science of Biological Dynamics (SSBD) Database

• The Cancer Genome Atlas (TCGA), available via Broad GDAC

• The Catalogue of Life

• The Personal Genome Project or PGP

• UCSC Public Data

• UniGene

• Universal Protein Resource (UniProt)

20.4 Chemistry/Materials Science

• NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

• Open Quantum Materials Database

• Citrination Public Datasets

• Khazana Project

20.5 Climate/Weather

• Actuaries Climate Index

• Australian Weather

• Aviation Weather Center - Consistent, timely, and accurate weather information for the world airspace system

• Brazilian Weather - Historical data (in Portuguese)

• Canadian Meteorological Centre

• Climate Data from UEA (updated monthly)

• European Climate Assessment & Dataset

• Global Climate Data Since 1929

• NASA Global Imagery Browse Services

• NOAA Bering Sea Climate

• NOAA Climate Datasets

• NOAA Realtime Weather Models

• NOAA SURFRAD Meteorology and Radiation Datasets

• The World Bank Open Data Resources for Climate Change

• UEA Climatic Research Unit

• WorldClim - Global Climate Data

• WU Historical Weather Worldwide

20.6 Complex Networks

• AMiner Citation Network Dataset

• CrossRef DOI URLs

• DBLP Citation dataset

• DIMACS Road Networks Collection

• NBER Patent Citations

• Network Repository with Interactive Exploratory Analysis Tools

• NIST complex networks data collection

• Protein-protein interaction network

• PyPI and Maven Dependency Network

• Scopus Citation Database

• Small Network Data

• Stanford GraphBase (Steven Skiena)

• Stanford Large Network Dataset Collection

• Stanford Longitudinal Network Data Sources

• The Koblenz Network Collection

• The Laboratory for Web Algorithmics (UNIMI)

• The Nexus Network Repository

• UCI Network Data Repository

• UFL sparse matrix collection

• WSU Graph Database

20.7 Computer Networks

• 3.5B Web Pages from CommonCrawl 2012

• 53.5B Web clicks of 100K users in Indiana Univ.

• CAIDA Internet Datasets

• ClueWeb09 - 1B web pages

• ClueWeb12 - 733M web pages

• CommonCrawl Web Data over 7 years

• CRAWDAD Wireless datasets from Dartmouth Univ.

• Criteo click-through data

• OONI Open Observatory of Network Interference - Internet censorship data

• Open Mobile Data by MobiPerf

• Rapid7 Sonar Internet Scans

• UCSD Network Telescope, IPv4 /8 net

20.8 Data Challenges

• Bruteforce Database

• Challenges in Machine Learning

• CrowdANALYTIX dataX

• D4D Challenge of Orange

• DrivenData Competitions for Social Good

• ICWSM Data Challenge (since 2009)

• Kaggle Competition Data

• KDD Cup by Tencent 2012

• Localytics Data Visualization Challenge

• Netflix Prize

• Space Apps Challenge

• Telecom Italia Big Data Challenge

• TravisTorrent Dataset - MSR'2017 Mining Challenge

• Yelp Dataset Challenge

20.9 Earth Science

• AQUASTAT - Global water resources and uses

• BODC - marine data of ~22K vars

• Earth Models

• EOSDIS - NASA's earth observing system data

• Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements or on S3

• Marinexplore - Open Oceanographic Data

• Smithsonian Institution Global Volcano and Eruption Database

• USGS Earthquake Archives

20.10 Economics

• American Economic Association (AEA)

• EconData from UMD

• Economic Freedom of the World Data

• Historical Macroeconomic Statistics

• International Economics Database and various data tools

• International Trade Statistics

• Internet Product Code Database

• Joint External Debt Data Hub

• Jon Haveman International Trade Data Links

• OpenCorporates Database of Companies in the World

• Our World in Data

• SciencesPo World Trade Gravity Datasets

• The Atlas of Economic Complexity

• The Center for International Data

• The Observatory of Economic Complexity

• UN Commodity Trade Statistics

• UN Human Development Reports

20.11 Education

• College Scorecard Data

• Student Data from Free Code Camp

20.12 Energy

• AMPds

• BLUEd

• COMBED

• Dataport

• DRED

• ECO

• EIA

• HES - Household Electricity Study, UK

• HFED

• iAWE

• PLAID - the Plug Load Appliance Identification Dataset

• REDD

• Tracebase

• UK-DALE - UK Domestic Appliance-Level Electricity

• WHITED

20.13 Finance

• CBOE Futures Exchange

• Google Finance

• Google Trends

• NASDAQ

• NYSE Market Data (see FTP link on RAW)

• OANDA

• OSU Financial data

• Quandl

• St. Louis Federal

• Yahoo Finance

20.14 GIS

• ArcGIS Open Data portal

• Cambridge, MA, US, GIS data on GitHub

• Factual Global Location Data

• Geo Spatial Data from ASU

• Geo Wiki Project - Citizen-driven Environmental Monitoring

• GeoFabrik - OSM data extracted to a variety of formats and areas

• GeoNames Worldwide

• Global Administrative Areas Database (GADM)

• Homeland Infrastructure Foundation-Level Data

• Landsat 8 on AWS

• List of all countries in all languages

• National Weather Service GIS Data Portal

• Natural Earth - vectors and rasters of the world

• OpenAddresses

• OpenStreetMap (OSM)

• Pleiades - Gazetteer and graph of ancient places

• Reverse Geocoder using OSM data & additional high-resolution data files

• TIGER/Line - US boundaries and roads

• TwoFishes - Foursquare's coarse geocoder

• TZ Timezones shapefiles

• UN Environmental Data

• World boundaries from the US Department of State

• World countries in multiple formats

20.15 Government

• A list of cities and countries contributed by community

• Open Data for Africa

• OpenDataSoft's list of 1,600 open data

20.16 Healthcare

• EHDP Large Health Data Sets

• Gapminder World demographic databases

• Medicare Coverage Database (MCD), US

• Medicare Data Engine of medicare.gov Data

• Medicare Data File

• MeSH, the vocabulary thesaurus used for indexing articles for PubMed

• Number of Ebola Cases and Deaths in Affected Countries (2014)

• Open-ODS (structure of the UK NHS)

• OpenPaymentsData, Healthcare financial relationship data

• The Cancer Genome Atlas project (TCGA) and BigQuery table

• World Health Organization Global Health Observatory

20.17 Image Processing

• 10k US Adult Faces Database

• 2GB of Photos of Cats or Archive version

• Adience Unfiltered faces for gender and age classification

• Affective Image Classification

• Animals with attributes

• Caltech Pedestrian Detection Benchmark

• Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)

• Face Recognition Benchmark

• GDXray: X-ray images for X-ray testing and Computer Vision

• ImageNet (in WordNet hierarchy)

• Indoor Scene Recognition

• International Affective Picture System, UFL

• Massive Visual Memory Stimuli, MIT

• MNIST database of handwritten digits, near 1 million examples

• Several Shape-from-Silhouette Datasets

• Stanford Dogs Dataset

• SUN database, MIT

• The Action Similarity Labeling (ASLAN) Challenge

• The Oxford-IIIT Pet Dataset

• Violent-Flows - Crowd Violence / Non-violence Database and benchmark

• Visual genome

• YouTube Faces Database

20.18 Machine Learning

• Context-aware data sets from five domains

• Delve Datasets for classification and regression (Univ. of Toronto)

• Discogs Monthly Data

• eBay Online Auctions (2012)

• IMDb Database

• Keel Repository for classification, regression, and time series

• Labeled Faces in the Wild (LFW)

• Lending Club Loan Data

• Machine Learning Data Set Repository

• Million Song Dataset

• More Song Datasets

• MovieLens Data Sets

• New Yorker caption contest ratings

• RDataMining - "R and Data Mining" ebook data

• Registered Meteorites on Earth

• Restaurants Health Score Data in San Francisco

• UCI Machine Learning Repository

• Yahoo Ratings and Classification Data

• Youtube 8m

20.19 Museums

• Canada Science and Technology Museums Corporation's Open Data

• Cooper-Hewitt's Collection Database

• Minneapolis Institute of Arts metadata

• Natural History Museum (London) Data Portal

• Rijksmuseum Historical Art Collection

• Tate Collection metadata

• The Getty vocabularies

20.20 Music

• Nottingham Folk Songs

• Bach 10

2021 Natural Language

bull Automatic Keyphrase Extracttion

bull Blogger Corpus

bull CLiPS Stylometry Investigation Corpus

bull ClueWeb09 FACC

bull ClueWeb12 FACC

bull DBpedia - 458M things with 583M facts

bull Flickr Personal Taxonomies

bull Freebasecom of people places and things

bull Google Books Ngrams (22TB)

bull Google MC-AFP generated based on the public available Gigaword dataset using Paragraph Vectors

bull Google Web 5gram (1TB 2006)

bull Gutenberg eBooks List

bull Hansards text chunks of Canadian Parliament

bull Machine Comprehension Test (MCTest) of text from Microsoft Research

bull Machine Translation of European languages

bull Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

bull Multi-Domain Sentiment Dataset (version 20)

bull Open Multilingual Wordnet

bull Personae Corpus

bull SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic 30K articles)

bull SMS Spam Collection in English

bull Universal Dependencies

bull USENET postings corpus of 2005~2011

bull Webhose - News/Blogs in multiple languages

bull Wikidata - Wikipedia databases

bull Wikipedia Links data - 40 Million Entities in Context

bull WordNet databases and tools

20.22 Neuroscience

bull Allen Institute Datasets

bull Brain Catalogue

bull Brainomics

bull CodeNeuro Datasets

bull Collaborative Research in Computational Neuroscience (CRCNS)

bull FCP-INDI

bull Human Connectome Project

bull NDAR

bull NeuroData

bull Neuroelectro

bull NIMH Data Archive

bull OASIS

bull OpenfMRI

bull Study Forrest

20.23 Physics

bull CERN Open Data Portal

bull Crystallography Open Database

bull NASA Exoplanet Archive

bull NSSDC (NASA) data of 550 spacecraft

bull Sloan Digital Sky Survey (SDSS) - Mapping the Universe

20.24 Psychology/Cognition

bull OSU Cognitive Modeling Repository Datasets

20.25 Public Domains

bull Amazon

bull Archive-it from Internet Archive

bull Archive.org Datasets

bull CMU JASA data archive

bull CMU StatLab collections

bull Data.World

bull Data360

bull Datamob.org

bull Google

bull Infochimps

bull KDNuggets Data Collections

bull Microsoft Azure Data Market Free DataSets

bull Microsoft Data Science for Research

bull Numbray

bull Open Library Data Dumps

bull Reddit Datasets

bull RevolutionAnalytics Collection

bull Sample R data sets

bull Stats4Stem R data sets

bull StatSci.org

bull The Washington Post List

bull UCLA SOCR data collection

bull UFO Reports

bull Wikileaks 9/11 pager intercepts

bull Yahoo Webscope

20.26 Search Engines

bull Academic Torrents of data sharing from UMB

bull Datahub.io

bull DataMarket (Qlik)

bull Harvard Dataverse Network of scientific data

bull ICPSR (UMICH)

bull Institute of Education Sciences

bull National Technical Reports Library

bull Open Data Certificates (beta)

bull OpenDataNetwork - A search engine of all Socrata powered data portals

bull Statista.com - Statistics and Studies

bull Zenodo - An open, dependable home for the long-tail of science

20.27 Social Networks

bull 72 hours gamergate Twitter Scrape

bull Ancestry.com Forum Dataset over 10 years

bull Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

bull CMU Enron Email of 150 users

bull EDRM Enron Email of 151 users, hosted on S3

bull Facebook Data Scrape (2005)

bull Facebook Social Networks from LAW (since 2007)

bull Foursquare from UMN/Sarwat (2013)

bull GitHub Collaboration Archive

bull Google Scholar citation relations

bull High-Resolution Contact Networks from Wearable Sensors

bull Mobile Social Networks from UMASS

bull Network Twitter Data

bull Reddit Comments

bull Skytrax' Air Travel Reviews Dataset

bull Social Twitter Data

bull SourceForge.net Research Data

bull Twitter Data for Online Reputation Management

bull Twitter Data for Sentiment Analysis

bull Twitter Graph of entire Twitter site

bull Twitter Scrape Calufa May 2011

bull UNIMI/LAW Social Network Datasets

bull Yahoo Graph and Social Data

bull YouTube Video Social Graph in 2007, 2008

20.28 Social Sciences

bull ACLED (Armed Conflict Location & Event Data Project)

bull Canadian Legal Information Institute

bull Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc.

bull Correlates of War Project

bull Cryptome Conspiracy Theory Items

bull Datacards

bull European Social Survey

bull FBI Hate Crime 2013 - aggregated data

bull Fragile States Index

bull GDELT Global Events Database

bull General Social Survey (GSS) since 1972

bull German Social Survey

bull Global Religious Futures Project

bull Humanitarian Data Exchange

bull INFORM Index for Risk Management

bull Institute for Demographic Studies

bull International Networks Archive

bull International Social Survey Program ISSP

bull International Studies Compendium Project

bull James McGuire Cross National Data

bull MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

bull Minnesota Population Center

bull MIT Reality Mining Dataset

bull Notre Dame Global Adaptation Index (ND-GAIN)

bull Open Crime and Policing Data in England, Wales and Northern Ireland

bull Paul Hensel General International Data Page

bull PewResearch Internet Survey Project

bull PewResearch Society Data Collection

bull Political Polarity Data

bull StackExchange Data Explorer

bull Terrorism Research and Analysis Consortium

bull Texas Inmates Executed Since 1984

bull Titanic Survival Data Set or on Kaggle

bull UCB's Archive of Social Science Data (D-Lab)

bull UCLA Social Sciences Data Archive

bull UN Civil Society Database

bull Universities Worldwide

bull UPJOHN for Labor Employment Research

bull Uppsala Conflict Data Program

bull World Bank Open Data

bull WorldPop project - Worldwide human population distributions

20.29 Software

bull FLOSSmole data about free, libre, and open source software development

20.30 Sports

bull Basketball (NBA/NCAA/Euro) Player Database and Statistics

bull Betfair Historical Exchange Data

bull Cricsheet Matches (cricket)

bull Ergast Formula 1 from 1950 up to date (API)

bull Football/Soccer resources (data and APIs)

bull Lahman's Baseball Database

bull Pinhooker Thoroughbred Bloodstock Sale Data

bull Retrosheet Baseball Statistics

bull Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams and Match Charting Project

20.31 Time Series

bull Databanks International Cross National Time Series Data Archive

bull Hard Drive Failure Rates

bull Heart Rate Time Series from MIT

bull Time Series Data Library (TSDL) from MU

bull UC Riverside Time Series Dataset

20.32 Transportation

bull Airlines OD Data 1987-2008

bull Bay Area Bike Share Data

bull Bike Share Systems (BSS) collection

bull GeoLife GPS Trajectory from Microsoft Research

bull German train system by Deutsche Bahn

bull Hubway Million Rides in MA

bull Marine Traffic - ship tracks, port calls and more

bull Montreal BIXI Bike Share

bull NYC Taxi Trip Data 2009-

bull NYC Taxi Trip Data 2013 (FOIA/FOILed)

bull NYC Uber trip data April 2014 to September 2014

bull Open Traffic collection

bull OpenFlights - airport, airline and route data

bull Philadelphia Bike Share Stations (JSON)

bull Plane Crash Database since 1920

bull RITA Airline On-Time Performance data

bull RITA/BTS transport data collection (TranStat)

bull Toronto Bike Share Stations (XML file)

bull Transport for London (TFL)

bull Travel Tracker Survey (TTS) for Chicago

bull US Bureau of Transportation Statistics (BTS)

bull US Domestic Flights 1990 to 2009

bull US Freight Analysis Framework since 2007

CHAPTER 21

Libraries

Machine learning libraries and frameworks, forked from josephmisiti's awesome-machine-learning.

bull APL

bull C

bull C++

bull Common Lisp

bull Clojure

bull Elixir

bull Erlang

bull Go

bull Haskell

bull Java

bull Javascript

bull Julia

bull Lua

bull Matlab

bull .NET

bull Objective C

bull OCaml

bull PHP

bull Python

bull Ruby

bull Rust

bull R

bull SAS

bull Scala

bull Swift

21.1 APL

General-Purpose Machine Learning

bull naive-apl - Naive Bayesian Classifier implementation in APL

21.2 C

General-Purpose Machine Learning

bull Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation

bull Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF)

bull Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms

Computer Vision

bull CCV - C-based/Cached/Core Computer Vision Library, a modern computer vision library

bull VLFeat - VLFeat is an open and portable library of computer vision algorithms, which has a Matlab toolbox

Speech Recognition

bull HTK - The Hidden Markov Model Toolkit. HTK is a portable toolkit for building and manipulating hidden Markov models

21.3 C++

Computer Vision

bull DLib - DLib has C++ and Python interfaces for face detection and training general object detectors

bull EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models

bull OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS

bull VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality, with Python bindings

General-Purpose Machine Learning

bull BanditLib - A simple Multi-armed Bandit library

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind [DEEP LEARNING]

bull CNTK - CNTK by Microsoft Research is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph

bull CUDA - This is a fast C++/CUDA implementation of convolutional neural networks [DEEP LEARNING]

bull CXXNET - Yet another deep learning framework, with fewer than 1000 lines of core code [DEEP LEARNING]

bull DeepDetect - A machine learning API and server written in C++11. It makes state of the art machine learning easy to work with and integrate into existing applications

bull Distributed Machine Learning Tool Kit (DMTK) Word Embedding

bull DLib - A suite of ML tools designed to be easy to embed in other applications

bull DSSTNE - A software library created by Amazon for training and deploying deep neural networks using GPUs, which emphasizes speed and scale over experimental flexibility

bull DyNet - A dynamic neural network library working well with networks that have dynamic structures that change for every training instance. Written in C++ with bindings in Python

bull encog-cpp

bull Fido - A highly-modular C++ machine learning library for embedded electronics and robotics

bull igraph - General purpose graph library

bull Intel(R) DAAL - A high performance software library developed by Intel and optimized for Intel's architectures. The library provides algorithmic building blocks for all stages of data analytics and allows data to be processed in batch, online and distributed modes

bull LightGBM - A framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks

bull MLDB - The Machine Learning Database is a database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs

bull mlpack - A scalable C++ machine learning library

bull ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualization and storage

bull shark - A fast, modular, feature-rich open-source C++ machine learning library

bull Shogun - The Shogun Machine Learning Toolbox

bull sofia-ml - Suite of fast incremental algorithms

bull Stan - A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling

bull Timbl - A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP

bull Vowpal Wabbit (VW) - A fast out-of-core learning system

bull Warp-CTC - A fast parallel implementation of Connectionist Temporal Classification (CTC), on both CPU and GPU

bull XGBoost - A parallelized, optimized, general purpose gradient boosting library

Natural Language Processing

bull BLLIP Parser

bull colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull CRF++ for segmenting/labeling sequential data & other Natural Language Processing tasks

bull CRFsuite for labeling sequential data

bull frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer

bull libfolia - C++ library for the FoLiA format

bull MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data

bull MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction

bull ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format

Speech Recognition

bull Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers

Sequence Analysis

bull ToPS - This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user-defined alphabet

Gesture Detection

bull grt - The Gesture Recognition Toolkit. GRT is a cross-platform, open-source, C++ machine learning library designed for real-time gesture recognition

21.4 Common Lisp

General-Purpose Machine Learning

bull mgl - Gaussian Processes

bull mgl-gpr - Evolutionary algorithms

bull cl-libsvm - Wrapper for the libsvm support vector machine library

21.5 Clojure

Natural Language Processing

bull Clojure-openNLP - Natural Language Processing in Clojure (opennlp)

bull Inflections-clj - Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

bull Touchstone - Clojure A/B testing library

bull Clojush - The Push programming language and the PushGP genetic programming system implemented in Clojure

bull Infer - Inference and machine learning in Clojure

bull Clj-ML - A machine learning library for Clojure built on top of Weka and friends

bull DL4CLJ - Clojure wrapper for Deeplearning4j

bull Encog

bull Fungp - A genetic programming library for Clojure

bull Statistiker - Basic Machine Learning algorithms in Clojure

bull clortex - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull comportex - Functionally composable Machine Learning library using Numenta's Cortical Learning Algorithm

bull cortex - Neural networks, regression and feature learning in Clojure

bull lambda-ml - Simple, concise implementations of machine learning techniques and utilities in Clojure

Data Analysis / Data Visualization

bull Incanter - Incanter is a Clojure-based R-like platform for statistical computing and graphics

bull PigPen - Map-Reduce for Clojure

bull Envision - Clojure Data Visualisation library based on Statistiker and D3

21.6 Elixir

General-Purpose Machine Learning

bull Simple Bayes - A Simple Bayes / Naive Bayes implementation in Elixir

Natural Language Processing

bull Stemmer - A stemming implementation in Elixir

21.7 Erlang

General-Purpose Machine Learning

bull Disco - Map Reduce in Erlang

21.8 Go

Natural Language Processing

bull go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm

bull paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm

bull snowball - Snowball Stemmer for Go

bull go-ngram - In-memory n-gram index with compression

General-Purpose Machine Learning

bull gago - Multi-population flexible parallel genetic algorithm

bull Go Learn - Machine Learning for Go

bull go-pr - Pattern recognition package in Go lang

bull go-ml - Linear / Logistic regression, Neural Networks, Collaborative Filtering and Gaussian Multivariate Distribution

bull bayesian - Naive Bayesian Classification for Golang

bull go-galib - Genetic Algorithms library written in Go / golang

bull Cloudforest - Ensembles of decision trees in go/golang

bull gobrain - Neural Networks written in go

bull GoNN - GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull go-mxnet-predictor - Go binding for MXNet c_predict_api to do inference with pre-trained model

Data Analysis / Data Visualization

bull go-graph - Graph library for Go/golang language

bull SVGo - The Go Language library for SVG generation

bull RF - Random forests implementation in Go

21.9 Haskell

General-Purpose Machine Learning

bull haskell-ml - Haskell implementations of various ML algorithms

bull HLearn - a suite of libraries for interpreting machine learning models according to their algebraic structure

bull hnn - Haskell Neural Network library

bull hopfield-networks - Hopfield Networks for unsupervised learning in Haskell

bull caffegraph - A DSL for deep neural networks

bull LambdaNet - Configurable Neural Networks in Haskell

21.10 Java

Natural Language Processing

bull Cortical.io - An API performing complex NLP operations as quickly and intuitively as the brain

bull CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words

bull Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences

bull Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger)

bull Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer

bull Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks

bull Tregex, Tsurgeon and Semgrex

bull Stanford Phrasal: A Phrase-Based Translation System

bull Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java

bull Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"

bull Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions

bull Stanford SPIED - Learning entities from unlabeled text, starting with seed sets, using patterns in an iterative fashion

bull Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets

bull Twitter Text Java - A Java implementation of Twitter's text processing library

bull MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

bull OpenNLP - a machine learning based toolkit for the processing of natural language text

bull LingPipe - A tool kit for processing text using computational linguistics

bull ClearTK - ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA

bull Apache cTAKES - Apache cTAKES is an open-source natural language processing system for information extraction from electronic medical record clinical free-text

bull ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license

bull CogcompNLP - Core NLP libraries developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc.; illinois-edison, a library for feature extraction from illinois-core-utilities data structures; and many other packages

General-Purpose Machine Learning

bull aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly

bull Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications

bull ELKI

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON

bull htm.java - General Machine Learning library using Numenta's Cortical Learning Algorithm

bull java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala

bull Mahout - Distributed machine learning

bull Meka

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - a service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Neuroph - Neuroph is a lightweight Java neural network framework

bull ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning

bull Samoa - SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug in different stream processing platforms

bull RankLib - RankLib is a library of learning to rank algorithms

bull rapaio - statistics, data mining and machine learning toolbox in Java

bull RapidMiner - RapidMiner integration into Java code

bull Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes

bull SmileMiner - Statistical Machine Intelligence & Learning Engine

bull SystemML - Flexible, scalable machine learning (ML) language

bull WalnutiQ - object oriented model of the human brain

bull Weka - Weka is a collection of machine learning algorithms for data mining tasks

bull LBJava - Learning Based Java is a modeling language for the rapid development of software systems that offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application

Speech Recognition

bull CMU Sphinx - Open Source Toolkit For Speech Recognition, a speech recognition library based purely on Java

Data Analysis / Data Visualization

bull Flink - Open source platform for distributed stream and batch data processing

bull Hadoop - Hadoop/HDFS

bull Spark - Spark is a fast and general engine for large-scale data processing

bull Storm - Storm is a distributed realtime computation system

bull Impala - Real-time Query for Hadoop

bull DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization

bull Dr. Michael Thomas Flanagan's Java Scientific Library

Deep Learning

bull Deeplearning4j - Scalable deep learning for industry with parallel GPUs

21.11 Javascript

Natural Language Processing

bull Twitter-text - A JavaScript implementation of Twitter's text processing library

bull NLP.js - NLP utilities in JavaScript and CoffeeScript

bull natural - General natural language facilities for node

bull Knwl.js - A Natural Language Processor in JS

bull Retext - Extensible system for analyzing and manipulating natural language

bull TextProcessing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition

bull NLP Compromise - Natural Language processing in the browser

Data Analysis / Data Visualization

bull D3.js

bull High Charts

bull NVD3.js

bull dc.js

bull chart.js

bull dimple

bull amCharts

bull D3xter - Straightforward plotting built on D3

bull statkit - Statistics kit for JavaScript

bull datakit - A lightweight framework for data analysis in JavaScript

bull science.js - Scientific and statistical computing in JavaScript

bull Z3d - Easily make interactive 3d plots built on Three.js

bull Sigma.js - JavaScript library dedicated to graph drawing

bull C3.js - Customizable library based on D3.js for easy chart drawing

bull Datamaps - Customizable SVG map/geo visualizations using D3.js

bull ZingChart - Library written in vanilla JS for big data visualization

bull cheminfo - Platform for data visualization and analysis using the visualizer project

General-Purpose Machine Learning

bull Convnet.js - ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]

bull Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser

bull Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser

bull Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm

bull DN2A - Digital Neural Networks Architecture

bull figue - K-means, fuzzy c-means and agglomerative clustering

bull Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js

bull Kmeans.js - Simple Javascript implementation of the k-means algorithm, for node.js and the browser

bull LDA.js - LDA topic modeling for node.js

bull Learning.js - Javascript implementation of logistic regression/c4.5 decision tree

bull Machine Learning - Machine learning library for Node.js

bull machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for competitions and exploration; just give it a .csv file

bull mil-tokyo - List of several machine learning libraries

bull Node-SVM - Support Vector Machine for node.js

bull Brain - Neural networks in JavaScript [Deprecated]

bull Bayesian-Bandit - Bayesian bandit implementation for Node and the browser

bull Synaptic - Architecture-free neural network library for node.js and the browser

bull kNear - JavaScript implementation of the k nearest neighbors algorithm for supervised learning

bull NeuralN - C++ Neural Network library for Node.js. It has advantages on large datasets and multi-threaded training

bull kalman - Kalman filter for Javascript

bull shaman - node.js library with support for both simple and multiple linear regression

bull ml.js - Machine learning and numerical analysis tools for Node.js and the browser

bull Pavlov.js - Reinforcement learning using Markov Decision Processes

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

Misc

bull sylvester - Vector and Matrix math for JavaScript

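bull simple-statistics - A JavaScript implementation of descriptive, regression, and inference statistics, designed to work in browsers as well as in node.js

bull regression-js - A javascript library containing a collection of least squares fitting methods for finding a trend in a set of data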

bull Lyric - Linear Regression library

bull GreatCircle - Library for calculating great circle distance

21.12 Julia

General-Purpose Machine Learning

bull MachineLearning - Julia Machine Learning library

bull MLBase - A set of functions to support the development of machine learning algorithms

bull PGM - A Julia framework for probabilistic graphical models

bull DA - Julia package for Regularized Discriminant Analysis

bull Regression

bull Local Regression - Local regression, so smooooth

bull Naive Bayes - Simple Naive Bayes implementation in Julia

bull Mixed Models - A Julia package for fitting mixed-effects models

bull Simple MCMC - Basic MCMC sampler implemented in Julia

bull Distance - Julia module for Distance evaluation

bull Decision Tree - Decision Tree Classifier and Regressor

bull Neural - A neural network in Julia

bull MCMC - MCMC tools for Julia

bull Mamba - Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia

bull GLM - Generalized linear models in Julia

bull Online Learning

bull GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet

bull Clustering - Basic functions for clustering data: k-means, dp-means, etc.

bull SVM - SVMs for Julia

bull Kernel Density - Kernel density estimators for Julia

bull Dimensionality Reduction - Methods for dimensionality reduction

bull NMF - A Julia package for non-negative matrix factorization

bull ANN - Julia artificial neural networks

bull Mocha - Deep Learning framework for Julia inspired by Caffe

bull XGBoost - eXtreme Gradient Boosting Package in Julia

bull ManifoldLearning - A Julia package for manifold learning and nonlinear dimensionality reduction

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull Merlin - Flexible Deep Learning Framework in Julia

bull ROCAnalysis - Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers

bull GaussianMixtures - Large scale Gaussian Mixture Models

bull ScikitLearn - Julia implementation of the scikit-learn API

bull Knet - Koç University Deep Learning Framework

Natural Language Processing

bull Topic Models - TopicModels for Julia

bull Text Analysis - Julia package for text analysis

Data Analysis / Data Visualization

bull Graph Layout - Graph layout algorithms in pure Julia

bull Data Frames Meta - Metaprogramming tools for DataFrames

bull Julia Data - library for working with tabular data in Julia

bull Data Read - Read files from Stata, SAS, and SPSS

bull Hypothesis Tests - Hypothesis tests for Julia

bull Gadfly - Crafty statistical graphics for Julia

bull Stats - Statistical tests for Julia

bull RDataSets - Julia package for loading many of the data sets available in R

bull DataFrames - library for working with tabular data in Julia

bull Distributions - A Julia package for probability distributions and associated functions

bull Data Arrays - Data structures that allow missing values

bull Time Series - Time series toolkit for Julia

bull Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

bull DSP

bull JuliaCon Presentations - Presentations for JuliaCon

bull SignalProcessing - Signal Processing tools for Julia

bull Images - An image library for Julia

21.13 Lua

General-Purpose Machine Learning

bull Torch7

bull cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy

bull autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version

bull graph - Graph package for Torch

bull randomkit - Numpy's randomkit, wrapped for Torch

bull signal - A signal processing toolbox for Torch-7: FFT, DCT, Hilbert, cepstrums, stft

bull nn - Neural Network package for Torch

bull torchnet - framework for torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming

bull nngraph - This package provides graphical computation for nn library in Torch7

bull nnx - A completely unstable and experimental package that extends Torch's builtin nn library

bull rnn - A Recurrent Neural Network library that extends Torch's nn: RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.

bull dpnn - Many useful features that aren't part of the main nn package

bull dp - A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns

bull optim - An optimization library for Torch: SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more

bull unsup

bull manifold - A package to manipulate manifolds

bull svm - Torch-SVM library

bull lbfgs - FFI Wrapper for liblbfgs

bull vowpalwabbit - An old vowpalwabbit interface to torch

bull OpenGM - OpenGM is a C++ library for graphical modeling and inference. The Lua bindings provide a simple way of describing graphs from Lua, and then optimizing them with OpenGM.

bull spaghetti - Spaghetti module for torch7 by MichaelMathieu

bull LuaSHKit - A Lua wrapper around the locality-sensitive hashing library SHKit

bull kernel smoothing - KNN, kernel-weighted average, local linear regression smoothers

bull cutorch - Torch CUDA Implementation

bull cunn - Torch CUDA Neural Network Implementation

bull imgraph - An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images.

bull videograph - A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos.

bull saliency - Code and tools around integral images. A library for finding interest points based on fast integral histograms.

bull stitch - Uses hugin to stitch images and apply the same stitching to a video sequence

bull sfm - A bundle adjustment/structure from motion package

bull fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules.

bull OverFeat - A state-of-the-art generic dense feature extractor

bull Numeric Lua

bull Lunatic Python

bull SciLua

bull Lua - Numerical Algorithms

bull Lunum

Demos and Scripts

bull Core torch7 demos repository: linear-regression, logistic-regression, face detector (training and detection as separate demos), mst-based-segmenter, train-a-digit-classifier, train-autoencoder, optical flow demo, train-on-housenumbers, train-on-cifar, tracking with deep nets, kinect demo, filter-bank visualization, saliency-networks

bull Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)

bull Music Tagging - Music Tagging scripts for torch7

bull torch-datasets - Scripts to load several popular datasets, including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST, NORB

bull Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment

21.14 Matlab

Computer Vision

bull Contourlets - MATLAB source code that implements the contourlet transform and its utility functions

bull Shearlets - MATLAB code for shearlet transform

bull Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform, designed to represent images at different scales and different angles

bull Bandlets - MATLAB code for bandlet transform

bull mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

bull NLP - An NLP library for Matlab

General-Purpose Machine Learning

bull Training a deep autoencoder or a classifier on MNIST

bull Convolutional-Recursive Deep Learning for 3D Object Classification - Convolutional-Recursive Deep Learning for 3D Object Classification [DEEP LEARNING]

bull t-Distributed Stochastic Neighbor Embedding - A technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets

bull Spider - The spider is intended to be a complete object-oriented environment for machine learning in Matlab

bull LibSVM - A Library for Support Vector Machines

bull LibLinear - A Library for Large Linear Classification

bull Machine Learning Module - Class on machine learning with PDFs, lectures, and code

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull Pattern Recognition Toolbox - A complete object-oriented environment for machine learning in Matlab

bull Pattern Recognition and Machine Learning - This package contains the Matlab implementation of the algorithms described in the book Pattern Recognition and Machine Learning by C. Bishop

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly with MATLAB.
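
The Optunity entry above positions the library as a drop-in replacement for grid search. For reference, plain grid search itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Optunity's API: the function grid_search, the toy objective, and the parameter names c and gamma are hypothetical, and the objective merely stands in for a cross-validated score.

```python
from itertools import product


def grid_search(objective, grid):
    """Evaluate `objective` on every combination of values in `grid`.

    `grid` maps parameter names to lists of candidate values.
    Returns (best_params, best_score), where higher scores are better.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(c, gamma):
    # Hypothetical stand-in for a cross-validated score; peaks at c=1.0, gamma=0.1.
    return -((c - 1.0) ** 2) - ((gamma - 0.1) ** 2)


best_params, best_score = grid_search(
    toy_objective,
    {"c": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]},
)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (nine evaluations here), which is the usual motivation for smarter hyperparameter optimizers.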

Data Analysis / Data Visualization

bull matlab_bgl - MatlabBGL is a Matlab package for working with graphs

bull gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions

21.15 .NET

Computer Vision

bull OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications

bull Emgu CV - Cross-platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android

bull AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.

bull Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

bull Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package

General-Purpose Machine Learning

bull Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications

bull Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.

bull DiffSharp - An automatic differentiation library for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.

bull Vulpes - Deep belief and deep learning implementation written in F# that leverages CUDA GPU execution with Alea.cuBase

bull Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI-based workbench is also provided to help model and train neural networks.

bull Neural Network Designer - DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

bull Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

Data Analysis / Data Visualization

bull numl - numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering

bull Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and everyday use. Supports .NET 4.0, .NET 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.

bull Sho - An interactive environment for data analysis and scientific computing, built to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

21.16 Objective C

General-Purpose Machine Learning

bull YCML

bull MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples by trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.

bull MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it's 20 times faster than its Java equivalent. Includes sample code for use from Swift.

bull BPN-NeuralNetwork - This network can be used in product recommendation, user behavior analysis, data mining and data analysis

bull Multi-Perceptron-NeuralNetwork - Multi-perceptron neural network designed with unlimited hidden layers

bull KRHebbian-Algorithm - Hebbian algorithm for neural networks in machine learning

bull KRKmeans-Algorithm - Implements K-Means, the clustering and classification algorithm. It can be used in data mining and image compression.

bull KRFuzzyCMeans-Algorithm - The fuzzy clustering/classification algorithm in machine learning. It can be used in data mining and image compression.
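
The K-Means entry above refers to a clustering technique that is easy to sketch. Below is a minimal illustration of the standard Lloyd iteration in plain Python (the listed project itself is Objective-C). The function name kmeans, the sample points, and the starting centroids are assumptions made for this example and are not taken from that library.

```python
from math import dist  # available in Python 3.8+


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move every centroid to the mean of its assigned points.

    Returns (centroids, clusters); clusters[i] holds the points that
    were assigned to centroid i during the last assignment pass.
    """
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step; an empty cluster keeps its previous centroid.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Starting centroids are passed in explicitly to keep the sketch deterministic; real implementations usually pick them at random or with a seeding scheme such as k-means++.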

21.17 OCaml

General-Purpose Machine Learning

bull Oml - A general statistics and machine learning library

bull GPR - Efficient Gaussian Process Regression in OCaml

bull Libra-Tk - Algorithms for learning and inference with discrete probabilistic models

bull TensorFlow - OCaml bindings for TensorFlow

21.18 PHP

Natural Language Processing

bull jieba-php - Chinese Words Segmentation Utilities

General-Purpose Machine Learning

bull PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library.

bull PredictionBuilder - A library for machine learning that builds predictions using a linear regression

21.19 Python

Computer Vision

bull Scikit-Image - A collection of algorithms for image processing in Python

bull SimpleCV - An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.

bull Vigranumpy - Python bindings for the VIGRA C++ computer vision library

bull OpenFace - Free and open source face recognition with deep neural networks

bull PCV - Open source Python module for computer vision

Natural Language Processing

bull NLTK - A leading platform for building Python programs to work with human language data

bull Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.

bull Quepy - A Python framework to transform natural language questions to queries in a database query language

bull TextBlob - A consistent API for common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.

bull YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora

bull jieba - Chinese Words Segmentation Utilities

bull SnowNLP - A library for processing Chinese text

bull spammy - A library for email Spam filtering built on top of nltk

bull loso - Another Chinese segmentation library

bull genius - A Chinese segmenter based on Conditional Random Field

bull KoNLPy - A Python package for Korean natural language processing

bull nut - Natural language Understanding Toolkit

bull Rosetta

bull BLLIP Parser

bull PyNLPl (https://github.com/proycon/pynlpl) - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments.

bull python-ucto

bull python-frog

bull python-zpar (https://github.com/EducationalTestingService/python-zpar) - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English

bull colibri-core - Python binding to C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way

bull spaCy - Industrial strength NLP with Python and Cython

bull PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies

bull Distance - Levenshtein and Hamming distance computation

bull Fuzzy Wuzzy - Fuzzy String Matching in Python

bull jellyfish - a python library for doing approximate and phonetic matching of strings

bull editdistance - fast implementation of edit distance

bull textacy - higher-level NLP built on spaCy

bull stanford-corenlp-python (https://github.com/dasmith/stanford-corenlp-python) - Python wrapper for Stanford CoreNLP
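
Several entries in the list above (Distance, editdistance, jellyfish) deal with string distances. As a rough illustration of what those packages compute, here is a minimal pure-Python sketch of Hamming and Levenshtein distance. The function names are chosen for this example and do not reproduce any of those libraries' APIs, which are typically much faster.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (dynamic programming,
    keeping only the previous row of the edit matrix)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete ca
                    current[j - 1] + 1,            # insert cb
                    previous[j - 1] + (ca != cb),  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))
print(levenshtein("kitten", "sitting"))
```

Hamming distance only applies to strings of equal length, while Levenshtein distance handles insertions and deletions as well, which is why fuzzy-matching tools usually build on the latter.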

General-Purpose Machine Learning

bull auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning.

bull machine learning (https://github.com/jeff1evesque/machine-learning) - Automated build consisting of a web-interface and a set of programmatic interfaces; results are stored into a NoSQL datastore

bull XGBoost Library

bull Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python

bull Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull scikit-learn - A Python module for machine learning built on top of SciPy

bull metric-learn - A Python module for metric learning

bull SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy to use, well documented and tested library.

bull astroML - Machine Learning and Data Mining for Astronomy

bull graphlab-create - A library with various machine learning models, implemented on top of a disk-backed DataFrame

bull BigML - A library that contacts external servers

bull pattern - Web mining module for Python

bull NuPIC - Numenta Platform for Intelligent Computing

bull Pylearn2 (https://github.com/lisa-lab/pylearn2) - A Machine Learning library based on Theano

bull keras (https://github.com/fchollet/keras) - Modular neural network library based on Theano

bull Lasagne - Lightweight library to build and train neural networks in Theano

bull hebel - GPU-Accelerated Deep Learning Library in Python

bull Chainer - Flexible neural network framework

bull prophet - Fast and automated time series forecasting framework by Facebook

bull gensim - Topic Modelling for Humans

bull topik - Topic modelling toolkit

bull PyBrain - Another Python Machine Learning Library

bull Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain.

bull Crab - A flexible, fast recommender engine

bull python-recsys - A Python library for implementing a Recommender System

bull thinking bayes - Book on Bayesian Analysis

bull Image-to-Image Translation with Conditional Adversarial Networks (https://github.com/williamFalcon/pix2pix-keras) - Implementation of image to image (pix2pix) translation from the paper by Isola et al. [DEEP LEARNING]

bull Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python [DEEP LEARNING]

bull Bolt - Bolt Online Learning Toolbox

bull CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

bull nilearn - Machine learning for NeuroImaging in Python

bull imbalanced-learn - Python module to perform under-sampling and over-sampling with various techniques

bull Shogun - The Shogun Machine Learning Toolbox

bull Pyevolve - Genetic algorithm framework

bull Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind

bull breze - Theano-based library for deep and recurrent neural networks

bull pyhsmm - Library for approximate unsupervised inference in Bayesian hidden Markov models (HMMs) and hidden semi-Markov models (HSMMs), focusing on the Bayesian nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations

bull mrjob - A library to let Python programs run on Hadoop

bull SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments

bull neurolab - https://github.com/zueve/neurolab

bull Spearmint - Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.

bull Pebl - Python Environment for Bayesian Learning

bull Theano - Optimizing, GPU-meta-programming, code-generating, array-oriented math compiler in Python

bull TensorFlow - Open source software library for numerical computation using data flow graphs

bull yahmm - Hidden Markov Models for Python, implemented in Cython for speed and efficiency

bull python-timbl - A Python extension module wrapping the full TiMBL C++ programming interface. TiMBL is an elaborate k-Nearest Neighbours machine learning toolkit.

bull deap - Evolutionary algorithm framework

bull pydeep - Deep Learning In Python

bull mlxtend - A library consisting of useful tools for data science and machine learning tasks

bull neon (https://github.com/NervanaSystems/neon) - Nervana's high-performance Python-based Deep Learning framework [DEEP LEARNING]

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search

bull Neural Networks and Deep Learning - Code samples for the book "Neural Networks and Deep Learning" [DEEP LEARNING]

bull Annoy - Approximate nearest neighbours implementation

bull skflow - Simplified interface for TensorFlow mimicking Scikit Learn

bull TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning.

bull pgmpy - A Python library for working with Probabilistic Graphical Models

bull DIGITS - A web application for training deep learning models

bull Orange - Open source data visualization and data analysis for novices and experts

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull milk - Machine learning toolkit focused on supervised classification

bull TFLearn - Deep learning library featuring a higher-level API for TensorFlow

bull REP - An IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides better user experience.

bull rgf_python Library

bull gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms

bull skbayes - Python package for Bayesian Machine Learning with scikit-learn API

bull fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision Tree and more. It's easy to use and easy to learn for beginners.

Data Analysis / Data Visualization

bull SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering

bull NumPy - A fundamental package for scientific computing with Python

bull Numba - Python JIT (just-in-time) compiler to LLVM, aimed at scientific Python, by the developers of Cython and NumPy

bull NetworkX - A high-productivity software for complex networks

bull igraph - binding to igraph library - General purpose graph library

bull Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools

bull Open Mining

bull PyMC - Markov Chain Monte Carlo sampling toolkit

bull zipline - A Pythonic algorithmic trading library

bull PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion, based around NumPy, SciPy, IPython, and matplotlib

bull SymPy - A Python library for symbolic mathematics

bull statsmodels - Statistical modeling and econometrics in Python

bull astropy - A community Python library for Astronomy

bull matplotlib - A Python 2D plotting library

bull bokeh - Interactive Web Plotting for Python

bull plotly - Collaborative web plotting for Python and matplotlib

bull vincent - A Python to Vega translator

bull d3py (https://github.com/mikedewar/d3py) - A plotting library for Python, based on D3.js

bull PyDexter - Simple plotting for Python. Wrapper for D3xter.js; easily render charts in-browser.

bull ggplot - Same API as ggplot2 for R

bull ggfortify - Unified interface to ggplot2 popular R packages

bull Kartograph.py - Rendering beautiful SVG maps in Python

bull pygal - A Python SVG Charts Creator

bull PyQtGraph - A pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy

bull pycascading

bull Petrel - Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python

bull Blaze - NumPy and Pandas interface to Big Data

bull emcee - The Python ensemble sampling toolkit for affine-invariant MCMC

bull windML - A Python Framework for Wind Energy Analysis and Prediction

bull vispy - GPU-based high-performance interactive OpenGL 2D/3D data visualization library

bull cerebro2 - A web-based visualization and debugging platform for NuPIC

bull NuPIC Studio - An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool

bull SparklingPandas

bull Seaborn - A python visualization library based on matplotlib

bull bqplot

bull pastalog - Simple realtime visualization of neural network training performance

bull caravel - A data exploration platform designed to be visual, intuitive, and interactive

bull Dora - Tools for exploratory data analysis in Python

bull Ruffus - Computation Pipeline library for python

bull SOMPY

bull somoclu - Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters; has Python API

bull HDBScan - implementation of the hdbscan algorithm in Python - used for clustering

bull visualize_ML - A python package for data exploration and data analysis

bull scikit-plot - A visualization library for quick and easy generation of common plots in data analysis and machine learning

Neural networks

bull Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences

bull Neuron - Neural networks learned with gradient descent or the Levenberg-Marquardt algorithm

bull Data Driven Code - Very simple implementation of neural networks for dummies in Python without using any libraries, with detailed comments

21.20 Ruby

Natural Language Processing

bull Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby

bull Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.

bull Stemmer - Expose libstemmer_c to Ruby

bull Ruby Wordnet - This library is a Ruby interface to WordNet

bull Raspell - raspell is an interface binding for Ruby

bull UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

bull Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets

General-Purpose Machine Learning

bull Ruby Machine Learning - Some Machine Learning algorithms implemented in Ruby

bull Machine Learning Ruby

bull jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby

bull CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications

bull rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector Machines

bull Random Forester - Creates Random Forest classifiers from PMML files

Data Analysis / Data Visualization

bull rsruby - Ruby - R bridge

bull data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby

bull ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files

bull plot-rb - A plotting library in Ruby built on top of Vega and D3

bull scruffy - A beautiful graphing toolkit for Ruby

bull SciRuby

bull Glean - A data management tool for humans

bull Bioruby

bull Arel

Misc

bull Big Data For Chimps

bull Listof (https://github.com/kevincobain2000/listof) - Community based data collection, packed in gem. Get list of pretty much anything (stop words, countries, non words) in txt, json or hash. Demo/Search for a list.

21.21 Rust

General-Purpose Machine Learning

bull deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license

bull rustlearn - A machine learning framework featuring logistic regression, support vector machines, decision trees and random forests

bull rusty-machine - A pure-Rust machine learning library

bull leaf (https://github.com/autumnai/leaf) - Open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]

bull RustNN - RustNN is a feedforward neural network library

21.22 R

General-Purpose Machine Learning

bull ahaz - ahaz: Regularization for semiparametric additive hazards regression

bull arules - arules: Mining Association Rules and Frequent Itemsets

bull biglasso - biglasso: Extending Lasso Model Fitting to Big Data in R

bull bigrf - bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets

bull bigRR (http://cran.r-project.org/web/packages/bigRR/index.html) - bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)

bull bmrm - bmrm: Bundle Methods for Regularized Risk Minimization Package

bull Boruta - Boruta: A wrapper algorithm for all-relevant feature selection

bull bst - bst: Gradient Boosting

bull C50 - C50: C5.0 Decision Trees and Rule-Based Models

bull caret - Classification and Regression Training: unified interface to ~150 ML algorithms in R

bull caretEnsemble - caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models

bull Clever Algorithms For Machine Learning

bull CORElearn - CORElearn: Classification, regression, feature evaluation and ordinal evaluation

bull CoxBoost - CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks

bull Cubist - Cubist: Rule- and Instance-Based Regression Modeling

bull e1071 - e1071: Misc Functions of the Department of Statistics, TU Wien

bull earth - earth: Multivariate Adaptive Regression Spline Models

bull elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA

bull ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning, Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman

bull evtree - evtree: Evolutionary Learning of Globally Optimal Trees

bull forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models

bull forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the "forecast" package

bull fpc - fpc: Flexible procedures for clustering

bull frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks

bull GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting

bull gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS

bull gbm - gbm: Generalized Boosted Regression Models

bull glmnet - glmnet: Lasso and elastic-net regularized generalized linear models

bull glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model

bull GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models

bull grplasso - grplasso: Fitting user specified models with Group Lasso penalty

bull grpreg - grpreg: Regularization paths for regression models with grouped covariates

bull h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale: Deeplearning, Random forests, GBM, KMeans, PCA, GLM

bull hda - hda: Heteroscedastic Discriminant Analysis

bull Introduction to Statistical Learning

bull ipred - ipred: Improved Predictors

bull kernlab - kernlab: Kernel-based Machine Learning Lab

bull klaR - klaR: Classification and visualization

bull lars - lars: Least Angle Regression, Lasso and Forward Stagewise

bull lasso2 - lasso2: L1 constrained estimation aka 'lasso'

bull LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library

bull LogicReg - LogicReg: Logic Regression

bull Machine Learning For Hackers

bull maptree - maptree: Mapping, pruning, and graphing tree models

bull mboost - mboost: Model-Based Boosting

bull medley - medley: Blending regression models, using a greedy stepwise approach

bull mlr - mlr: Machine Learning in R

bull mvpart - mvpart: Multivariate partitioning

bull ncvreg - ncvreg: Regularization paths for SCAD- and MCP-penalized regression models

bull nnet - nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models

bull obliquetree - obliquetree: Oblique Trees for Classification Data

bull pamr - pamr: Pam: prediction analysis for microarrays

bull party - party: A Laboratory for Recursive Partytioning

bull partykit - partykit: A Toolkit for Recursive Partytioning

bull penalized - penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model

bull penalizedLDA - penalizedLDA: Penalized classification using Fisher's linear discriminant

bull penalizedSVM - penalizedSVM: Feature Selection SVM using penalty functions

bull quantregForest - quantregForest: Quantile Regression Forests

bull randomForest - randomForest: Breiman and Cutler's random forests for classification and regression

bull randomForestSRC

bull rattle - rattle: Graphical user interface for data mining in R

bull rda - rda: Shrunken Centroids Regularized Discriminant Analysis

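bull rdetools - rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces

bull REEMtree - REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data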

bull relaxo - relaxo: Relaxed Lasso

bull rgenoud - rgenoud: R version of GENetic Optimization Using Derivatives

bull rgp - rgp: R genetic programming framework

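bull Rmalschains - Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R

bull rminer - rminer: Simpler use of data mining methods (e.g. NN and SVM) in classification and regression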

bull ROCR - ROCR Visualizing the performance of scoring classifiers

bull RoughSets - RoughSets Data Analysis Using Rough Set and Fuzzy Rough Set Theories

bull rpart - rpart Recursive Partitioning and Regression Trees

bull RPMM - RPMM Recursively Partitioned Mixture Model

bull RSNNS

bull RWeka - RWeka RWeka interface

bull RXshrink - RXshrink Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression

bull sda - sda Shrinkage Discriminant Analysis and CAT Score Variable Selection

bull SDDA - SDDA Stepwise Diagonal Discriminant Analysis

bull SuperLearner and subsemble - Multi-algorithm ensemble learning packages

bull svmpath - svmpath svmpath the SVM Path algorithm

bull tgp - tgp Bayesian treed Gaussian process models

bull tree - tree Classification and regression trees

bull varSelRF - varSelRF Variable selection using random forests

bull XGBoostR Library

bull Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R.

bull igraph - binding to igraph library - General purpose graph library

bull MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more

bull TDSP-Utilities

Data Analysis / Data Visualization

bull ggplot2 - A data visualization package based on the grammar of graphics

21.23 SAS

General-Purpose Machine Learning

bull Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code

bull Factory Miner - Automatically creates deployable machine learning models across numerous market or customer segments using a GUI

Data Analysis / Data Visualization

bull SAS/STAT - For conducting advanced statistical analysis

bull University Edition - FREE! Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses

High Performance Machine Learning

bull High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI or code in an MPP environment, including Hadoop

bull High Performance Text Mining - Text mining using a GUI or code in an MPP environment including Hadoop

Natural Language Processing

bull Contextual Analysis - Add structure to unstructured text using a GUI

bull Sentiment Analysis - Extract sentiment from text using a GUI

bull Text Miner - Text mining using a GUI or code

Demos and Scripts

bull ML_Tables - Concise cheat sheets containing machine learning best practices

bull enlighten-apply - Example code and materials that illustrate applications of SAS machine learning techniques

bull enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R

bull enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in SAS

bull dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics

21.24 Scala

Natural Language Processing

bull ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries

bull Breeze - Breeze is a numerical processing library for Scala

bull Chalk - Chalk is a natural language processing library

bull FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters, and performing inference

Data Analysis / Data Visualization

bull MLlib in Apache Spark - Distributed machine learning library in Spark

bull Hydrosphere Mist - A service for deploying Apache Spark MLlib machine learning models as realtime, batch or reactive web services

bull Scalding - A Scala API for Cascading

bull Summing Bird - Streaming MapReduce with Scalding and Storm

bull Algebird - Abstract Algebra for Scala

bull xerial - Data management utilities for Scala

bull simmer - Reduce your data A unix filter for algebird-powered aggregation

bull PredictionIO - PredictionIO a machine learning server for software developers and data engineers

bull BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis

bull Wolfe - Declarative Machine Learning

bull Flink - Open source platform for distributed stream and batch data processing

bull Spark Notebook - Interactive and Reactive Data Science using Scala and Spark

General-Purpose Machine Learning

bull Conjecture - Scalable Machine Learning in Scalding

bull brushfire - Distributed decision tree ensemble learning in Scala

bull ganitha - scalding powered machine learning

bull adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed

bull bioscala - Bioinformatics for the Scala programming language

bull BIDMach - CPU and GPU-accelerated Machine Learning Library

bull Figaro - a Scala library for constructing probabilistic models

bull H2O Sparkling Water - H2O and Spark interoperability

bull FlinkML in Apache Flink - Distributed machine learning library in Flink

bull DynaML - Scala Library/REPL for Machine Learning Research

bull Saul - Flexible Declarative Learning-Based Programming

bull SwiftLearner - Simply written algorithms to help study ML or write your own implementations

21.25 Swift

General-Purpose Machine Learning

bull Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift

bull BrainCore - The iOS and OS X neural network framework

bull swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development

bull DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems

bull AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians

bull MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression and Ridge Regression

bull Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc.

CHAPTER 22

Papers

bull Machine Learning

bull Deep Learning

ndash Understanding

ndash Optimization / Training Techniques

ndash Unsupervised / Generative Models

ndash Image Segmentation / Object Detection

ndash Image / Video

ndash Natural Language Processing

ndash Speech / Other

ndash Reinforcement Learning

ndash New papers

ndash Classic Papers

22.1 Machine Learning

Be the first to contribute

22.2 Deep Learning

Forked from terryum's awesome deep learning papers

22.2.1 Understanding

bull Distilling the knowledge in a neural network (2015) G Hinton et al [pdf]

bull Deep neural networks are easily fooled High confidence predictions for unrecognizable images (2015) A Nguyen et al [pdf]

bull How transferable are features in deep neural networks (2014) J Yosinski et al [pdf]

bull CNN features off-the-Shelf An astounding baseline for recognition (2014) A Razavian et al [pdf]

bull Learning and transferring mid-Level image representations using convolutional neural networks (2014) M Oquab et al [pdf]

bull Visualizing and understanding convolutional networks (2014) M Zeiler and R Fergus [pdf]

bull Decaf A deep convolutional activation feature for generic visual recognition (2014) J Donahue et al [pdf]

22.2.2 Optimization / Training Techniques

bull Batch normalization Accelerating deep network training by reducing internal covariate shift (2015) S Ioffe and C Szegedy [pdf]

bull Delving deep into rectifiers Surpassing human-level performance on imagenet classification (2015) K He et al [pdf]

bull Dropout A simple way to prevent neural networks from overfitting (2014) N Srivastava et al [pdf]

bull Adam A method for stochastic optimization (2014) D Kingma and J Ba [pdf]

bull Improving neural networks by preventing co-adaptation of feature detectors (2012) G Hinton et al [pdf]

bull Random search for hyper-parameter optimization (2012) J Bergstra and Y Bengio [pdf]

22.2.3 Unsupervised / Generative Models

bull Pixel recurrent neural networks (2016) A Oord et al [pdf]

bull Improved techniques for training GANs (2016) T Salimans et al [pdf]

bull Unsupervised representation learning with deep convolutional generative adversarial networks (2015) A Radford et al [pdf]

bull DRAW A recurrent neural network for image generation (2015) K Gregor et al [pdf]

bull Generative adversarial nets (2014) I Goodfellow et al [pdf]

bull Auto-encoding variational Bayes (2013) D Kingma and M Welling [pdf]

bull Building high-level features using large scale unsupervised learning (2013) Q Le et al [pdf]

22.2.4 Image Segmentation / Object Detection

bull You only look once Unified real-time object detection (2016) J Redmon et al [pdf]

bull Fully convolutional networks for semantic segmentation (2015) J Long et al [pdf]

bull Faster R-CNN Towards Real-Time Object Detection with Region Proposal Networks (2015) S Ren et al [pdf]

bull Fast R-CNN (2015) R Girshick [pdf]

bull Rich feature hierarchies for accurate object detection and semantic segmentation (2014) R Girshick et al [pdf]

bull Semantic image segmentation with deep convolutional nets and fully connected CRFs L Chen et al [pdf]

bull Learning hierarchical features for scene labeling (2013) C Farabet et al [pdf]

22.2.5 Image / Video

bull Image Super-Resolution Using Deep Convolutional Networks (2016) C Dong et al [pdf]

bull A neural algorithm of artistic style (2015) L Gatys et al [pdf]

bull Deep visual-semantic alignments for generating image descriptions (2015) A Karpathy and L Fei-Fei [pdf]

bull Show attend and tell Neural image caption generation with visual attention (2015) K Xu et al [pdf]

bull Show and tell A neural image caption generator (2015) O Vinyals et al [pdf]

bull Long-term recurrent convolutional networks for visual recognition and description (2015) J Donahue et al [pdf]

bull VQA Visual question answering (2015) S Antol et al [pdf]

bull DeepFace Closing the gap to human-level performance in face verification (2014) Y Taigman et al [pdf]

bull Large-scale video classification with convolutional neural networks (2014) A Karpathy et al [pdf]

bull DeepPose Human pose estimation via deep neural networks (2014) A Toshev and C Szegedy [pdf]

bull Two-stream convolutional networks for action recognition in videos (2014) K Simonyan et al [pdf]

bull 3D convolutional neural networks for human action recognition (2013) S Ji et al [pdf]

22.2.6 Natural Language Processing

bull Neural Architectures for Named Entity Recognition (2016) G Lample et al [pdf]

bull Exploring the limits of language modeling (2016) R Jozefowicz et al [pdf]

bull Teaching machines to read and comprehend (2015) K Hermann et al [pdf]

bull Effective approaches to attention-based neural machine translation (2015) M Luong et al [pdf]

bull Conditional random fields as recurrent neural networks (2015) S Zheng and S Jayasumana [pdf]

bull Memory networks (2014) J Weston et al [pdf]

bull Neural turing machines (2014) A Graves et al [pdf]

bull Neural machine translation by jointly learning to align and translate (2014) D Bahdanau et al [pdf]

bull Sequence to sequence learning with neural networks (2014) I Sutskever et al [pdf]

bull Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014) K Cho et al [pdf]

bull A convolutional neural network for modeling sentences (2014) N Kalchbrenner et al [pdf]

bull Convolutional neural networks for sentence classification (2014) Y Kim [pdf]

bull Glove Global vectors for word representation (2014) J Pennington et al [pdf]

bull Distributed representations of sentences and documents (2014) Q Le and T Mikolov [pdf]

bull Distributed representations of words and phrases and their compositionality (2013) T Mikolov et al [pdf]

bull Efficient estimation of word representations in vector space (2013) T Mikolov et al [pdf]

bull Recursive deep models for semantic compositionality over a sentiment treebank (2013) R Socher et al [pdf]

bull Generating sequences with recurrent neural networks (2013) A Graves [pdf]

22.2.7 Speech / Other

bull End-to-end attention-based large vocabulary speech recognition (2016) D Bahdanau et al [pdf]

bull Deep speech 2 End-to-end speech recognition in English and Mandarin (2015) D Amodei et al [pdf]

bull Speech recognition with deep recurrent neural networks (2013) A Graves [pdf]

bull Deep neural networks for acoustic modeling in speech recognition The shared views of four research groups (2012) G Hinton et al [pdf]

bull Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G Dahl et al [pdf]

bull Acoustic modeling using deep belief networks (2012) A Mohamed et al [pdf]

22.2.8 Reinforcement Learning

bull End-to-end training of deep visuomotor policies (2016) S Levine et al [pdf]

bull Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016) S Levine et al [pdf]

bull Asynchronous methods for deep reinforcement learning (2016) V Mnih et al [pdf]

bull Deep Reinforcement Learning with Double Q-Learning (2016) H Hasselt et al [pdf]

bull Mastering the game of Go with deep neural networks and tree search (2016) D Silver et al [pdf]

bull Continuous control with deep reinforcement learning (2015) T Lillicrap et al [pdf]

bull Human-level control through deep reinforcement learning (2015) V Mnih et al [pdf]

bull Deep learning for detecting robotic grasps (2015) I Lenz et al [pdf]

bull Playing atari with deep reinforcement learning (2013) V Mnih et al [pdf]

22.2.9 New papers

bull Deep Photo Style Transfer (2017) F Luan et al [pdf]

bull Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017) T Salimans et al [pdf]

bull Deformable Convolutional Networks (2017) J Dai et al [pdf]

bull Mask R-CNN (2017) K He et al [pdf]

bull Learning to discover cross-domain relations with generative adversarial networks (2017) T Kim et al [pdf]

bull Deep voice Real-time neural text-to-speech (2017) S Arik et al [pdf]

bull PixelNet Representation of the pixels by the pixels and for the pixels (2017) A Bansal et al [pdf]

bull Batch renormalization Towards reducing minibatch dependence in batch-normalized models (2017) S Ioffe [pdf]

bull Wasserstein GAN (2017) M Arjovsky et al [pdf]

bull Understanding deep learning requires rethinking generalization (2017) C Zhang et al [pdf]

bull Least squares generative adversarial networks (2016) X Mao et al [pdf]

22.2.10 Classic Papers

bull An analysis of single-layer networks in unsupervised feature learning (2011) A Coates et al [pdf]

bull Deep sparse rectifier neural networks (2011) X Glorot et al [pdf]

bull Natural language processing (almost) from scratch (2011) R Collobert et al [pdf]

bull Recurrent neural network based language model (2010) T Mikolov et al [pdf]

bull Stacked denoising autoencoders Learning useful representations in a deep network with a local denoising criterion (2010) P Vincent et al [pdf]

bull Learning mid-level features for recognition (2010) Y Boureau [pdf]

bull A practical guide to training restricted boltzmann machines (2010) G Hinton [pdf]

bull Understanding the difficulty of training deep feedforward neural networks (2010) X Glorot and Y Bengio [pdf]

bull Why does unsupervised pre-training help deep learning (2010) D Erhan et al [pdf]

bull Learning deep architectures for AI (2009) Y Bengio [pdf]

bull Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009) H Lee et al [pdf]

bull Greedy layer-wise training of deep networks (2007) Y Bengio et al [pdf]

bull A fast learning algorithm for deep belief nets (2006) G Hinton et al [pdf]

bull Gradient-based learning applied to document recognition (1998) Y LeCun et al [pdf]

bull Long short-term memory (1997) S Hochreiter and J Schmidhuber [pdf]

CHAPTER 23

Other Content

Books, blogs, courses and more, forked from josephmisiti's awesome machine learning

bull Blogs

ndash Data Science

ndash Machine learning

ndash Math

bull Books

ndash Machine learning

ndash Deep learning

ndash Probability & Statistics

ndash Linear Algebra

bull Courses

bull Podcasts

bull Tutorials

23.1 Blogs

23.1.1 Data Science

bull https://jeremykun.com

bull http://iamtrask.github.io

bull http://blog.explainmydata.com

bull http://andrewgelman.com

bull http://simplystatistics.org

bull http://www.evanmiller.org

bull http://jakevdp.github.io

bull http://blog.yhat.com

bull http://wesmckinney.com

bull http://www.overkillanalytics.net

bull http://newton.cx/~peter

bull http://mbakker7.github.io/exploratory_computing_with_python

bull https://sebastianraschka.com/blog/index.html

bull http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

bull http://colah.github.io

bull http://www.thomasdimson.com

bull http://blog.smellthedata.com

bull https://sebastianraschka.com

bull http://dogdogfish.com

bull http://www.johnmyleswhite.com

bull http://drewconway.com/zia

bull http://bugra.github.io

bull http://opendata.cern.ch

bull https://alexanderetz.com

bull http://www.sumsar.net

bull https://www.countbayesie.com

bull http://blog.kaggle.com

bull http://www.danvk.org

bull http://hunch.net

bull http://www.randalolson.com/blog

bull https://www.johndcook.com/blog/r_language_for_programmers

bull http://www.dataschool.io

23.1.2 Machine learning

bull OpenAI

bull Distill

bull Andrej Karpathy Blog

bull Colah's Blog

bull WildML

bull FastML

bull TheMorningPaper

23.1.3 Math

bull http://www.sumsar.net

bull http://allendowney.blogspot.ca

bull https://healthyalgorithms.com

bull https://petewarden.com

bull http://mrtz.org/blog

23.2 Books

23.2.1 Machine learning

bull Real World Machine Learning [Free Chapters]

bull An Introduction To Statistical Learning - Book + R Code

bull Elements of Statistical Learning - Book

bull Probabilistic Programming & Bayesian Methods for Hackers - Book + IPython Notebooks

bull Think Bayes - Book + Python Code

bull Information Theory, Inference and Learning Algorithms

bull Gaussian Processes for Machine Learning

bull Data Intensive Text Processing w/ MapReduce

bull Reinforcement Learning - An Introduction

bull Mining Massive Datasets

bull A First Encounter with Machine Learning

bull Pattern Recognition and Machine Learning

bull Machine Learning & Bayesian Reasoning

bull Introduction to Machine Learning - Alex Smola and SVN Vishwanathan

bull A Probabilistic Theory of Pattern Recognition

bull Introduction to Information Retrieval

bull Forecasting principles and practice

bull Practical Artificial Intelligence Programming in Java

bull Introduction to Machine Learning - Amnon Shashua

bull Reinforcement Learning

bull Machine Learning

bull A Quest for AI

bull Introduction to Applied Bayesian Statistics and Estimation for Social Scientists - Scott M Lynch

bull Bayesian Modeling, Inference and Prediction

bull A Course in Machine Learning

bull Machine Learning, Neural and Statistical Classification

bull Bayesian Reasoning and Machine Learning Book+MatlabToolBox

bull R Programming for Data Science

bull Data Mining - Practical Machine Learning Tools and Techniques Book

23.2.2 Deep learning

bull Deep Learning - An MIT Press book

bull Coursera Course Book on NLP

bull NLTK

bull NLP w/ Python

bull Foundations of Statistical Natural Language Processing

bull An Introduction to Information Retrieval

bull A Brief Introduction to Neural Networks

bull Neural Networks and Deep Learning

23.2.3 Probability & Statistics

bull Think Stats - Book + Python Code

bull From Algorithms to Z-Scores - Book

bull The Art of R Programming

bull Introduction to statistical thought

bull Basic Probability Theory

bull Introduction to probability - By Dartmouth College

bull Principle of Uncertainty

bull Probability & Statistics Cookbook

bull Advanced Data Analysis From An Elementary Point of View

bull Introduction to Probability - Book and course by MIT

bull The Elements of Statistical Learning: Data Mining, Inference and Prediction - Book

bull An Introduction to Statistical Learning with Applications in R - Book

bull Learning Statistics Using R

bull Introduction to Probability and Statistics Using R - Book

bull Advanced R Programming - Book

bull Practical Regression and Anova using R - Book

bull R practicals - Book

bull The R Inferno - Book

23.2.4 Linear Algebra

bull Linear Algebra Done Wrong

bull Linear Algebra Theory and Applications

bull Convex Optimization

bull Applied Numerical Computing

bull Applied Numerical Linear Algebra

23.3 Courses

bull CS231n Convolutional Neural Networks for Visual Recognition Stanford University

bull CS224d Deep Learning for Natural Language Processing Stanford University

bull Oxford Deep NLP 2017 Deep Learning for Natural Language Processing University of Oxford

bull Artificial Intelligence (Columbia University) - free

bull Machine Learning (Columbia University) - free

bull Machine Learning (Stanford University) - free

bull Neural Networks for Machine Learning (University of Toronto) - free

bull Machine Learning Specialization (University of Washington) - Courses: Machine Learning Foundations: A Case Study Approach, Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering & Retrieval, Machine Learning: Recommender Systems & Dimensionality Reduction, Machine Learning Capstone: An Intelligent Application with Deep Learning; free

bull Machine Learning Course (2014-15 session) (by Nando de Freitas, University of Oxford) - Lecture slides and video recordings

bull Learning from Data (by Yaser S Abu-Mostafa, Caltech) - Lecture videos available

23.4 Podcasts

bull The O'Reilly Data Show

bull Partially Derivative

bull The Talking Machines

bull The Data Skeptic

bull Linear Digressions

bull Data Stories

bull Learning Machines 101

bull Not So Standard Deviations

bull TWIMLAI

23.5 Tutorials

Be the first to contribute

CHAPTER 24

Contribute

Become a contributor! Check out our GitHub for more information.
