The Human Algorithm: Automating Startup Data Collection at Mattermark

Sarah Catanzaro, Head of Data at Mattermark @sarahcat21

#datapointlive

Upload: janessa-lantz

Post on 09-Feb-2017

#DPL15 | @sarahcat21

Mattermark

Mattermark is a deal intelligence platform and private company database used by

● investors
● business and corporate development
● sales

THE CHALLENGE

Scale + Information Overload + Stealth

Scale

Over 125 million private companies in the world (only about 45.5 thousand public).

Information overload

Stealth

● Private companies do not have strong incentives (e.g. legal obligations) to share data. Many may have competitive incentives to obfuscate information.

● Investors may request non-disclosure.

Mattermark’s Solution

Software-oriented approach

● A must, due to the scale of our dataset
  ○ 1.3 million companies
  ○ 16.5k investors
  ○ 110k funding events

● Leverage a lean data team

Data collection strategy

● Web scraping
● Machine learning
● Direct submission
● Manual data entry

The “Human Algorithm”

Investors ask questions like

What startups might raise capital in the next 6 months?

What startups is Stephanie Palmeri investing in?

Our data analysts seek to understand:

● Why does this question matter?
● What data is required to answer this question?
● Where can this data be accessed?

Next, data analysts:

1. Define repeatable processes for data collection.
2. Determine whether processes can be replicated through web scraping and/or machine learning algorithms to collect data at scale.
3. Write functional specifications, reviewed by sales and engineering team members.

Next, web and/or machine learning engineers:

1. Write dev designs, reviewed by data analysts.
2. Upon implementation and marketing release, this data becomes available to customers.
3. New questions arise and the cycle starts again.

Funding Automation

Investors ask questions like

How much funding has a company already raised?

Who were the investors in each of those rounds?

Problems with existing sources

Rely on wiki-style data collection (cannot confirm the credibility of sources)

News reports are better, but

● facts are harder to extract
● different sources report different figures

Solution: funding automation

A new framework for collecting and synthesizing funding data.

1. News article fact extraction (machine learning)
2. Funding override system (web engineering)
3. Funding confirmation email campaign (marketing)

2. News article fact extraction

Crawl RSS feeds, extract data from stories (title, text, links, etc.)

● 750+ sources
● 5,000 - 10,000 articles
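To make the crawl step concrete, here is a minimal sketch of pulling title, link, and text out of an RSS 2.0 feed with Python's standard library. The sample feed, the URLs, and the `parse_feed` helper are illustrative assumptions, not Mattermark's crawler; a real crawler would also fetch each feed over HTTP and follow the links to the full article text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real crawler would download this from each source.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Startup News</title>
    <item>
      <title>Acme raises $5M Series A</title>
      <link>https://example.com/acme-series-a</link>
      <description>Acme announced a $5M Series A led by Example Ventures.</description>
    </item>
    <item>
      <title>Ten tips for remote teams</title>
      <link>https://example.com/remote-tips</link>
      <description>Advice on running a distributed company.</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per <item> with its title, link, and summary text."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
        })
    return stories


stories = parse_feed(SAMPLE_FEED)
for story in stories:
    print(story["title"], "->", story["link"])
```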

2. News article fact extraction

Classify stories about funding

● 250 articles/day
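The slides do not say which classifier is used, so the sketch below swaps in a simple keyword score as a stand-in for a trained model. The cue words, weights, and threshold are all hypothetical; the point is only to show the shape of the step, where each crawled story is reduced to a yes/no "is this about funding?" decision.

```python
import re

# Hypothetical cue words and weights; a trained classifier would learn
# these from labeled articles instead of hard-coding them.
FUNDING_CUES = {
    "raises": 2, "raised": 2, "funding": 2, "series": 1,
    "seed": 1, "round": 1, "investors": 1, "led": 1,
}


def funding_score(text):
    """Sum the weights of the cue words that appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(weight for cue, weight in FUNDING_CUES.items() if cue in words)


def is_funding_story(title, body, threshold=3):
    """Flag a story as funding-related when its score reaches the threshold."""
    return funding_score(title + " " + body) >= threshold


print(is_funding_story(
    "Acme raises $5M Series A",
    "Acme announced a $5M Series A led by Example Ventures.",
))
print(is_funding_story(
    "Ten tips for remote teams",
    "Advice on running a distributed company.",
))
```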

2. News article fact extraction

● Identify sentences containing information about investors, amount, and/or series
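One way to picture this step is a pass that splits an article into sentences and tags each one with the kinds of funding facts it appears to mention. The regular expressions and the naive sentence splitter below are assumptions for illustration only; the approach in the talk is described as machine learning, which this rule-based sketch does not reproduce.

```python
import re

# Hypothetical patterns, one per fact type named on the slide.
PATTERNS = {
    "amount": re.compile(r"\$\d[\d,.]*"),
    "series": re.compile(r"\b(?:seed|series\s+[A-H])\b", re.IGNORECASE),
    "investors": re.compile(
        r"\b(?:led by|backed by|participation from)\b", re.IGNORECASE
    ),
}


def tag_sentences(text):
    """Return (sentence, labels) pairs for sentences that mention a funding fact."""
    tagged = []
    # Naive split on sentence-ending punctuation followed by whitespace;
    # abbreviations such as "Inc." would need a proper sentence tokenizer.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        labels = {name for name, pattern in PATTERNS.items() if pattern.search(sentence)}
        if labels:
            tagged.append((sentence, labels))
    return tagged


article = (
    "Acme makes widgets. "
    "The company raised $5 million in a Series A round led by Example Ventures. "
    "It plans to hire engineers."
)
for sentence, labels in tag_sentences(article):
    print(sorted(labels), sentence)
```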

2. News article fact extraction

● Extract facts
● Match companies and investors to entities in our database
  ○ 30% of extracted articles are entered automatically
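The matching step can be sketched as name normalization followed by a fuzzy lookup, with anything below a confidence cutoff left unmatched rather than entered automatically. The company table, the suffix list, the cutoff, and the `match_company` helper are hypothetical; the slides do not describe how matching is actually done.

```python
import difflib
import re

# Hypothetical slice of a company table: normalized name -> record id.
KNOWN_COMPANIES = {"acme": 101, "globex": 102, "initech": 103}

# Common legal suffixes to ignore when comparing names (illustrative list).
SUFFIX_RE = re.compile(r"\b(?:inc|llc|ltd|corp|co)\b\.?", re.IGNORECASE)


def normalize(name):
    """Lowercase the name, drop legal suffixes and punctuation, collapse spaces."""
    name = SUFFIX_RE.sub("", name.lower())
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return " ".join(name.split())


def match_company(name, cutoff=0.85):
    """Return the record id of the best match, or None if no match is confident."""
    key = normalize(name)
    if key in KNOWN_COMPANIES:
        return KNOWN_COMPANIES[key]
    close = difflib.get_close_matches(key, list(KNOWN_COMPANIES), n=1, cutoff=cutoff)
    return KNOWN_COMPANIES[close[0]] if close else None


print(match_company("Acme, Inc."))  # exact match after normalization
print(match_company("Initec"))      # near miss resolved by fuzzy matching
print(match_company("Umbrella"))    # no confident match
```

A strict cutoff trades coverage for precision, which fits the slide's point that only a share of extracted articles are entered automatically.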

1. Funding override system

● Identify reports about the same funding event
● Combine information from multiple reports using the Wongi rules engine
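The two bullets above can be sketched in plain Python as a grouping pass followed by a per-field override rule. This does not use or imitate the Wongi engine's API; the source ranking, the 30-day window, the field names, and the sample reports are all assumptions chosen to illustrate the idea that a more trusted report overrides a less trusted one.

```python
from datetime import date

# Hypothetical source ranking: higher numbers are trusted more.
SOURCE_PRIORITY = {"founder_confirmation": 3, "news": 2, "wiki": 1}


def same_event(a, b, window_days=30):
    """Treat two reports as one event if company and series match and dates are close."""
    return (
        a["company_id"] == b["company_id"]
        and a["series"] == b["series"]
        and abs((a["date"] - b["date"]).days) <= window_days
    )


def group_reports(reports):
    """Greedily cluster reports that describe the same funding event."""
    groups = []
    for report in sorted(reports, key=lambda r: r["date"]):
        for group in groups:
            if same_event(group[0], report):
                group.append(report)
                break
        else:
            groups.append([report])
    return groups


def combine(group):
    """For each field, take the value from the most trusted report that has one."""
    ranked = sorted(group, key=lambda r: SOURCE_PRIORITY[r["source"]], reverse=True)
    merged = {}
    for field in ("company_id", "series", "date", "amount_usd"):
        merged[field] = next((r[field] for r in ranked if r.get(field) is not None), None)
    # Investors are unioned rather than overridden, since reports may list subsets.
    merged["investors"] = sorted({name for r in group for name in r.get("investors", [])})
    return merged


reports = [
    {"company_id": 101, "series": "A", "date": date(2015, 3, 2), "amount_usd": 5_000_000,
     "investors": ["Example Ventures"], "source": "news"},
    {"company_id": 101, "series": "A", "date": date(2015, 3, 5), "amount_usd": 5_500_000,
     "investors": ["Example Ventures", "Sample Capital"], "source": "founder_confirmation"},
    {"company_id": 102, "series": "Seed", "date": date(2015, 3, 3), "amount_usd": None,
     "investors": [], "source": "wiki"},
]
groups = group_reports(reports)
merged = combine(groups[0])
print(len(groups), merged["amount_usd"], merged["investors"])
```

Keeping the rules as data (the priority table) rather than branching code is one reason a rules engine is attractive here: analysts can reason about precedence without reading the merge logic.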

3. Funding confirmation email campaign

Use CRM and HubSpot to automatically send emails to founders after an equity financing.

What We Learned

Where we struggled

Our initial implementation of a funding override system was inefficient. Why?

Because our data analysts and developers were not aligned on functional requirements.

Solution

● Analysts must work closely with developers
  ○ Pre-spec check-ins
  ○ Analysts review dev designs to ensure that the system design addresses the use case.
● Analysts must avoid being prescriptive
● Analysts must understand data mining and machine learning concepts

Where we succeeded

Implementation of news article fact extraction was successful. Why?

Because data analysts and developers worked as service providers to each other.

How We Did It

1. Tighter Analyst + Dev Communication

Tiger teams: 1 ML developer, 1 web/infrastructure developer, 1 data analyst, 1 project lead

Define milestones & hold daily stand-ups.

3. Track II interactions reinforce the symbiotic relationship

● Devs lead Python learning group
● Data analysts hold seminars on topics like admin tooling and alternative assets

Thank You!