tamr | making enterprise elephants dance @ boston data festival

making enterprise elephants dance (gangnam style) Andy Palmer, Co-Founder & CEO Tamr

Upload: tamrinc

Post on 25-Jul-2015

TRANSCRIPT

making enterprise elephants dance (gangnam style)

Andy Palmer, Co-Founder & CEO Tamr

The View from 30,000 Feet … ok - from low earth orbit

The time has come to manage information across the enterprise for strategic benefit.

Be the “Googler” of your enterprise

Simply put: manage a company’s information as an asset - at least as well as Google tries to manage the world’s information as an asset

Assume your information assets are as diverse as the modern web - but not the same - data matters more than documents.

What does this mean?

However… MOST OF US ARE NOT GOOGLE in the quality and quantity of our engineering resources

Google makes it look easy sometimes because they have much of the best talent in the world

Data Silos are a primary bottleneck

Viz tools are democratizing analysis - D3.js, Tableau, Spotfire, etc.

“Big Data Mania” represents an opportunity to re-architect for flexibility + agility

Monolithic, hard-coded warehouses & ETL constrain experimentation, collaboration and agility

Entities do not have perfect definitions - don’t try to force it...

Static schemas/data structures are great for collection but have “drag coefficient” for analytics

Embrace data variety as a reality - leave the monolithic vendors pretending they can lock us in

Semantic approaches allow access to diverse data and agile integration to solve specific questions

Data marts should be available “on demand” using tech “@ target” that suits the analytic

“You can’t get there from here”: NOT Enterprise Data “Business as Usual”

Part of the Answer:

The 3 V’s?

important but...

…not enough...

Try this … Start with the Questions, not the Answer - “Analytic context will set you free”

● Ask aspirational/transformational analytic questions
● Use them as context for defining all the work you do
● Build your infrastructure to answer the analytical questions

In the process…
● Get a broad and dynamic inventory of all your data
● Match workload to appropriate engine/tech
● Use Distributed Systems - radically lower cost vs. traditional
● Expect modern and dynamic visualization - iterative vs. reporting
● Treat Cloud as a first-order resource - not just ancillary
● Modern DevOps - core capability
● JSON sources will proliferate... embrace it
● Bottom-up data/metadata management
● Internal and external data - both valuable but not the same
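A minimal sketch of what a "broad and dynamic inventory" built bottom-up could look like, assuming each source can be read as a list of JSON-like records. The source names, field names and the `build_inventory` helper are hypothetical and only illustrate the idea of discovering attributes from the data rather than from a predefined schema.

```python
from collections import defaultdict


def build_inventory(sources):
    """Scan records from every source and collect, per field name,
    which sources use it, which value types appear, and how often."""
    inventory = defaultdict(lambda: {"sources": set(), "types": set(), "count": 0})
    for source_name, records in sources.items():
        for record in records:
            for field, value in record.items():
                entry = inventory[field]
                entry["sources"].add(source_name)
                entry["types"].add(type(value).__name__)
                entry["count"] += 1
    return dict(inventory)


# Hypothetical sample data standing in for two enterprise silos.
sources = {
    "crm": [{"customer_name": "Acme Corp", "country": "US"}],
    "billing": [{"customer_name": "ACME Corporation", "amount": 1200.5}],
}

inventory = build_inventory(sources)
for field, info in sorted(inventory.items()):
    print(field, sorted(info["sources"]), sorted(info["types"]), info["count"])
```

Re-running the scan whenever a source changes keeps the inventory dynamic; nothing here requires agreeing on a schema up front.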

Start with the Questions, not the Answer… but sometimes it’s not simple… embrace the ambiguity…

Same but Different - Identity depends on the question:

● Gleevec, Glivec and Imatinib
● Same InChI Key
● Formulation vs. Substance
● Product versus compound
● Regional naming difference
● Canonicalization depends on context

InChI=1S/C29H31N7O/c1-21-5-10-25(18-27(21)34-29-31-13-11-26(33-29)24-4-3-12-30-19-24)32-28(37)23-8-6-22(7-9-23)20-36-16-14-35(2)15-17-36/h3-13,18-19H,14-17,20H2,1-2H3,(H,32,37)(H,31,33,34)
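The "same but different" point can be sketched in code: whether two names refer to the same entity depends on the question being asked. The lookup table, the two context names and the `canonicalize` helper below are hypothetical illustrations, not a real reference dataset or API.

```python
# Hypothetical reference table: each name maps to a substance-level
# and a product-level identity.
NAMES = {
    "gleevec": {"substance": "imatinib", "product": "Gleevec"},
    "glivec": {"substance": "imatinib", "product": "Glivec"},
    "imatinib": {"substance": "imatinib", "product": None},
}


def canonicalize(name, context):
    """Return the canonical form of a name for a given analytic context."""
    entry = NAMES[name.strip().lower()]
    if context == "substance":
        return entry["substance"]
    if context == "product":
        # A bare compound name has no product identity; fall back to the substance.
        return entry["product"] or entry["substance"]
    raise ValueError(f"unknown context: {context}")


def same_entity(a, b, context):
    """Two names are 'the same' only relative to a context."""
    return canonicalize(a, context) == canonicalize(b, context)


print(same_entity("Gleevec", "Glivec", "substance"))  # True
print(same_entity("Gleevec", "Glivec", "product"))    # False
```

A chemistry question treats all three names as one compound; a sales or regulatory question may need them kept apart.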

Pick a problem that is:

- greenfield
- well-defined
- valuable

DO NOT BOIL THE OCEAN

Great Viz has never been more accessible

Distributed Systems

For data science at scale, we can’t afford to pay the “enterprise IT tax”

Need to build an enterprise infrastructure as inexpensive, scalable and persistent as that of modern web companies

Mindset: Put tight spending limits on storage and systems infrastructure … and it will take you toward a place similar to the modern internet consumer companies - this is a good place :)

Facebook CIO talking about Vertica

The Cloud

A first-level citizen in the enterprise infrastructure

Fact...not opinion: The world’s largest high-performance computing and persistence infrastructure is available for you to rent on-demand

Let’s drop the hubris of on-prem enterprise data centers, much as we no longer generate our own electricity…

DevOps

DevOps matters as much for data as for software

DevOps is to the Cloud as Systems Management was to Client-Server computing

● Couldn’t live without Systems Management then
● Can’t live without DevOps now

Getting to scale (managing hundreds/thousands of machines) on demand requires automated tools and a modern DevOps infrastructure.

JSON

JSON is now a primary tool to access data

Ultimate evolution of relational and object-oriented technologies coming together

Provides a loose, flexible coupling between data access and applications

Definition of flexibility: As long as it’s JSON, we don’t need to care what’s behind it
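A small sketch of that loose coupling, assuming two hypothetical back ends (a tabular store and a document store) that both hand back JSON text. The function names and sample values are made up for illustration; the point is only that the consumer never needs to know which system produced the data.

```python
import json


def total_revenue(json_text):
    """Consumer: depends only on the JSON shape, not on the system behind it."""
    records = json.loads(json_text)
    return sum(record["revenue"] for record in records)


def from_warehouse():
    """Hypothetical stand-in for rows coming out of a relational store."""
    rows = [("east", 100.0), ("west", 250.0)]
    return json.dumps([{"region": region, "revenue": value} for region, value in rows])


def from_document_store():
    """Hypothetical stand-in for documents with extra, unmodeled fields."""
    docs = [{"region": "north", "revenue": 75.0, "tags": ["new"]}]
    return json.dumps(docs)


print(total_revenue(from_warehouse()))
print(total_revenue(from_document_store()))
```

Extra fields such as `tags` pass through without breaking the consumer, which is the flexibility the slide describes.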

Variety - how to tackle the enterprise data silo problem

Standardization and Aggregation are necessary but not sufficient to solve the challenges of Enterprise

Analytics 3.0

Bottom-Up + Top Down Data Modeling & “Collaborative Curation”

Time to embrace the reality of extreme data variety across the entire enterprise - “Unified Data”

Requires a bottom-up, probabilistic approach to data curation and integration (complementing the deterministic approach)

● mix of 80% probabilistic & 20% deterministic
● Tamr’s primary design pattern
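To make the deterministic/probabilistic mix concrete, here is a minimal sketch of record matching that applies an exact rule first and falls back to a similarity score. This is not Tamr's implementation: the `tax_id` field, the 0.8 threshold and the use of Python's standard `difflib.SequenceMatcher` as the scorer are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop commas/periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def match(a, b, threshold=0.8):
    """Return (label, score, method) for a pair of records."""
    # Deterministic rule: a shared identifier settles the question outright.
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return ("match", 1.0, "deterministic")
    # Probabilistic fallback: score name similarity; low scores go to a human.
    score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    label = "match" if score >= threshold else "review"
    return (label, score, "probabilistic")


# Hypothetical records from two silos.
print(match({"name": "Acme Corp.", "tax_id": "12-345"},
            {"name": "ACME Corporation", "tax_id": "12-345"}))
print(match({"name": "Acme Corp."}, {"name": "acme corp"}))
print(match({"name": "Acme Corp."}, {"name": "Zenith Ltd"}))
```

Pairs labeled `review` are where collaborative curation would come in: an expert's answer can be fed back to tune the threshold or the scorer.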

Back to the future:
● 1990s web: probabilistic search and website connection
● 2020s enterprise: probabilistic data source connection & curation

Internal and External Data

Internally and externally generated data are now BOTH important

If our orgs are going to become truly data-driven, we have to embrace external data

We need to get to the point that, a la Google, we don’t care where it comes from

Google Maps, for example
● Seamless integration of internal Google and external data
● And Google just doesn’t care

In Summary

● Manage your information as an asset

● Start with a broad inventory of all your data

● Embrace ambiguity/variety of enterprise data

● Throw the “one schema to rule them all” into the fires of Mordor…

● Embrace modern viz & iterative analytics

● Don’t ignore the Cloud - it’s inevitable

● DevOps is cool - and fun :)

● JSON is the future of data access - it’s ok

● True shared-nothing distributed systems are the only way out of the “Enterprise IT Tax”

Discussion