[srijan wednesday webinars] from data management to data analysis pipelines

From Data Management to Data Analysis Pipelines Open source based architectures to get the job done Young-Jin Kim [email protected]

Upload: srijan-technologies

Post on 15-Jul-2015


Category:

Data & Analytics



TRANSCRIPT

From Data Management to Data Analysis Pipelines

Open source based architectures to get the job done

Young-Jin Kim [email protected]

What will be covered

@srijan #SrijanWW

1. Common Data Management Problems in NPOs and NGOs

2. Obstacles for Data Analysis rooted in Data Management Practices

3. Start with "Why?" not with "How?"

4. You have more data than you think

5. How to miss the on-ramp to Data Analysis by going top-down

6. Getting to Data Analysis from the ground-up using pipelines

7. Some useful data infrastructure architectures

8. Some useful tools to build the data pipelines

9. Questions

Data is the New Oil

collect → extract → refine → exploit

top-down mechanistic view


If "Data is the New Oil"

Then most of the NPO/NGO sector is pumping it by hand

and isn't refining it from crude to create greater value.


Everyday Data Battles of NPOs/NGOs

Many types of NPOs and NGOs share similar pain points when managing mission-critical data on programs, clients, donors and volunteers:

● Ad hoc, non-uniform data collection tools across the organization

● Managing data is time-consuming and inefficient

● Difficulty tracking NPO staff-client interactions over time

● Organization has high turnover in staff/volunteers/clients

● People cannot update their own contact information

● Missing linkage between real-world entities due to duplicates

● Problems syncing data between local on-the-ground efforts and the national umbrella organization

Data Silos → Blocked Data Flows

● Donor Fundraising Software

● Program Data Spreadsheets

● Event Ticketing System

● Contacts Database / CRM

● Web & Email Marketing

How NPOs Manage Data

...directly leads to...

isolated data silos

impossible to perform data analysis

Obstacles to Data Analysis

Operational Data Stores (ODS) without proper governance, integration and tools will lack data flows and pose serious obstacles to data analysis for the organization.

● ODS → data silos → blocked data flows

○ Missing integrations into a unified Database of Record

○ Weak or missing data governance rules and policies

● Data quality issues in each ODS

○ Lack of data hygiene and quality assurance policies

○ Missing entity resolution within an ODS and across ODSs

● No data strategy leads to ad hoc tactical technology stopgaps

"We need the new proprietary XYZ system now, we will work out if/how XYZ integrates with our current systems later..."


Top-down favors "How?" not "Why?"

Which leader of an organization doesn't want the latest and greatest, fashionable "How?" answers, buzzwords or products:

● Data Lake to replace the Enterprise Data Warehouse

● Hadoop/Spark Cluster for Streaming Big Data Processing

● Predictive Analytics Platform for Decision Support

● Drag-and-drop self-service visualizations and drill-downs

● Business Intelligence Platform with A/B testing

Start with "Why?" to avoid "cargo cult" data science, which is usually due to top-down mandates by leadership to become more of a data-driven organization. Putting in place all the "How?" answers and systems never fully answers the "Why?"

Dangers of "How?" ahead of "Why?"

"Ceci n'est pas un phone." (This is not a phone.)

Fallacy of "cargo cult" data science

Invest in and build the latest and greatest data systems, and the rich insights and data-driven decision making will spew forth from the systems in deus ex machina style.

Data Pipelines and Food Preparation

Raw Data (unwieldy) → Software Systems (clean, refine, transform) → Insights (actionable)

Raw Ingredients (inedible) → Cooking Techniques (clean, cut, prepare) → Delicious Dish (enjoyable)
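The clean, refine, transform stages can be sketched as a chain of small functions, each taking the previous stage's output. This is a minimal illustration and not code from the talk; the record fields (`name`, `email`), the stage functions and the sample rows are all hypothetical.

```python
from functools import reduce

def clean(records):
    """Trim whitespace and drop records without an email address."""
    cleaned = []
    for rec in records:
        rec = {key: value.strip() for key, value in rec.items()}
        if rec.get("email"):
            cleaned.append(rec)
    return cleaned

def refine(records):
    """Normalize values so that equal things compare equal."""
    return [
        {**rec, "email": rec["email"].lower(), "name": rec["name"].title()}
        for rec in records
    ]

def transform(records):
    """Aggregate into a small summary: contacts per email domain."""
    counts = {}
    for rec in records:
        domain = rec["email"].split("@")[-1]
        counts[domain] = counts.get(domain, 0) + 1
    return counts

def run_pipeline(records, stages=(clean, refine, transform)):
    """Feed the output of each stage into the next one."""
    return reduce(lambda data, stage: stage(data), stages, records)

# Made-up raw data, as it might come out of a spreadsheet export.
raw = [
    {"name": " ada lovelace ", "email": "Ada@Example.org "},
    {"name": "no email", "email": "  "},
    {"name": "grace hopper", "email": "grace@example.org"},
]
insights = run_pipeline(raw)
print(insights)
```

Keeping each stage a plain function makes it easy to add, reorder or test stages one at a time as the pipeline grows.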

You likely have more data than you think

Take a data inventory:

● the "obvious" data sources: what you're probably collecting already (say, what's in your CRM, event attendance lists)

● the less-obvious data sources:

○ not collecting something you could: leaving the data on the floor (data exhaust)

○ collecting something, but then throwing it away: webserver logs

● don't collect everything

○ over time you may even forget why it's there (or why it's important), making cleanup difficult

○ the less data you store, the lower your risk exposure if there is a break-in
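As an example of data that is often collected and then thrown away, webserver logs can be summarized with a few lines of Python. The sketch below assumes the logs are in Common Log Format; the sample lines, paths and the function name `count_successful_hits` are invented for illustration.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)$'
)

def count_successful_hits(lines):
    """Count requests per path, keeping only responses with status 200."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match and match.group("status") == "200":
            hits[match.group("path")] += 1
    return hits

# Made-up log lines; the last one is deliberately malformed and is skipped.
sample_log = [
    '203.0.113.5 - - [15/Jul/2015:10:00:00 +0000] "GET /donate HTTP/1.1" 200 5120',
    '203.0.113.9 - - [15/Jul/2015:10:01:12 +0000] "GET /donate HTTP/1.1" 200 5120',
    '203.0.113.9 - - [15/Jul/2015:10:02:40 +0000] "GET /missing HTTP/1.1" 404 312',
    'not a log line',
]
page_hits = count_successful_hits(sample_log)
print(page_hits.most_common())
```

In line with the "don't collect everything" advice, this keeps only an aggregate count per page rather than the visitor addresses.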

Pathways to Data Analysis

Master Data Management (MDM) consists of processes, governance, policies, standards and tools that consistently define and manage the critical data of an organization to provide a single point of reference in a Database of Record (DBOR).

Master data management has the objective of providing processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing such data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of this information.

http://en.wikipedia.org/wiki/Master_data_management


"Fail Slow": use the top-down approach

Implementing MDM as a multi-year, top-down project with full requirements gathering in a waterfall-based approach will fail:

● takes a long time → very expensive

● weak support within the organization: perceived value is low

● project's scope keeps shifting and thus is never done

● new systems and bad data sets are added as the project progresses, so it never finishes

● insights are slow in coming since data analysis follows full MDM implementation

● MDM-first approach = analysis paralysis: you never have all the information to know which measure is valuable where


“Data is the new oil? No: Data is the new soil.”

– David McCandless

lay seeds → growing → stewardship → harvesting fruits

bottom-up organic view


Doing Data Analysis from the bottom-up

Implementing MDM from the bottom up in an agile, iterative process is preferred. Incremental refinement of data into an eventual MDM is more powerful. Here's why:

● faster insights from data by harvesting low-hanging fruit

● grow support within the organization: perceived value increases

● project is a work in progress, so its iterative nature is understood

● new systems and bad data sets are added and requirements shift; both are handled incrementally

● insights and data analysis steadily improve over time as the eventual MDM implementation nears full MDM

● MDM-eventually approach allows the organization's analytical capabilities to grow and is also more cost-effective
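One way to handle new and changed data incrementally is to load only the rows modified since the previous run. The following is a minimal sketch of that idea under stated assumptions: each source row carries an ISO-formatted `modified` timestamp and an `id`, and the target is a plain dictionary standing in for a database table. The function name `incremental_load` and the sample rows are hypothetical.

```python
from datetime import datetime

def incremental_load(source_rows, target, last_run):
    """Upsert only rows modified after last_run; return the new watermark."""
    watermark = last_run
    for row in source_rows:
        modified = datetime.fromisoformat(row["modified"])
        if modified <= last_run:
            continue  # already handled by an earlier run
        # Light cleansing happens as part of the load.
        target[row["id"]] = {"name": row["name"].strip(), "modified": row["modified"]}
        watermark = max(watermark, modified)
    return watermark

# Made-up source rows, e.g. exported from a program spreadsheet.
source = [
    {"id": 1, "name": "Food Bank ", "modified": "2015-07-01T09:00:00"},
    {"id": 2, "name": "Youth Program", "modified": "2015-07-10T12:30:00"},
]
target = {}

# First run: only rows changed after 5 July are loaded.
watermark = incremental_load(source, target, datetime(2015, 7, 5))

# Later, a row is updated at the source; the next run picks up just that change.
source.append({"id": 1, "name": "Food Bank East", "modified": "2015-07-12T08:00:00"})
watermark = incremental_load(source, target, watermark)
print(target, watermark)
```

Because each run starts from the stored watermark, a new source system or a corrected data set can be folded in without reloading everything.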


Data Architectures and Best Practices

Golden Record with Incremental Data Refining

Operationalize data insights early and often, which in turn incrementally aligns the organization around better data practices and organically builds data governance structures and policies.

[Diagram: Program Data, Events DB, CRM, CMS and Donor DB connect to a central Golden Record via incremental ETLs with cleansing, dedupe and record linkage]
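To make dedupe and record linkage concrete, here is a small sketch that merges contact records from several sources into golden records. It uses only the standard library (`difflib.SequenceMatcher`) as a stand-in for a dedicated entity-resolution tool; it is not the API of the Python Dedupe library. The field names, the 0.85 similarity threshold and the sample records are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def is_same_person(a, b, threshold=0.85):
    """Link two records if emails match or names are nearly identical."""
    email_a = normalize(a.get("email", ""))
    email_b = normalize(b.get("email", ""))
    if email_a and email_a == email_b:
        return True
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold

def merge(golden, incoming):
    """Keep existing golden values and fill gaps from the incoming record."""
    merged = dict(golden)
    for key, value in incoming.items():
        if key == "sources":
            continue
        if value and not merged.get(key):
            merged[key] = value
    merged["sources"] = sorted(set(golden["sources"]) | set(incoming["sources"]))
    return merged

def build_golden_records(records):
    """Fold each record into a matching golden record, or start a new one."""
    golden = []
    for rec in records:
        for i, existing in enumerate(golden):
            if is_same_person(existing, rec):
                golden[i] = merge(existing, rec)
                break
        else:
            golden.append(dict(rec))
    return golden

# Made-up records from three hypothetical source systems.
records = [
    {"name": "Maria Gonzalez", "email": "maria@example.org", "phone": "", "sources": ["crm"]},
    {"name": "Maria  Gonzales", "email": "", "phone": "555-0100", "sources": ["events_db"]},
    {"name": "John Smith", "email": "john@example.org", "phone": "", "sources": ["donor_db"]},
]
golden = build_golden_records(records)
for rec in golden:
    print(rec)
```

Comparing every record against every golden record does not scale to large data sets, and a fixed threshold will produce false matches on real names, so this should be read as an illustration of the idea rather than a production approach.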

Open Source Tools: build data pipelines

OpenDataKit: collect survey data on mobile devices

Drupal: CMS widely adopted by the NPO/NGO community

CiviCRM: open source CRM for the NPO/NGO sector

Pentaho Data Integration: powerful open source ETL tool

OpenRefine: for data cleansing

Python Dedupe Library: for entity resolution

Knime Analytics Platform: Machine Learning platform

Python Analytics Stack (ipython + Pandas + scikit-learn)

R-Studio: R-language IDE for statistical analysis and visualization

DC.JS: Dimensional Charting Visualizations (d3 + crossfilter)

Elasticsearch, Neo4j, MongoDB, Hadoop, Spark, PostGIS etc.

Open Source based Data Architecture

[Diagram: Program Data, Ticketing DB, CiviCRM (CRM), Drupal CMS and Raiser's Edge DB connect to a central Golden Record + REST API via incremental ETLs with cleansing, dedupe and record linkage; Data Analysis, Visualizations and Machine Learning sit on top]
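A read-only REST endpoint in front of the golden record could look roughly like the sketch below, which uses only Python's standard library. The `/contacts/<id>` route, the in-memory `GOLDEN_RECORDS` store and the sample record are hypothetical; the talk does not specify how the API is built.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory store standing in for the golden record database.
GOLDEN_RECORDS = {
    "1": {"id": "1", "name": "Maria Gonzalez", "sources": ["civicrm", "ticketing_db"]},
}

def handle_get(path):
    """Map a request path such as /contacts/1 to a (status, JSON body) pair."""
    parts = [part for part in path.split("/") if part]
    if len(parts) == 2 and parts[0] == "contacts":
        record = GOLDEN_RECORDS.get(parts[1])
        if record is not None:
            return 200, json.dumps(record)
    return 404, json.dumps({"error": "not found"})

class GoldenRecordHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around handle_get."""

    def do_GET(self):
        status, body = handle_get(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# The routing logic can be exercised without starting a server.
status, body = handle_get("/contacts/1")
print(status, body)
```

To try it over HTTP, the handler can be served with `http.server.HTTPServer(("localhost", 8000), GoldenRecordHandler).serve_forever()`. Keeping the lookup in a plain function separate from the handler makes it easy to test and to swap the dictionary for a real database later.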

Young-Jin Kim [email protected]

Thank You!

Take this conversation online by tweeting using the hashtag #SrijanWW