H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik Huddleston

Discovering Ontologies in PR Big Data

November 2015

Introduction

TrendKite is dedicated to helping PR professionals and agency teams quantify and optimize the impact of their PR efforts.
• Based in Downtown Austin, TX
• Raised $20.6 Million in Funding
• Hundreds of Marquee Clients Including: Nike, Hershey, Pinterest, and Memphis Grizzlies
• Over 75 Employees and Growing Fast!

Erik Huddleston CEO at TrendKite

Steve Vaughan Senior Software Architect at TrendKite

Solving PR Big Data Problem

TrendKite tracks a comprehensive set of metrics to accurately measure the impact PR is having on a customer's brand, website traffic, and business goals. Customers can create beautiful automated or customizable dashboards that filter through billions of pieces of data in seconds to help streamline their PR workflow.

Our platform analyzes:
• Over 1.2 Billion Articles
• Almost 2 Million Publishers
• 56 Languages

The Problem with PR Big Data

Complexity in the media landscape is accelerating at an alarming rate:
• Print is dead, making production cheap and distribution trivially easy
• Proliferation of media sources has created millions of highly targeted media outlets with very specific topical influence and audience, while the number of digital formats has increased dramatically
• Lifecycles of media outlets are often measured in weeks and months, with hundreds of sources created and destroyed daily

This makes it incredibly difficult to extract, from millions of unstructured online news publications, the metadata that is vital to understanding PR campaign success, such as:
• Date of Publication
• Author Information
• Total Readership
• Total Mentions
• Share of Voice
• Key Messages

Ontologies in PR Industry

Ontologies are models of entities and their properties that help define the relationships among them. They can enable PR and marketing professionals to harness semantic search and uniquely locate relevant news articles that are specific to their brand, while excluding the undesirable hits that standard keyword searches would include.

Leveraging PR Ontologies

PR Industry Ontology Enables:
• Finding Target the company versus the ocean of not-the-company "targets"
• Dealing with a company, brand, etc. that is named after a stop word
• Automated discovery of important, relevant news articles, even the ones that don't explicitly match human-sourced keywords
• Outbound publishing optimization
• Identification of significant and relevant publications, authors, executives, etc.
• Benchmarking the effectiveness of authors and publications on a topic-by-topic basis
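
As a rough illustration of the "Target the company versus not-the-company targets" problem, the sketch below models one ontology entity with properties and related terms, then accepts an article only when the entity name appears alongside enough related terms. The `Entity` class, the `is_about` helper, the `min_related` threshold, and the sample data are all hypothetical and are not TrendKite's implementation.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Entity:
    """One ontology node: a named thing, its properties, and related terms."""
    name: str
    kind: str
    properties: dict = field(default_factory=dict)
    related: set = field(default_factory=set)  # single-word context terms

def tokens(text):
    """Lowercase word set of a piece of text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def is_about(entity, article, min_related=2):
    """True if the article names the entity AND shares enough related terms.

    A plain keyword search would stop after the first check.
    """
    words = tokens(article)
    if entity.name.lower() not in words:
        return False
    shared = words & {term.lower() for term in entity.related}
    return len(shared) >= min_related

# Illustrative sample data only.
target = Entity(
    name="Target",
    kind="company",
    properties={"industry": "retail"},
    related={"retailer", "stores", "shoppers", "minneapolis"},
)

print(is_about(target, "Target opened new stores for holiday shoppers."))
print(is_about(target, "The team missed its quarterly sales target."))
```

The second article contains the keyword but none of the related terms, so it is rejected even though a keyword search would have returned it.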

Located the Obvious Metadata Standards

First, we picked the low-hanging fruit, using industry standards to target similarities across the web and extract relevant PR content, such as:

Open Graph - http://ogp.me/
A Facebook-originated open standard to mark pages with types (article, music, video, etc.) and properties (author, duration, actors, etc.)
Ex: <meta property="article:published_time" content="2015-10-31T09:30:00.000Z" />

Dublin Core - http://dublincore.org/
The Dublin Core Schema is a small set of vocabulary terms that can be used to describe web resources (video, images, web pages, etc.)
Ex: <meta name="dc.date" content="2015-10-31 09:30:00" />

Schema.org - https://schema.org/
Open, community-led vocabulary sponsored by Google, Microsoft, Yahoo, and Yandex.
Ex: <meta itemprop="datePublished" content="2015-10-31T09:30:00.000Z" />
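
A minimal sketch of how a publication date might be pulled from these three standards, using only the Python standard library. It looks for the Open Graph `article:published_time` property, the Schema.org `datePublished` itemprop, and the Dublin Core `dc.date` name, in that order. The priority order, the class name, and the function name are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed priority order: (attribute, value) pairs that mark a publication date.
DATE_KEYS = [
    ("property", "article:published_time"),  # Open Graph
    ("itemprop", "datePublished"),           # Schema.org
    ("name", "dc.date"),                     # Dublin Core
]

class MetaDateParser(HTMLParser):
    """Collects the content of <meta> tags that match one of DATE_KEYS."""

    def __init__(self):
        super().__init__()
        self.found = {}  # (attribute, value) -> first content seen

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        content = attr_map.get("content")
        if not content:
            return
        for attr, value in DATE_KEYS:
            if attr_map.get(attr, "").lower() == value.lower():
                self.found.setdefault((attr, value), content)

def extract_published_date(html):
    """Return the raw date string from the highest-priority standard, or None."""
    parser = MetaDateParser()
    parser.feed(html)
    for key in DATE_KEYS:
        if key in parser.found:
            return parser.found[key]
    return None

page = """
<html><head>
  <meta name="dc.date" content="2015-10-31 09:30:00" />
  <meta property="article:published_time" content="2015-10-31T09:30:00.000Z" />
</head><body>...</body></html>
"""
print(extract_published_date(page))
```

The returned string is left unparsed on purpose, since each standard allows slightly different date formats and normalizing them is a separate step.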

Created a Foundation to Scale

Top-tier online news publishers follow standards, but the bigger challenge is determining how to track millions of other, smaller news and blog sources. To better track non-standardized publications, we built a universal decision tree to help us:

• Traverse raw HTML to seek out relevant data
• Increase coverage and accuracy
• Pick best-match data by assigning probability scores to each piece of extracted data
• Probability influenced by:
  • Text size, position, format
  • Tag and attribute lineage
  • Proximity to significant features (author, title, etc.)
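
The scoring idea above can be sketched roughly as follows. A simple weighted logistic score stands in here for the actual decision tree, which the slides do not describe in detail. The `Candidate` fields, the weights, and the lineage hints are invented for illustration, using an author byline as the value being extracted.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    font_size_px: float     # text size
    position: float         # 0.0 = top of page, 1.0 = bottom
    lineage: tuple          # tag/attribute path, e.g. ("article", "span.byline")
    distance_to_title: int  # proximity to the title, in DOM nodes

# Hypothetical hints that a tag/attribute lineage belongs to an author byline.
LINEAGE_HINTS = ("byline", "author")

def score(c):
    """Map the features to a value in (0, 1). Weights are illustrative only."""
    z = -1.0
    z += 0.05 * (c.font_size_px - 12)   # larger text scores slightly higher
    z += 1.0 * (1.0 - c.position)       # content near the top scores higher
    z += 2.0 * any(h in part for part in c.lineage for h in LINEAGE_HINTS)
    z += 1.5 / (1 + c.distance_to_title)  # closer to the title scores higher
    return 1 / (1 + math.exp(-z))

def best_match(candidates):
    """Pick the candidate with the highest score."""
    return max(candidates, key=score)

candidates = [
    Candidate("Jane Doe", 14, 0.10, ("article", "span.byline"), 1),
    Candidate("Site Admin", 10, 0.95, ("footer", "p"), 40),
]
print(best_match(candidates).text)
```

In practice the weights would be learned or tuned against labeled pages rather than set by hand.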

Next Steps: Specializing and Testing New Theories

Today there is a manageable set of CMS tools that produce HTML from other content to create news articles. Pages produced by these tools have almost identical structures that we can target. We’re ready to expand beyond one massive decision tree and create something new.

Under this theory, we can make it easier to extract and archive relevant PR metadata by:
• Classifying incoming news articles by structural similarity
• Using that classification to select a specialized decision tree unique to that class
• Reaping higher accuracy and performance, because each tree is purpose-built to recognize and extract metadata from a particular structure class
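
One possible way to sketch the classify-then-dispatch idea: fingerprint each page as the set of tag paths it contains, compare that fingerprint to known CMS templates with Jaccard similarity, and hand the page to a class-specific extractor when the match is strong enough. The fingerprinting method, the 0.5 threshold, and all names below are assumptions; the slides do not say how structural similarity is measured.

```python
from html.parser import HTMLParser

VOID_TAGS = {"meta", "link", "br", "img", "input", "hr"}

class TagPathCollector(HTMLParser):
    """Records every root-to-element tag path seen in the document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags contribute a path but never open a scope.
        self.paths.add("/".join(self.stack + [tag]))

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def fingerprint(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def classify(html, templates, threshold=0.5):
    """Return the best-matching template name, or None if nothing is close."""
    fp = fingerprint(html)
    best_name, best_sim = None, 0.0
    for name, template_fp in templates.items():
        sim = jaccard(fp, template_fp)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None

def extract(html, templates, extractors, generic):
    """Use the specialized extractor for the page's class, else the generic one."""
    return extractors.get(classify(html, templates), generic)(html)

# Hypothetical templates built from one sample page per CMS.
templates = {
    "cms_a": fingerprint("<html><body><article><h1>t</h1><span>a</span></article></body></html>"),
    "cms_b": fingerprint("<html><body><div><table><tr><td>x</td></tr></table></div></body></html>"),
}
page = "<html><body><article><h1>Other</h1><span>Bob</span></article></body></html>"
print(classify(page, templates))
```

Pages that match no template fall back to the generic extractor, which plays the role of the single universal decision tree.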

Gratuitous Plug: We’re hiring! ;-)

http://www.trendkite.com/careers

Thanks!