H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik Huddleston
TRANSCRIPT
November 2015 2
Introduction
TrendKite is dedicated to helping PR professionals and agency teams quantify and optimize the impact of their PR efforts.
• Based in Downtown Austin, TX
• Raised $20.6 Million in Funding
• Hundreds of Marquee Clients Including: Nike, Hershey, Pinterest, and Memphis Grizzlies
• Over 75 Employees and Growing Fast!
Erik Huddleston, CEO at TrendKite
Steve Vaughan, Senior Software Architect at TrendKite
Solving PR Big Data Problem
TrendKite tracks a comprehensive set of metrics to accurately measure the impact PR is having on a customer's brand, website traffic, and business goals. Customers can create automated or customizable dashboards that filter through billions of pieces of data in seconds, helping to streamline PR workflow.
Our platform analyzes:
• Over 1.2 Billion Articles
• Almost 2 Million Publishers
• 56 Languages
The Problem with PR Big Data
Complexity in the media landscape is accelerating at an alarming rate:
• Print is dead, making production cheap and distribution trivially easy
• Proliferation of media sources has created millions of highly targeted media outlets with very specific topical influence and audience, while digital formats have increased dramatically
• Lifecycles of media outlets are often measured in weeks and months, with hundreds of sources created and destroyed daily
This makes it incredibly difficult to extract, from millions of unstructured online news publications, the metadata that is vital to understanding PR campaign success, such as:
• Date of Publication
• Author Information
• Total Readership
• Total Mentions
• Share of Voice
• Key Messages
Ontologies in PR Industry
Ontologies are models of entities and their properties that help define the relationships among them. They enable PR and marketing professionals to harness semantic search to pinpoint the news articles that are specific to their brand and to exclude the undesirable hits that standard keyword searches would include.
Leveraging PR Ontologies
PR Industry Ontology Enables:
• Finding Target the company versus the ocean of not-the-company targets
• Dealing with a company, brand, etc. named after a stop word
• Automated discovery of important, relevant news articles, even the ones that don’t explicitly match human-sourced keywords
• Outbound publishing optimization
• Identification of significant and relevant publications, authors, executives, etc.
• Benchmarking the effectiveness of authors and publications on a topic-by-topic basis
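The "Target the company" problem can be illustrated with a toy example. This is a minimal sketch, assuming an ontology entry is just a name, a kind, and a few named relationships; the `Entity` class, the `mentions_entity` helper, and the sample relationships are hypothetical illustrations, and plain co-occurrence matching stands in for real semantic search.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A toy ontology entry: an entity, its kind, and its relationships."""
    name: str
    kind: str                                    # e.g. "company"
    related: dict = field(default_factory=dict)  # relationship -> set of terms


def mentions_entity(text, entity, min_related=1):
    """True if the text names the entity AND enough of its related terms.

    Requiring related terms is what separates "Target" the company
    from "target" the ordinary word.
    """
    lowered = text.lower()
    name_pattern = r"\b" + re.escape(entity.name.lower()) + r"\b"
    if not re.search(name_pattern, lowered):
        return False
    hits = sum(
        1
        for terms in entity.related.values()
        for term in terms
        if term.lower() in lowered
    )
    return hits >= min_related


# Illustrative entry; the relationships chosen here are assumptions.
target = Entity(
    name="Target",
    kind="company",
    related={"industry": {"retailer"}, "headquarters": {"Minneapolis"}},
)

print(mentions_entity("Target, the Minneapolis retailer, reported earnings.", target))
print(mentions_entity("The team missed its sales target this quarter.", target))
```

A keyword search for "target" would return both sentences; the relationship check keeps only the first.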
Located the Obvious Metadata Standards
First, we took the low-hanging fruit, using industry standards to target similarities across the web and extract relevant PR content, such as:
Open Graph - http://ogp.me/
A Facebook-originated open standard to mark pages with types (article, music, video, etc.) and properties (author, duration, actors, etc.)
Ex: <meta property="article:published_time" content="2015-10-31T09:30:00.000Z" />
Dublin Core - http://dublincore.org/
The Dublin Core Schema is a small set of vocabulary terms that can be used to describe web resources (video, images, web pages, etc.)
Ex: <meta name="dc.date" content="2015-10-31 09:30:00" />
Schema.org - https://schema.org/
Open, community-led vocabulary sponsored by Google, Microsoft, Yahoo, and Yandex.
Ex: <meta itemprop="datePublished" content="2015-10-31T09:30:00.000Z" />
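Reading a publication date from these three standards might look like the sketch below. It uses only Python's standard `html.parser`; the `DATE_KEYS` priority order, the `MetaDateParser` class, and the `extract_published` helper are assumptions made for illustration, not TrendKite's actual rules. The Open Graph key follows the published protocol (`property="article:published_time"`).

```python
from html.parser import HTMLParser

# (attribute, value) pairs to look for, in priority order.
# The order is an illustrative assumption.
DATE_KEYS = [
    ("property", "article:published_time"),  # Open Graph
    ("itemprop", "datepublished"),           # Schema.org microdata
    ("name", "dc.date"),                     # Dublin Core
]


class MetaDateParser(HTMLParser):
    """Collects the content of <meta> tags that carry a publication date."""

    def __init__(self):
        super().__init__()
        self.found = {}  # (attribute, value) -> content

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {k: (v or "") for k, v in attrs}
        content = attr_map.get("content")
        if not content:
            return
        for attr, value in DATE_KEYS:
            # Attribute values are compared case-insensitively.
            if attr_map.get(attr, "").lower() == value:
                self.found.setdefault((attr, value), content)


def extract_published(html):
    """Return the highest-priority publication date string, or None."""
    parser = MetaDateParser()
    parser.feed(html)
    for key in DATE_KEYS:
        if key in parser.found:
            return parser.found[key]
    return None


page = (
    '<html><head>'
    '<meta name="dc.date" content="2015-10-31 09:30:00" />'
    '<meta itemprop="datePublished" content="2015-10-31T09:30:00.000Z" />'
    '</head></html>'
)
print(extract_published(page))
```

The returned string is left unparsed on purpose: the three standards use different date formats, so normalization would be a separate step.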
Created a Foundation to Scale
Top-tier online news publishers follow standards, but the bigger challenge is determining how to track millions of other, smaller news and blog sources. To better track non-standardized publications, we built a universal decision tree to help us:
• Traverse raw HTML to seek out relevant data
• Increase coverage and accuracy
• Pick best-match data by assigning probability scores to each piece of extracted data
• Probability influenced by:
    • Text size, position, format
    • Tag and attribute lineage
    • Proximity to significant features (author, title, etc.)
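The scoring idea in the list above can be sketched roughly as follows, here for picking an author byline. A hand-written multiplicative heuristic stands in for the decision tree, and every name and weight (`Candidate`, `LINEAGE_HINTS`, the multipliers) is a hypothetical chosen for illustration, not a value from the talk.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    font_size_px: float     # rendered text size
    position: float         # 0.0 = top of page, 1.0 = bottom
    lineage: tuple          # tag/attribute path, e.g. ("article", "span.byline")
    distance_to_title: int  # DOM nodes between the candidate and the title


# Hypothetical hints that a lineage points at an author byline.
LINEAGE_HINTS = ("byline", "author")


def raw_score(c):
    """Combine the three influences into one unnormalized score."""
    score = 1.0
    # Tag and attribute lineage
    if any(hint in step for step in c.lineage for hint in LINEAGE_HINTS):
        score *= 4.0
    # Position: content near the top of the page is favored
    score *= 1.0 + (1.0 - c.position)
    # Proximity to a significant feature (the title)
    score *= 1.0 / (1.0 + c.distance_to_title)
    # Text size: body-sized text is favored over headlines and fine print
    if 10 <= c.font_size_px <= 16:
        score *= 1.5
    return score


def rank(candidates):
    """Return (probability, candidate) pairs, best match first."""
    raw = [raw_score(c) for c in candidates]
    total = sum(raw)
    scored = [(r / total, c) for r, c in zip(raw, candidates)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    Candidate("Subscribe now", 12, 0.9, ("footer", "a.promo"), 40),
    Candidate("Jane Doe", 12, 0.1, ("article", "span.byline"), 1),
]
best_probability, best = rank(candidates)[0]
print(best.text, round(best_probability, 3))
```

Normalizing the raw scores so they sum to 1 makes candidates from the same page directly comparable, which is one simple reading of "probability scores".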
Next Steps: Specializing and Testing New Theories
Today there is a manageable set of CMS tools that produce HTML from other content to create news articles. Pages produced by these tools have almost identical structures that we can target. We’re ready to expand beyond one massive decision tree and create something new.
Under this theory we can make it easier to extract and archive relevant PR metadata by:
• Classifying incoming news articles by structural similarity
• Using that classification to select a specialized decision tree unique to that class
• Reaping higher accuracy and performance, because each tree is purpose-built to recognize and extract metadata from a particular structure class
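One way to picture the classify-then-specialize idea is the sketch below. It assumes "structural similarity" can be approximated by the Jaccard overlap between the sets of tag paths in two pages; that measure, the 0.6 threshold, and all names (`fingerprint`, `pick_extractor`, the `cms_a`/`cms_b` classes) are illustrative assumptions, and the extractors are stand-in functions, not real decision trees.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}


class TagPathCollector(HTMLParser):
    """Records the set of tag paths (e.g. 'html/body/article') in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def fingerprint(html):
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_extractor(html, classes, generic, threshold=0.6):
    """classes maps a class name to (reference fingerprint, extractor)."""
    fp = fingerprint(html)
    best_name, best_sim = None, 0.0
    for name, (ref_fp, _) in classes.items():
        sim = jaccard(fp, ref_fp)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_name is not None and best_sim >= threshold:
        return best_name, classes[best_name][1]
    # No structure class is close enough: fall back to the universal tree.
    return "generic", generic


cms_a = "<html><body><article><h1>T</h1><span>By A</span><p>x</p></article></body></html>"
cms_b = "<html><body><div><table><tr><td>T</td></tr></table></div></body></html>"
classes = {
    "cms_a": (fingerprint(cms_a), lambda html: "extracted with tree A"),
    "cms_b": (fingerprint(cms_b), lambda html: "extracted with tree B"),
}
generic = lambda html: "extracted with universal tree"

page = "<html><body><article><h1>Other</h1><span>By B</span><p>y</p><p>z</p></article></body></html>"
name, extractor = pick_extractor(page, classes, generic)
print(name, "->", extractor(page))
```

The fallback matters: pages from an unrecognized CMS still go through the universal tree, so specialization can be rolled out one structure class at a time.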