Nov 15 th 2016 @ Apache Big Data EU 2016, Seville, Spain Thamme Gowda Karanjeet Singh SPARKLER Information Retrieval and Data Science Chris Mattmann


Page 1: SPARKLER - Event Schedule & Agenda Builder App | Schedschd.ws/hosted_files/apachebigdataeu2016/e0/Sparkler - ApacheCon … · SPARKLER Information Retrieval and Data Science Chris

Nov 15th 2016 @ Apache Big Data EU 2016, Seville, Spain

Thamme Gowda

Karanjeet Singh

SPARKLER

Information Retrieval and Data Science

Chris Mattmann


ABOUT: USC INFORMATION RETRIEVAL AND DATA SCIENCE GROUP

● Established in August 2012 at the University of Southern California (USC)

● Dr. Chris Mattmann, Director of IRDS and our Advisor

● Funding from NSF, DARPA, NASA, DHS, private industry and other agencies - in collaboration with NASA JPL

● 3 postdocs, 30+ Masters and PhD students, and 20+ JPLers over the past 7 years

● Recent topical research in the DARPA XDATA/MEMEX program


Email: [email protected]

Website: http://irds.usc.edu/

GitHub: https://github.com/USCDataScience/


ABOUT: US

Karanjeet Singh

● Graduate Student at the University of Southern California, USA

● Research Interest: Information Retrieval & Natural Language Processing

● Research Affiliate at NASA Jet Propulsion Laboratory

● Committer and PMC member of Apache Nutch

Thamme Gowda

● Graduate Student at the University of Southern California, USA

● Research Intern at NASA Jet Propulsion Laboratory, Co-Founder at Datoin

● Research Interest: NLP, Machine Learning and Information Retrieval

● Committer and PMC member of Apache Nutch, Tika, and Joshua (Incubating)

Dr. Chris Mattmann

● Director & Vice Chairman, Apache Software Foundation

● Research Interest: Data Science, Open Source, Information Retrieval & NLP

● Committer and PMC member of Apache Nutch, Tika, (former) Lucene, OODT, Incubator


OVERVIEW

● About Sparkler

● Motivations for building Sparkler

● Quick intro to Apache Spark

● Sparkler technology stack, internals

● Features of Sparkler

● Comparison with Nutch

● Going forward


ABOUT: SPARKLER

● New Open Source Web Crawler

○ A bot program that can fetch resources from the web

● Name: Spark Crawler

● Inspired by Apache Nutch

● Like Nutch: Distributed crawler that can scale horizontally

● Unlike Nutch: Runs on top of Apache Spark

● Easy to deploy and easy to use


MOTIVATION #1

● Challenges in DARPA MEMEX

○ Intro: The MEMEX system has crawlers that fetch deep and dark web data to assist law enforcement agencies

○ Crawls are something of a black box; we wanted real-time progress reports

● Dr. Chris Mattmann had been considering an upgrade for 3 years

● Technology upgrade needed


WHY A NEW CRAWLER?

Modern Hadoop cluster has no Hadoop (Map-Reduce) left in it!

https://twitter.com/cutting/status/796566255830503424


MOTIVATION #2

● Challenges at DATOIN

○ Intro: Datoin is a distributed text analytics platform

○ Late 2014 - migrated the infrastructure from Hadoop MapReduce to Apache Spark

○ But the crawler component (powered by Apache Nutch) was left behind

● Met Dr. Chris Mattmann at USC in the Web Search Engines class

○ Asked for his thoughts on running Nutch on Spark

○ Agreed to work on it


KEY FEATURES

● High performance & Fault tolerance

● Real time crawl analysis

● Easy to customize

Is the food ready?

How is it going?

I want less salt.


APACHE SPARK: OVERVIEW

● Introduction

● Resilient Distributed Dataset (RDD)

● Driver, Workers & Executors


APACHE SPARK: INTRODUCTION

● Fast and general engine for large scale data processing

● Started at UC Berkeley in 2009

● One of the most popular distributed computing frameworks

● Provides high level APIs in Scala, Java, Python, R

● Integration with Hadoop and its ecosystem

● Open sourced in 2010; now under the Apache v2.0 license

● Mattmann helped to bring Spark to Apache under the DARPA XDATA effort


Resilient Distributed Dataset (RDD)

● A basic abstraction in Spark

● Immutable, partitioned collection of elements operated on in parallel

● Data in persistent store (HDFS, Cassandra) or in cache (memory, disk)

● Partitions are recomputed on failure or cache eviction

● Two classes of operations

○ Transformations

○ Actions

● Custom RDDs can also be implemented - we have one!
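The RDD properties above (immutable, partitioned, lazy transformations, actions that trigger work, recomputation from lineage) can be illustrated with a toy class. This is a minimal sketch in plain Python, not the Spark API: `MiniRDD`, `from_list`, and `traced_square` are hypothetical names invented for illustration.

```python
from typing import Callable, Iterator, List


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, NOT the Spark API)."""

    def __init__(self, compute: Callable[[int], Iterator], num_partitions: int):
        self._compute = compute            # how to (re)build one partition
        self.num_partitions = num_partitions

    @staticmethod
    def from_list(items: List, num_partitions: int) -> "MiniRDD":
        data = tuple(items)  # snapshot the source so it is never mutated
        # Round-robin split: element i goes to partition i % num_partitions.
        return MiniRDD(lambda p: iter(data[p::num_partitions]), num_partitions)

    # Transformations: return a new MiniRDD and do no work yet.
    def map(self, fn: Callable) -> "MiniRDD":
        return MiniRDD(lambda p: (fn(x) for x in self._compute(p)),
                       self.num_partitions)

    def filter(self, pred: Callable) -> "MiniRDD":
        return MiniRDD(lambda p: (x for x in self._compute(p) if pred(x)),
                       self.num_partitions)

    # Actions: trigger the computation, one partition at a time.
    def collect(self) -> List:
        out = []
        for p in range(self.num_partitions):
            out.extend(self._compute(p))
        return out

    def count(self) -> int:
        return sum(1 for p in range(self.num_partitions)
                   for _ in self._compute(p))


calls = []


def traced_square(x):
    calls.append(x)      # record when the function really runs
    return x * x


numbers = MiniRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=2)
squares = numbers.map(traced_square).filter(lambda x: x % 2 == 0)
work_before_action = len(calls)   # transformations alone ran nothing
even_squares = squares.collect()  # the action runs the whole lineage
```

Because nothing is cached in this toy, every action recomputes each partition from its lineage, which is the same mechanism the slide describes for failures and cache evictions.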


Driver, Workers & Executors

* Photo credit - spark.apache.org


SPARKLER: TECH STACK

● Batch crawling (similar to Apache Nutch)

● Apache Solr as crawl database

● Multi module Maven project with OSGi bundles

● Stream crawled content through Apache Kafka

● Parses everything using Apache Tika

● Crawl visualization - Banana


SPARKLER: INTERNALS & WORKFLOW


SPARKLER: FEATURES


SPARKLER #1: Lucene/Solr powered Crawldb

● Crawldb needed indexing

○ For real time analytics

○ For instant visualizations

● This is the internal data structure of Sparkler

○ Exposed over REST API

○ Used by Sparkler-ui, the web application

● We chose Apache Solr

● Standalone Solr Server or Solr Cloud?

● Glued the crawldb and Spark using CrawldbRDD


SPARKLER #2: Partitioning by host


● Politeness

○ Doesn’t hit the same server too many times in distributed mode

● First version

○ Group by: Host name

○ Sort by: depth, score

● Customization is easy

○ Write your own Solr query

○ Take advantage of boosting to alter the ranking

● Partitions the dataset based on the above criteria

● Lazy evaluation and delay between the requests

■ Performs parsing instead of waiting

■ Inserts delay only when it is necessary
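The grouping and ranking criteria above can be sketched in memory as follows. This is an illustration under stated assumptions, not Sparkler's code: Sparkler issues a Solr query instead, and `CrawlRecord`, `select_batch`, the sort direction (shallow depth first, then higher score), and the alphabetical choice of hosts are all hypothetical. The `top_groups` and `top_n` parameters mirror the topGroups and topN settings listed on the comparison slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple
from urllib.parse import urlsplit


class CrawlRecord(NamedTuple):
    """Hypothetical shape of one crawldb entry."""
    url: str
    depth: int
    score: float


def select_batch(records: Iterable[CrawlRecord],
                 top_groups: int,
                 top_n: int) -> Dict[str, List[CrawlRecord]]:
    """Pick at most top_n URLs from each of at most top_groups hosts."""
    groups = defaultdict(list)
    for rec in records:
        host = urlsplit(rec.url).hostname or ""
        groups[host].append(rec)          # group by host name

    batch = {}
    # Assumption: hosts are taken in alphabetical order to keep the demo
    # deterministic; a real crawler would rank the groups themselves.
    for host in sorted(groups)[:top_groups]:
        # Assumption: shallower pages first, ties broken by higher score.
        ranked = sorted(groups[host], key=lambda r: (r.depth, -r.score))
        batch[host] = ranked[:top_n]
    return batch


records = [
    CrawlRecord("http://a.example/deep", 2, 0.9),
    CrawlRecord("http://a.example/", 0, 0.1),
    CrawlRecord("http://b.example/x", 1, 0.2),
    CrawlRecord("http://b.example/y", 1, 0.8),
    CrawlRecord("http://c.example/", 0, 1.0),
]
batch = select_batch(records, top_groups=2, top_n=1)
```

Each key of `batch` corresponds to one partition, so all requests to a given host stay in one place and the delay between them can be enforced locally.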


SPARKLER #3: OSGI Plugins


● Plugin interfaces are inspired by Nutch

● Plugins are developed as per the Open Services Gateway initiative (OSGi)

● We chose the Apache Felix implementation of OSGi

● Migrated a plugin from Nutch

○ Regex URL Filter Plugin → The most used plugin in Nutch

● Added JavaScript plugin (described in the next slide)

● //TODO: Migrate more plugins from Nutch

○ Mavenize nutch [NUTCH-2293]
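To show what a regex URL filter does, here is a minimal sketch in Python. It assumes the rule-file convention used by Nutch's regex URL filter (each rule starts with `+` to accept or `-` to reject, the first matching rule decides, and a URL that matches no rule is dropped). The rules, `parse_rules`, and `filter_url` are hypothetical and are not the plugin's actual interface.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Hypothetical rule file: '+' accepts, '-' rejects, '#' starts a comment.
RULES_TEXT = r"""
# skip common image suffixes
-\.(gif|jpg|png)$
# accept anything on example.com and its subdomains
+^https?://([a-z0-9.-]+\.)?example\.com/
"""


def parse_rules(text: str) -> List[Tuple[bool, Pattern]]:
    """Turn the rule text into (accept?, compiled regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def filter_url(url: str, rules: List[Tuple[bool, Pattern]]) -> Optional[str]:
    """Return the URL if it is accepted, otherwise None."""
    for accept, pattern in rules:
        if pattern.search(url):          # first matching rule decides
            return url if accept else None
    return None                          # no rule matched: drop the URL


rules = parse_rules(RULES_TEXT)
kept = filter_url("http://www.example.com/index.html", rules)
image = filter_url("http://www.example.com/logo.png", rules)
other = filter_url("http://other.org/", rules)
```

Rule order matters: the image rule sits first, so an image on an accepted host is still rejected.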


SPARKLER #4: JavaScript Rendering


● JavaScript execution* has first class support

● Distributable on a Spark cluster without pain

○ Pure JVM based JavaScript engine

● This is an implementation of FetchFunction

● FetchFunction

○ Stream<URL> → Stream<Content>

○ Note: URLs are grouped by host

○ It preserves cookies and reuses sessions for each iteration

Thanks to: Madhav Sharan, Member of USC IRDS

* JBrowserDriver by MachinePublishers
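The Stream<URL> → Stream<Content> shape, combined with the "inserts delay only when it is necessary" behaviour from the partitioning slide, can be sketched as a Python generator. This is an assumption-laden illustration, not Sparkler's FetchFunction: `fetch_stream`, `fetch_one`, `min_delay`, and `FakeClock` are hypothetical names, and `fetch_one` stands in for whatever component holds the session and cookies.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def fetch_stream(
    urls: Iterable[str],
    fetch_one: Callable[[str], str],
    min_delay: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, str]]:
    """Lazily turn a stream of same-host URLs into (url, content) pairs."""
    last_fetch: Optional[float] = None
    for url in urls:
        if last_fetch is not None:
            # Time the consumer spent (e.g. parsing) counts toward the delay.
            remaining = min_delay - (clock() - last_fetch)
            if remaining > 0:
                sleep(remaining)
        content = fetch_one(url)
        last_fetch = clock()
        yield url, content


class FakeClock:
    """Deterministic clock so the demo does not really sleep."""

    def __init__(self) -> None:
        self.now = 0.0
        self.slept = []

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.slept.append(seconds)
        self.now += seconds

    def advance(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
stream = fetch_stream(
    ["http://host.example/1", "http://host.example/2", "http://host.example/3"],
    fetch_one=lambda url: "content of " + url,   # stand-in for a real fetch
    min_delay=1.0, clock=clock.time, sleep=clock.sleep)

first = next(stream)     # first request: no delay needed
clock.advance(0.25)      # pretend parsing took 0.25 s
second = next(stream)    # only the remaining 0.75 s is waited
clock.advance(2.0)       # slow parse, longer than the delay
third = next(stream)     # no extra wait at all
```

Because the generator is pulled one item at a time, the time the caller spends parsing a page is subtracted from the politeness delay instead of being added to it.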


SPARKLER #5: Output in Kafka Streams


● The crawler is sometimes the input for applications that do deeper analysis

○ Can’t fit all of that deeper analysis into the crawler

● Integration with such applications is made easy via queues

● We chose Apache Kafka

○ Suits our need

■ Distributable, Scalable, Fault Tolerant

● FIXME: Larger messages such as videos

● This is optional; the default output is on a shared file system (such as HDFS), compatible with Nutch

* Thanks to: Rahul Palamuttam (MS CS @ Stanford University; Intern @ NASA JPL)


SPARKLER #6: Tika, the universal parser


● Apache Tika

○ Is a toolkit of parsers

○ Detects and extracts metadata, text, and URLs

○ Over a thousand different file types

● Main application is to discover outgoing links

● The default Implementation for our ParseFunction
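Outgoing-link discovery, the main job of the parse step, can be sketched as follows. This is not Tika: Python's standard `html.parser` stands in for it and handles HTML only, and `LinkCollector` and `parse_outlinks` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from anchor tags (stand-in for Tika)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web = absolute.startswith(("http://", "https://"))
                if is_web and absolute not in self.links:
                    self.links.append(absolute)


def parse_outlinks(base_url: str, html: str) -> List[str]:
    """Return the de-duplicated outgoing links found in an HTML page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


page = (
    '<a href="/about">About</a> '
    '<a href="page.html#top">Page</a> '
    '<a href="mailto:team@example.com">Mail</a> '
    '<a href="/about">About again</a>'
)
outlinks = parse_outlinks("http://example.com/docs/index.html", page)
```

The discovered links are what a crawler would feed back into the crawldb for the next iteration.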


SPARKLER #7: Visual Analytics


● Charts and graphs provide a nice summary of the crawl job

● Real time analytics

● Example:

○ Distribution of URLs across hosts/domains

○ Temporal activities

○ Status reports

● Customizable in real time

● Using Banana Dashboard from Lucidworks

● Sparkler has a sub component named sparkler-ui

* Thanks to: Manish Dwibedy, MS CS, University of Southern California


SPARKLER #Next: what’s coming?


● Interactive UI

● More plugins

● Scoring Crawled Pages

● Focussed Crawling

● Crawl Graph Analysis

● Domain Discovery (another research challenge)

● Other useful plugins from Nutch

● Detailed documentation and tutorials on wiki


HOW FAST IT RUNS - Comparison with Nutch

Crawl Iterations : 5

Fetch Delay : 1 sec

Nutch Configuration

● Version : 1.12

● topN : 50,000

● Fetcher Thread : 1

Hadoop Configuration

● Version : 2.6.0-cdh5.8.2

● Slaves : 2

● Memory : 8G (Map), 16G (Reduce)

● 22 Mappers, 11 reducers

Sparkler Configuration

● Version : 0.1-SNAPSHOT

● topGroups : 252

● topN : 1000

Spark Configuration

● Version : 1.6.1 with Scala v2.11

● Slaves : 222 Worker Instances with 210G memory


DIVERSIFIED - Comparison with Nutch


Sparkler Dashboard


SPARKLER IS COMING TO APACHE

Look for the proposal later this week!


● Get involved in our journey through the Apache Incubator

● Get started: Check out the README and wiki at https://github.com/USCDataScience/sparkler


Questions?

THANK YOU