Analyzing Log Data with Spark and Looker
May 18, 2016
Housekeeping
• We will do Q&A at the end. You should see a box on the right side of your screen; there is a button marked “Q&A” on the bottom menu.
• We will be recording this. We will send you the recording and the slides tomorrow.
Meet Our Presenters
Scott Hoover, Data Scientist, Looker
Daniel Mintz, Chief Data Evangelist, Looker
AGENDA
1. Overview
2. Quick intro to Looker
3. Creating a pipeline with Spark
4. Analyzing Log Data
5. Deriving insight from Log Data
6. Q&A
Looker makes it easy for everyone to find, explore and understand the data that drives your business.
THE TECHNICAL PILLARS THAT MAKE IT POSSIBLE
100% In-Database
Leverage all your data. Avoid summarizing or moving it.
Modern Web Architecture
Access from anywhere. Share and collaborate. Extend to anyone.
LookML Intelligent Modeling Layer
Describe the data. Create reusable and shareable business logic.
Creating a Pipeline with Spark

Who is this webinar for?
● You have some exposure to Spark.
● You have log data (with an interest in processing it at scale).
● Terminal doesn’t terrify you.
● You know some SQL.
Alright—just in case: What is Spark?
● In-memory, distributed compute framework.
● Drivers to access Spark: spark-shell, JDBC, submit application jar.
● Core concepts (sketched below):
  ○ Resilient distributed dataset (RDD)
  ○ DataFrame
  ○ Discretized stream (DStream)
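To make those three abstractions concrete, here is a minimal spark-shell sketch using Spark 1.x-era APIs, consistent with the FlumeUtils code later in this deck. The HDFS path and socket port are placeholders, not from the original slides; sc and sqlContext are the shell's built-ins.

// RDD: an immutable, partitioned collection distributed across the cluster
val lines = sc.textFile("hdfs://some.endpoint:8020/path/to/access.log")

// DataFrame: an RDD with a schema that Spark SQL can optimize and query
import sqlContext.implicits._
val df = lines.map(line => (line, line.length)).toDF("line", "length")

// DStream: an unbounded stream, processed as a sequence of micro-batch RDDs
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val stream = ssc.socketTextStream("localhost", 9999)
stream.foreachRDD(rdd => println(s"received ${rdd.count()} lines"))
// (ssc.start() would kick off the micro-batch processing)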
Why is Spark of interest to us?
● APIs galore: Java, Scala, Python and R.
● Rich ecosystem of applications:
  ○ Spark SQL
  ○ Streaming
  ○ MLlib
  ○ GraphX
● Hooks into the broader Hadoop ecosystem:
  ○ Filesystem: S3, HDFS
  ○ Movement: Kafka, Flume, Kinesis
  ○ Database: HBase, Cassandra and JDBC
● Learning curve is relatively forgiving.
● Looker can issue SQL to Spark via JDBC! (A connection sketch follows below.)
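A hedged sketch of that connection: the Spark Thrift Server implements the HiveServer2 wire protocol, so the standard Hive JDBC driver can query it. The host, port, empty credentials, and table name below are assumptions.

import java.sql.DriverManager

// the Hive JDBC driver works because the Spark Thrift Server
// speaks the HiveServer2 protocol
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("select count(*) from logdata.event")
while (rs.next()) println(rs.getLong(1))
conn.close()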
Let’s review a couple of proposed architectures.
[Architecture diagram: Data Sources → Ingestion (RPC/REST/pull; RPC/put) → Storage (raw logs/text; Parquet/ORC; metastore; external table) → Compute (stream; MapReduce over raw logs) → BI/Viz./Analytics via JDBC]
[Architecture diagram: Data Sources → Ingestion (RPC/REST/pull; RPC/put of raw logs) → Storage (Parquet/ORC; metastore; external table) → Compute (stream) → BI/Viz./Analytics via JDBC]
Sidebar: Why the emphasis on Parquet and ORC? Both are columnar formats: they compress well, carry their schema with the data, and let query engines scan only the columns (and, via predicate pushdown, only the row groups) a query actually needs.
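A small sketch of the payoff, using the same Parquet path the pipeline writes to later in this deck: with a columnar file, the query below reads only two of the twelve columns from disk.

val logs = sqlContext.read.parquet("hdfs://some.endpoint:8020/path/to/logdata.parquet")
// column pruning: only the status and uri column chunks are scanned;
// the other ten columns in the file are never read
logs.select("status", "uri").groupBy("status").count().show()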
Let’s step through the essential components of this pipeline.
# name the components of agent
agent.sources = terminal
agent.sinks = logger spark
agent.channels = memory1 memory2

# describe source
agent.sources.terminal.type = exec
agent.sources.terminal.command = tail -f /home/hadoop/generator/logs/access.log

# describe logger sink
agent.sinks.logger.type = logger

# describe spark sink
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = localhost
agent.sinks.spark.port = 9988

# channel buffers events in memory (used with logger sink)
agent.channels.memory1.type = memory
agent.channels.memory1.capacity = 10000
agent.channels.memory1.transactionCapacity = 1000

# channel buffers events in memory (used with spark sink)
agent.channels.memory2.type = memory
agent.channels.memory2.capacity = 10000
agent.channels.memory2.transactionCapacity = 1000

# tie source and sinks with respective channels
agent.sources.terminal.channels = memory1 memory2
agent.sinks.logger.channel = memory1
agent.sinks.spark.channel = memory2
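With that configuration saved as, say, agent.conf (the file name is an assumption), the agent would typically be started with Flume's standard launcher, e.g. flume-ng agent --conf-file agent.conf --name agent, where the --name value must match the "agent" prefix used in the properties above.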
A sample log line produced by the tailed access log:

2.174.143.4 - - [09/May/2016:05:56:03 +0000] "GET /department HTTP/1.1" 200 1226 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.2)" "USER=0;NUM=9"
A quick Scala refresher on case classes:

scala> case class Foo(bar: String)

scala> val instanceOfFoo = Foo("i am a bar")

scala> instanceOfFoo.bar
res0: String = i am a bar
case class LogLine(
  ip_address: String,
  identifier: String,
  user_id: String,
  created_at: String,
  method: String,
  uri: String,
  protocol: String,
  status: String,
  size: String,
  referer: String,
  agent: String,
  user_meta_info: String
)
val ip_address     = "([0-9\\.]+)"          // 2.174.143.4
val identifier     = "(\\S+)"                // -
val user_id        = "(\\S+)"                // -
val created_at     = "(?:\\[)(.*)(?:\\])"    // 09/May/2016:05:56:03 +0000
val method         = "(?:\")([A-Z]+)"        // GET
val uri            = "(\\S+)"                // /department
val protocol       = "(\\S+)(?:\")"          // HTTP/1.1
val status         = "(\\S+)"                // 200
val size           = "(\\S+)"                // 1226
val referer        = "(?:\")(\\S+)(?:\")"    // -
val agent          = "(?:\")([^\"]*)(?:\")"  // Mozilla/4.0 ...
val user_meta_info = "(.*)"                  // USER=0;NUM=9

val lineMatch = s"$ip_address\\s+$identifier\\s+$user_id\\s+$created_at\\s+$method\\s+$uri\\s+$protocol\\s+$status\\s+$size\\s+$referer\\s+$agent\\s+$user_meta_info".r
def extractValues(line: String): Option[LogLine] = {
  line match {
    case lineMatch(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info, _*) =>
      Option(LogLine(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info))
    case _ => None
  }
}
def main(argv: Array[String]): Unit = {

  // create streaming context
  val ssc = new StreamingContext(conf, Seconds(batchDuration))

  // ingest data
  val stream = FlumeUtils.createPollingStream(ssc, host, port)

  // extract message body from avro-serialized flume events
  val mapStream = stream.map(event => new String(event.event.getBody().array()))

  // parse log into LogLine class and write parquet to hdfs/s3
  mapStream.foreachRDD { rdd =>
    rdd.map { line => extractValues(line).get }
      .toDF()             // convert rdd to dataframe with schema
      .coalesce(1)        // consolidate to single partition
      .write              // serialize
      .format("parquet")  // write parquet
      .mode("append")     // don’t overwrite files; append instead
      .save("hdfs://some.endpoint:8020/path/to/logdata.parquet")
  }

  // start the streaming job and block until it is terminated
  ssc.start()
  ssc.awaitTermination()
}
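Packaged as an application jar, this job would typically be launched with Spark's standard spark-submit script. Note that conf, host, port, and batchDuration are assumed to be defined elsewhere in the application; the full source is linked at the end of this section.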
create external table logdata.event (
    ip_address string
  , identifier string
  , user_id string
  , created_at timestamp
  , method string
  , uri string
  , protocol string
  , status int
  , size int
  , referer string
  , agent string
  , user_meta_info string
)
stored as parquet
location 'hdfs://some.endpoint:8020/path/to/logdata.parquet';
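Once the external table is registered in the metastore, the same data is queryable from the Spark shell as well as from Looker over JDBC. A minimal sketch against a Spark 1.x HiveContext (the aggregate query is illustrative, not from the original slides):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// requests by HTTP status over everything the stream has appended so far
hiveContext.sql("select status, count(*) as requests from logdata.event group by status").show()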
Live demo of streaming pipeline.
Example streaming application can be found here:
https://github.com/looker/spark_log_data
Analyzing Log Data with Looker Demo
Q&A
THANK YOU FOR JOINING
Recording and slides will be posted.
We will email you the links tomorrow.
See you next time!
Next from Looker Webinars: Data Stack Considerations: Build vs. Buy at Tout, on June 2.
See how Looker works on your data.
Visit looker.com/free-trial or email [email protected].
THANK YOU!