how conviva used spark to speed up video analytics by 25x dilip antony joseph (@ dilipantony )
DESCRIPTION
How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony ). Conviva monitors and optimizes online video for premium content providers. We see 10s of millions of streams every day. Conviva data processing architecture. Monitoring Dashboard. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/1.jpg)
How Conviva used Spark to Speed Up Video Analytics by 25xDilip Antony Joseph (@DilipAntony)
![Page 2: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/2.jpg)
Conviva monitors and optimizes online video for premium content providers
We see 10s of millions of streams every day
![Page 3: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/3.jpg)
Conviva data processing architecture
Hadoop for historical data
Live data processing
Spark
Reports
Ad-hoc analysis
Video Player
Monitoring Dashboard
![Page 4: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/4.jpg)
Group By queries dominate our workload
• 10s of metrics, 10s of group bys• Hive scans data again and again from HDFS Slow
• Conviva GeoReport took ~ 24 hours using Hive
SELECT videoName, COUNT(1)FROM summariesWHERE date='2011_12_12' AND customer='XYZ’GROUP BY videoName;
![Page 5: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/5.jpg)
Group By queries can be easily written in Sparkval sessions = sparkContext.sequenceFile[SessionSummary, NullWritable]( pathToSessionSummaryOnHdfs, classOf[SessionSummary], classOf[NullWritable]) .flatMap { case (key, val) => val.fieldsOfInterest }
val cachedSessions = sessions.filter( whereConditionToFilterSessionsForTheDesiredDay) .cacheval mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn : (Long, Long) => Long = { (a,b) => a+b }
val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
![Page 6: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/6.jpg)
Spark is blazing fast!• Spark keeps sessions of interest in RAM • Repeated group by queries are very fast.• Spark-based GeoReport runs in 45 minutes
(compared to 24 hours with Hive)
![Page 7: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/7.jpg)
Spark queries require more code, but are not too hard to write
• Writing queries in Scala – There is a learning curve
• Type-safety offered by Scala is a great boon– Code completion via Eclipse Scala plugin
• Complex queries are easier to write in Scala than in Hive.– Cascading IF()s in Hive
![Page 8: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/8.jpg)
Challenges in using Spark• Learning Scala• Always on the bleeding edge – getting
dependencies right• More tools required
![Page 9: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/9.jpg)
Spark @ Conviva today• Using Spark for about 1 year• 30% of our reports use Spark, rest use Hive
– Analytics portal with canned Spark/Hive jobs
• More projects in progress– Anomaly detection– Interactive console to debug video quality issues– Near real-time analysis and decision making using Spark
• Blog Entry: http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva
![Page 10: How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@ DilipAntony )](https://reader036.vdocuments.us/reader036/viewer/2022081514/56816404550346895dd5ab6e/html5/thumbnails/10.jpg)
We are [email protected]