Startup Safary | Fight against robots with enbrite.ly data platform
![Page 1: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/1.jpg)
Fight against robots with enbrite.ly data platform
Joe MÉSZÁROS
![Page 2: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/2.jpg)
Joe MÉSZÁROS
lead software engineer
@joemesz
joemeszaros
![Page 3: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/3.jpg)
Who are we?
Our vision is to revolutionize the KPIs and metrics that the online advertising industry currently uses. With our products, Antifraud, Brandsafety and Viewability, we provide actionable data to our customers.
![Page 4: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/4.jpg)
Ad display fraud (ad stacking, pixel stuffing)
Ad viewability
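To make the terms on this slide concrete, here is a minimal geometric sketch, not enbrite.ly's detection logic. It assumes an ad is described by its rendered position and size in viewport coordinates; the class name `ViewabilitySketch` and both method names are hypothetical. Pixel stuffing is modelled as an ad rendered into a frame of at most 1x1 pixels, and viewability as the share of the ad's area that falls inside the viewport.

```java
/** Hypothetical helper illustrating viewability and pixel stuffing; all names are assumptions. */
final class ViewabilitySketch {

    /**
     * Returns the fraction (0.0 to 1.0) of the ad rectangle that lies inside
     * a viewport whose top-left corner is (0, 0).
     */
    static double visibleFraction(int adX, int adY, int adW, int adH, int vpW, int vpH) {
        if (adW <= 0 || adH <= 0) {
            return 0.0; // an ad without area cannot be visible
        }
        // Intersect the ad rectangle with the viewport rectangle.
        int left = Math.max(adX, 0);
        int top = Math.max(adY, 0);
        int right = Math.min(adX + adW, vpW);
        int bottom = Math.min(adY + adH, vpH);
        long visibleArea = (long) Math.max(0, right - left) * Math.max(0, bottom - top);
        return (double) visibleArea / ((long) adW * adH);
    }

    /** Flags an ad that was squeezed into a frame of at most 1x1 pixels. */
    static boolean looksPixelStuffed(int renderedW, int renderedH) {
        return renderedW <= 1 && renderedH <= 1;
    }
}
```

Which visible fraction counts as "viewable" is a policy choice that would be applied on top of `visibleFraction`; the sketch deliberately leaves that threshold out.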
![Page 5: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/5.jpg)
Brand safety
Detecting traffic that comes from unwanted categories (e.g. adult), countries and single domains
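One way to picture this check is a lookup against three blocklists. The sketch below is illustrative only: the class `BrandSafetyFilter`, its constructor and the way an impression is described (category, country code, domain) are assumptions, not the product's actual design.

```java
import java.util.Set;

/** Hypothetical brand-safety check: flags traffic from unwanted categories, countries or domains. */
final class BrandSafetyFilter {
    private final Set<String> blockedCategories;
    private final Set<String> blockedCountries;
    private final Set<String> blockedDomains;

    BrandSafetyFilter(Set<String> blockedCategories,
                      Set<String> blockedCountries,
                      Set<String> blockedDomains) {
        this.blockedCategories = blockedCategories;
        this.blockedCountries = blockedCountries;
        this.blockedDomains = blockedDomains;
    }

    /** Returns true when any attribute of the traffic is on one of the blocklists. */
    boolean isUnwanted(String category, String countryCode, String domain) {
        return blockedCategories.contains(category)
                || blockedCountries.contains(countryCode)
                || blockedDomains.contains(domain);
    }
}
```

Keeping the three lists separate mirrors the slide: a customer can block a whole category, a whole country, or just one domain.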
![Page 6: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/6.jpg)
39%
Anti fraud detection
![Page 7: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/7.jpg)
![Page 8: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/8.jpg)
What do we do?
DATA COLLECTION
ANALYZE
DATA PROCESSING
ANTI FRAUD
VIEWABILITY
BRAND SAFETY
REPORT + API
![Page 9: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/9.jpg)
How do we do it?
DATA PLATFORM
...so we need to analyze vast amounts of data
Infrastructure + Big Data technologies = enbrite.ly data platform
![Page 10: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/10.jpg)
Amazon Web Services (AWS)
● Most popular cloud service provider
● ~70 services, 13 geographical "regions"
● Amazon Big Data = Elastic MapReduce (EMR)
● BUT do not trust the BIG guy (API problem)
https://aws.amazon.com/
![Page 11: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/11.jpg)
Apache Hadoop
● de facto Big Data technology
● open source software
● distributed storage (HDFS) + data processing (MapReduce)
● ecosystem: many additional tools
http://hadoop.apache.org/ | https://github.com/apache/hadoop
![Page 12: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/12.jpg)
Apache Spark
● large-scale data processing engine
● open source software (popular)
● modules: core, SQL, streaming, graph, ML
● faster than Hadoop MapReduce
http://spark.apache.org/ | https://github.com/apache/spark
![Page 13: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/13.jpg)
Data platform in numbers
20+ node cluster
16 services
110 servers
0.5 - 4 TB / day
100+ TB on S3
![Page 14: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/14.jpg)
How do we do it?
DATA COLLECTION
![Page 15: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/15.jpg)
How do we do it?
DATA PROCESSING
![Page 16: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/16.jpg)
Let me tell you a short story...
![Page 17: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/17.jpg)
Real world example
You have a simple idea to detect bot traffic, which saves the world. Let’s implement it!
![Page 18: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/18.jpg)
Real world example
THE IDEA: Analyse events which are too hasty and deviate from regular, humanlike profiles: too many clicks in a defined timeframe.
INPUT: Collected events on Amazon S3
OUTPUT: Invalid sessions
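"Too many clicks in a defined timeframe" can be read in more than one way. The sketch below shows one possible reading, a sliding time window over click timestamps, and is an assumption rather than what the talk's application does (the talk counts clicks per session). The class `ClickBurstDetector`, the method `hasBurst` and its parameters are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical sliding-window check for "too many clicks in a defined timeframe". */
final class ClickBurstDetector {

    /**
     * Returns true if any window shorter than windowMillis contains
     * more than maxClicks click timestamps.
     */
    static boolean hasBurst(long[] timestampsMillis, long windowMillis, int maxClicks) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        long[] sorted = timestampsMillis.clone();
        Arrays.sort(sorted);
        int start = 0;
        for (int end = 0; end < sorted.length; end++) {
            // Advance the left edge until the window spans less than windowMillis.
            while (sorted[end] - sorted[start] >= windowMillis) {
                start++;
            }
            if (end - start + 1 > maxClicks) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `ClickBurstDetector.hasBurst(new long[] {0, 100, 200, 300, 5000}, 1000, 3)` flags the input, because four clicks fall within the first second.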
![Page 19: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/19.jpg)
How to solve it?
Step 1: sessionize events
Step 2: detect too many clicks
code: https://github.com/enbritely/startup-safary
![Page 20: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/20.jpg)
Step 1: event to session
// configure Spark application
// read events from HDFS
JavaRDD<Event> events = lines.map(Converter::jsonToEvent);
// keep only the click events
JavaRDD<Event> clicks = events.filter(e -> e.type.equals("click"));
// group the clicks by session id
JavaPairRDD<String, Iterable<Event>> grouped = clicks.groupBy(Event::sessionId);
// build one Session per group and drop the keys
JavaRDD<Session> sessions = grouped.mapValues(sessionizer).values();
Application code : https://github.com/enbritely/startup-safary
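The same filter-then-group shape can be tried locally without a cluster. The following is a rough single-machine analogue using Java streams instead of Spark RDDs; the nested `Event` class is a minimal hypothetical stand-in for the talk's `Event`, and `LocalSessionizeSketch` and `clicksBySession` are names invented for this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical local analogue of the Spark filter + groupBy steps. */
final class LocalSessionizeSketch {

    /** Minimal stand-in for the talk's Event class (assumed fields). */
    static final class Event {
        final String sessionId;
        final String type;
        final long timestamp;

        Event(String sessionId, String type, long timestamp) {
            this.sessionId = sessionId;
            this.type = type;
            this.timestamp = timestamp;
        }
    }

    /** Keeps only click events and groups them by session id. */
    static Map<String, List<Event>> clicksBySession(List<Event> events) {
        return events.stream()
                .filter(e -> "click".equals(e.type))
                .collect(Collectors.groupingBy(e -> e.sessionId));
    }
}
```

The difference that matters at scale is where the grouping happens: here it is one in-memory map, while Spark's `groupBy` shuffles events across the cluster so that all clicks of a session end up on the same node.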
![Page 21: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/21.jpg)
Step 1: event to session
// Sessionizer: turns the clicks of one session into a Session object
Function<Iterable<Event>, Session> sessionizer = unorderedEvents -> {
    List<Event> clickOrdered = sortByTimestamp(unorderedEvents);
    // groupBy never yields an empty group, so the first event always exists
    Session session = new Session(clickOrdered.get(0).sessionId());
    for (Event event : clickOrdered) {
        session.addClick(event.getTimestamp());
    }
    return session;
};
Application code : https://github.com/enbritely/startup-safary
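The slide does not show the `Session` class or the sorting helper that the sessionizer relies on. A minimal sketch of what they might look like follows; it is a guess at the shape, not the code from the repository, and the `Event` stand-in only carries the timestamp accessor the sessionizer uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical supporting types for the sessionizer; not taken from the repository. */
final class SessionSketch {

    /** Stand-in for Event with only the accessor needed here. */
    static final class Event {
        private final long timestamp;

        Event(long timestamp) {
            this.timestamp = timestamp;
        }

        long getTimestamp() {
            return timestamp;
        }
    }

    /** Remembers its id and the click timestamps added to it. */
    static final class Session {
        final String sessionId;
        private final List<Long> clickTimestamps = new ArrayList<>();

        Session(String sessionId) {
            this.sessionId = sessionId;
        }

        void addClick(long timestamp) {
            clickTimestamps.add(timestamp);
        }

        int getClickCount() {
            return clickTimestamps.size();
        }
    }

    /** Copies the events into a new list ordered by ascending timestamp. */
    static List<Event> sortByTimestamp(Iterable<Event> unorderedEvents) {
        List<Event> ordered = new ArrayList<>();
        unorderedEvents.forEach(ordered::add);
        ordered.sort(Comparator.comparingLong(Event::getTimestamp));
        return ordered;
    }
}
```

Sorting matters because a grouped collection carries no ordering guarantee; any later logic that looks at the time between clicks needs them in timestamp order.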
![Page 22: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/22.jpg)
Step 2: apply heuristic
// flag sessions whose click count exceeds the threshold
JavaRDD<String> badSessions = sessions
    .filter(s -> s.getClickCount() > threshold)
    .map(s -> s.sessionId + ":" + s.getClickCount());
// save output to HDFS
Application code : https://github.com/enbritely/startup-safary
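To experiment with the threshold without Spark, the heuristic can be reduced to a few lines over an in-memory map of click counts. This is a local sketch under assumptions: `ClickThresholdSketch` and `badSessions` are hypothetical names, and the input is taken to be a ready-made map from session id to click count.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical local version of the click-count heuristic. */
final class ClickThresholdSketch {

    /**
     * Returns "sessionId:clickCount" lines for every session whose click
     * count is strictly greater than the threshold, sorted for stable output.
     */
    static List<String> badSessions(Map<String, Integer> clickCounts, int threshold) {
        return clickCounts.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .map(e -> e.getKey() + ":" + e.getValue())
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Note that the comparison is strict, as on the slide: a session with exactly `threshold` clicks is not flagged.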
![Page 23: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/23.jpg)
Live demo!
● 4 node EMR (Hadoop) cluster
● Apache Spark 1.6.1
● 1 GB input events
build app : create-cluster : events S3 -> HDFS : submit app
![Page 24: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/24.jpg)
Congratulations!
MISSION COMPLETED
YOU just saved the world with a simple idea within ~10 minutes.
![Page 25: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/25.jpg)
WE ARE HIRING!
working @exPrezi office, K9
check out the company in Forbes :-)
amazing company culture
BUT the real reason ….
![Page 26: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/26.jpg)
WE ARE HIRING!
… is our mood manager, Bigyó :)
![Page 27: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/27.jpg)
BEYOND enbrite.ly
...our investor and event sponsor is looking for talented guys
![Page 28: Startup Safary | Fight against robots with enbrite.ly data platform](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eed8141a28ab31108b45cd/html5/thumbnails/28.jpg)
Joe MÉSZÁROS
lead software engineer
[email protected]
@joemesz @enbritely
joemeszaros enbritely
THANK YOU!
?QUESTIONS?