tackling a 1 billion member social network
TRANSCRIPT
![Page 1: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/1.jpg)
Scala SWAT
Artur Bań[email protected]@abankowski
TACKLING A 1 BILLION MEMBER SOCIAL NETWORK
This is a story about our participation in one of our customer’s project
1
![Page 2: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/2.jpg)
Previous Experience
~10 millions records
~60 millions documents
~60 million vertices graph (non-commercial)
Datasets I have previously worked on were much smaller!
This time we were about to deal with a much bigger social network, which can easily be represented as a graph. We were going to join the team…
![Page 3: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/3.jpg)
Graph structureUser
User
User
Foo Inc.User
User
User
worked
worked
knows
knows
knows
knows
Bar Ltd.
workedknows
2 types of vertices 2 types of relations
Companies and Users
![Page 4: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/4.jpg)
User
User
User
Foo Inc.User
User
User
worked
worked
knows
knows
knows
knows
The concept: provide graph subset
Bar Ltd.
worked
Foo Inc.
User
User
User
User
worked
worked
knows
knows
knows
knows
knows
Fulltext searchable, sortable, filterable result for Foo Inc.
![Page 5: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/5.jpg)
1 billion challenge835,272,759
835,272,759 vertices
751,857,081 users
83,415,678 companies
6,956,990,209 relations
Subsets from thousands to millions users
When we finally put our hands on the data, we have realized that datasize is smaller than expected
What more have we found?
![Page 6: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/6.jpg)
Existing app workflow
One Engineer-EveningMultiple custom scripts
JSON files
extract subseteg: 3m profiles
750m profiles
data duplication
only manually selected60m incl. duplicates
Manual process, takes few days from
user perspective
This was already used by final customers
![Page 7: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/7.jpg)
PoC - The Goal
Handle 1 billion profiles automaticallyOur primary goal
![Page 8: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/8.jpg)
PoC - Definition of done
• Graph traversable on demand
• First results available under 1 minute
• Entire subset ready in few minutes
REST API with search
![Page 9: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/9.jpg)
PoC - Concept
Graph DBDocument
DB
API
6,956,990,209 relations
751,857,081 profiles
Whole dataset stored in two
engines
All relations
All profiles
![Page 10: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/10.jpg)
Why graph DB?
•Fast graph traversal
•Easily extendable with new relations and vertices
•Convenient algorithm description
![Page 11: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/11.jpg)
PoC - Flow
Graph DBDocument
DB
API
1. Generate subset2. Tag documents
3. Perform search with tag as a filter
Generate subset from graph with users and companies relations
Tag documents with unique id in database with all
profiles
Search with unique id as a primary filter
![Page 12: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/12.jpg)
Weapon of choice
? ?Graph DBDocument
DB
API
Scala and Play were natural choice for API app
Some research required for databases
![Page 13: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/13.jpg)
First Steps: Extraction
Extract anonymized profiles, companies
and relations
Cleanup data, sort and generate input
files
few days to pull, streaming with Akka and Slick
To make a research we had to put our hands on the real data
![Page 14: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/14.jpg)
First Steps: Pushing forward
Push profiles for searching purposes
Push vertices and relations for traversal
Document DB
Graph DB
two tools, highly dependent on db engines
![Page 15: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/15.jpg)
Fulltext Searchable Document DB• Mature
• Horizontally scalable
• Fast indexing (~3k documents per second on the single node)
• Well documented
• With scala libraries:
• https://github.com/sksamuel/elastic4s
• https://github.com/evojam/play-elastic4sWe already had significant experience with scaling Elastic Search for 80 millions
![Page 16: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/16.jpg)
Which graph DB?
Considered three engines, OrientDB and Neo4j in
community editions
![Page 17: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/17.jpg)
Goodbye Titan
• Performed well in ~60M tests
• fast traversing • Development has been put
on hold?
![Page 18: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/18.jpg)
• No batch insertion, slow initial load
• Stalled writes after 200 millions of vertices with relations
• Horror stories in the internet
![Page 19: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/19.jpg)
Neo4j FTWhttps://github.com/AnormCypher/AnormCypher
• Fast • Already known • Convenient in Scala with
AnormCypher • Offline bulk loading • Result streaming
![Page 20: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/20.jpg)
Weapon of choice
Graph DBDocument
DB
API
Final stack :)
![Page 21: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/21.jpg)
Final Setup on AWS
Neo4j
API
hr-1 hr-2
ES Cluster
i2.xlarge 4vCPU 30.5GB
i2.2xlarge 8vCPU 61GB
m4.large 2vCPU 8GB
• 2 nodes • 2 indexes • 10 shards each index • 0 replicas
![Page 22: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/22.jpg)
Step #1 - Bulk loading into Neo4j
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported: 835273352 nodes 6956990209 relationships 0 properties
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported: 835273352 nodes 6956990209 relationships 0 properties
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported: 835273352 nodes 6956990209 relationships 0 properties
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported: 835273352 nodes 6956990209 relationships 0 properties
It took 12 hours on 2 times smaller Amazon instance
![Page 23: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/23.jpg)
Step #2 - Bulk loading into ES
grouped insert
1. Create Source from CSV file, frame by \n2. Decode id from ByteString, generate user json3. Group by the bulk size (eg.: 4500)4. Throttle5. Execute bulk insert into the ElasticSearch
throttle
Akka advantage: CPU utilization,
parallel data enrichment,
human readable
![Page 24: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/24.jpg)
Step #2 - Bulk loading into ES FileIO.fromFile(new File(sourceOfUserIds)) .via( Framing.delimiter(ByteString('\n'), maximumFrameLength = 1024, allowTruncation = false)) .mapAsyncUnordered(16)(prepareUserJson) .grouped(4500) .throttle( elements = 2, per = 2 seconds, maximumBurst = 2, mode = ThrottleMode.Shaping) .mapAsyncUnordered(2)(executeBulkInsert) .runWith(Sink.ignore)
Flow description with Akka Streams
![Page 25: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/25.jpg)
User
User
User
Foo Inc.User
User
User
worked
worked
knows
knows
knows
knows
Step #3 - Tagging
Bar Ltd.
worked
Foo Inc.
User
User
User
User
worked
worked
knows
knows
knows
knows
knows
MATCH (c:Company)-[:worked]-(employees:User)-[:knows]-(aquitance:User) WHERE ID(c)={foo-inc-id} AND NOT (aquitance)-[:worked]-(c) RETURN DISTINCT ID(aquitance)
Neo4j traversal query in CYPHER
![Page 26: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/26.jpg)
Akka Streams to the rescue
val idEnum : Enumeratee[CypherRow] = _
val src = Source.fromPublisher(Streams.enumeratorToPublisher(idEnum)) .map(_.data.head.asInstanceOf[BigDecimal].toInt) .via(new TimeoutOnEmptyBuffer()) .map(UserId(_)) .mapAsyncUnordered(parallelism = 1)(id => dao.tagProfiles(id, companyId))
Readable flow Buffering to protect Neo4j when indexing is too slow Timeout due to the bug in the underlying implementation
![Page 27: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/27.jpg)
Bottleneck
1.5 hour for 3 million subset, that’s too long!
![Page 28: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/28.jpg)
Bulk update with AkkaStream tuning
src .grouped(20000) .throttle( elements = 2, per = 6 second, maximumBurst = 2, mode = ThrottleMode.Shaping) .mapAsyncUnordered(parallelism = 1)(ids => dao.bulkTag(ids, ...))
Bulk tagging, few lines
![Page 29: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/29.jpg)
Tagging Foo-Company
~14 seconds until first batch is tagged
~7 minutes 11 seconds until all tagged
reference implementation - few hours
2,222,840 profiles matching criteria
Sample subset :)
![Page 30: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/30.jpg)
Step #5 - Search
Neo4j
API
hr-1 hr-2
ES Cluster
i2.xlarge 4vCPU 30.5GBi2.2xlarge
8vCPU 61GB
m4.large 2vCPU 8GB
With data ready in ES search implementation was pretty straightforward
![Page 31: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/31.jpg)
Step #5 - Search benchmark
Fulltext search on 2 millions subset
GET /users?company=foo-company&phrase=John
2000 phrases for the benchmarking
Response JSON with 50 profiles
Searching in 750 millions database
Phrases based on real names and surnames, used during profiles
enrichment
Random requests ordering
![Page 32: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/32.jpg)
Step #5 - Search under siegeResponse time ~ 0.14s50 concurrent users
constant latency
constant search rate
![Page 33: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/33.jpg)
Objective achieved
search response time from API ~0.14s
14 seconds until first batch is tagged
7 minutes until 2 millions ready
![Page 34: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/34.jpg)
Summaryreference implementation PoC
manual automatic
few days 14 seconds
few days 7 minutes
~40 millions ~750 millions
no analytics GraphX readyTool ready for data scientists W can implement core traversal modifications almost instantly
![Page 35: Tackling a 1 billion member social network](https://reader038.vdocuments.us/reader038/viewer/2022103105/587117451a28abac6d8b76fd/html5/thumbnails/35.jpg)
Summary
https://snap.stanford.edu/data/
http://neo4j.com/
https://playframework.com/
http://doc.akka.io/docs/akka-stream-and-http-experimental/2.0/scala/stream-index.html
Try at home! Sample graphs ready to use
https://www.elastic.co/
Artur Bań[email protected]@abankowski