matthias kricke_martin grimmer_michael schmeißer - building a real time tweet map with flink in six...
TRANSCRIPT
![Page 1: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/1.jpg)
Building a real time Tweet map with Flink in six weeksOSTMapFast poc development with flink
![Page 2: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/2.jpg)
Proof of concept - an important tool in the industry
• PoC often necessary to show feasibility to customers
• touch several topics:• Scalability• Stream processing• Batch processing• Storage and querying of data
• OSTMap as example PoC
![Page 3: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/3.jpg)
Goals for OSTMap
• Increase trust into big data technologies on customer side• It is easy to build an application
with current technologies• With almost no experience
• Teach students big data technologies• Recruiting
• Bring big data to the university
• Build a real time application to view recent geotagged tweets on a map
• Search for terms and users, show these tweets on a map
• Analytics:• First data science jobs• …
![Page 4: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/4.jpg)
Industry in practice: IT-Ringvorlesung 2016
• A course at the University of Leipzig.• work on projects of local companies• six students• over a period of 6 weeks - no full time
invest• Weekly meetings• Github project: github.com/IIDP/OSTMap
Nico Graebling Vincent Märkl
Hans Dieter Pogrzeba
Christopher SchottChristopher Rost
Kevin Shrestha
Michael Schmeißer
Martin Grimmer
Matthias Kricke
OSTMap
![Page 5: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/5.jpg)
mgm technology partnersWe bring applications into production!
• Innovative software solution provider with application responsibility
• Specialist for highly scalable, transactional online applications
• Central lines of business: Insurance, E-Commerce, E-Government
• Founded in 1994
• 347 employees, 9 offices (2014)
• Revenue: 43,7 Mio € (2014)
• Part of Allgeier SE
![Page 6: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/6.jpg)
ScaDSCompetence center for scalable data services and solutions Dresden/Leipzig
• bundled Big Data research expertise of the TU Dresden and Leipzig University
• Drive Big Data innovations• Bring industry and science together• Knowledge exchange and transfer
![Page 7: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/7.jpg)
Walking skeleton“A Walking Skeleton is a tiny implementation of the system that performs a small end-to-end function. It need not use the final architecture, but it should link together the main architectural components. The architecture and the functionality can then evolve in parallel.” - Alistair Cockburn
gif from http://blog.codeclimate.com/blog/2014/03/20/kickstart-your-next-project-with-a-walking-skeleton
![Page 8: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/8.jpg)
Milestone 1read stream, store data as json file, show tweets, read data from json files
![Page 9: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/9.jpg)
Milestone 2write to and read from accumulo, show tweets on map, full table scans, slow visualization
![Page 10: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/10.jpg)
Milestone 3Term index, geotemporal index, ui improvements, clustering, …
![Page 11: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/11.jpg)
OSTMap – stream, batch, storage and querying
geotagged tweets
webservice
a) stream processing
b) batch processing
c) querying data
![Page 12: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/12.jpg)
Stream processing of incoming data – first version
GeoTweetSource KeyGeneration RawTweetSinkDateExtraction
This enabled us to build a slow term search and a slow map search via full table scans.
time index
data for
![Page 13: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/13.jpg)
Stream processing of incoming data – final version
TermIndexSink
GeoTweetSource KeyGeneration RawTweetSinkDateExtraction
Now we were able to build a faster term and map search and language frequency visualization.
time indexTermExtraction
(tokenizing)
UserExtraction
LanguageFrequencySink
LanguageExtraction
term index
language statistics
GeoTemporalIndexCreation
GeoTemporalIndexSink geotemporal index
data for
1 minute window
sum by language
![Page 14: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/14.jpg)
Batch processing• Initial creation of the term index and
geotemporal index for already processed tweets• Data export• Other statistics like:
• Area/ tweet distance a user covers with his tweets
![Page 15: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/15.jpg)
Storage
Table Row Column Family Column Qualifier Value
RawTweetData (TimeIndex)
timestamp, hash8b + 4b - - raw tweet json
TermIndex term field (user,text) RawTweetData key12b -
LanguageFrequency time bucketYYYYMMDDhhmm language-tag - tweet count
4b
Accumulo table design
![Page 16: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/16.jpg)
Geotemporal Index for OSTMapGeo index
geo data
geohashes usedas row keys in accumulo
…3z6b6c6f6q9p9r9x9zd0d1d2d3d4d5d6…dg
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
partitioned by geohash (z curve)
function from 2d coordinate space to 1d key space
Row CF CQ
geohash RawTweetKey -
![Page 17: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/17.jpg)
Geotemporal Index for OSTMapGeo index – querying?
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
partitioned by geohash
bounding box
calculate coverage of
bounding box
range: [9p]
calculate scan ranges from
coverage
range: [9r]
range: [d0,d1,d2,d3]
…3z6b6c6f6q9p9r9x9zd0d1d2d3d4d5d6…dg
accumulo iteratorsaccumulo iterators
accumulo iterators
result
Row CF CQ
geohash RawTweetKey lat/lon
![Page 18: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/18.jpg)
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
Geotemporal Index for OSTMapAdd some time!
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
partitioned by geohash,with timebuckets
…13z16b16c16f16q19p19r19x19z1d01d11d21d31d41d51d6…1dg
day
lon
lat
…23z26b26c26f26q29p29r29x29z2d02d12d22d32d42d52d6…2dg
…
Row CF CQ
day, geohash
RawTweetKey lat/lon
day 1 day 2 day i …
![Page 19: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/19.jpg)
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
Geotemporal Index for OSTMapWhat about Hotspots?
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
partitioned by geohash,with timebuckets
…13z16b16c16f16q19p19r19x19z1d01d11d21d31d41d51d6…1dg
day
lon
lat
…23z26b26c26f26q29p29r29x29z2d02d12d22d32d42d52d6…2dg
…
Row CF CQ
day, geohash RawTweetKey lat/lon
![Page 20: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/20.jpg)
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
Geotemporal Index for OSTMapWhat about Hotspots?
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
partitioned by geohash,with timebuckets
day
lon
lat
…12d212d312d4
……
Row CF CQ
sb, day, geohash
RawTweetKey lat/lon
…11d211d311d4
…
…02d202d302d4
……
…01d201d301d4
…
…22d222d322d4
……
…21d221d321d4
…
…
spreading byte
node 0
node 1
node 2
node n
• spreading byte = hash(tweet) % 255• reproducable• pre table splits in accumulo
![Page 21: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/21.jpg)
demo
![Page 22: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks](https://reader035.vdocuments.us/reader035/viewer/2022070516/5875511d1a28ab00528b46eb/html5/thumbnails/22.jpg)
Martin Grimmer grimmer[at]informatik.uni-leipzig.deMatthias Kricke kricke[at]informatik.uni-leipzig.de
www.mgm-tp.comwww.scads.de
Thank you
Michael Schmeißer michael.schmeisser[at]mgm-tp.com