Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka

Download Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka

Post on 08-Jan-2017

7.498 views

Category:

Technology

3 download

Embed Size (px)

TRANSCRIPT

<ul><li><p>Real-time Data Integration with Apache </p><p>Flink &amp; Kafka @Bouygues Telecom</p><p>Mohamed Amine ABDESSEMED</p><p>Flink Forward, Berlin, October 2015</p></li><li><p>About Me</p><p> Software engineer &amp; solution architect @ Bouygues </p><p>Telecom</p><p> My daily Toolbox</p><p> Hadoop ecosystem</p><p> Apache Flink</p><p> Apache Kafka </p><p> Elasticsearch</p><p> Apache Camel </p><p> And more.</p><p> If I dont see me coding, I am probably outside running</p><p>2</p></li><li><p>Outline</p><p> Who we are</p><p> Logged User eXperience</p><p> Challenges</p><p> Typical Data Flow pipeline on Hadoop</p><p> Real-time Data Integration</p><p> Apache Flink : The Elegant</p><p> Data Integration use case</p><p> What we loved using Flink </p><p> Whats Next ?</p><p>3</p></li><li><p>BOUYGUES TELECOM</p><p>14M Clients</p><p>11,4M Mobile </p><p>subscriber</p><p>2,6M Fixed customer</p><p>A very Innovative company</p><p>Leader 4G/4G+/UHMD</p><p>First Android based TV BOX</p><p>4</p><p>Mobile . Fixed . TV . Internet . Cloud</p></li><li><p>BOUYGUES TELECOM</p><p>2135819137</p><p>14479</p><p>10595</p><p>BouyguesTelecom</p><p>Orange Free SFR</p><p>nPerf Mobile Data Networks Global score 2G/3G/4G (Q3-</p><p>2015)</p><p>Bouygues Telecom</p><p>Orange</p><p>Free</p><p>SFR</p><p>71% 72%</p><p>33%</p><p>53%</p><p>BouyguesTelecom</p><p>Orange Free SFR</p><p>Population covered in 4G</p><p>Bouygues Telecom Orange Free SFR</p><p>5</p></li><li><p>AT THE HUB OF OUR 14 MILLION CUSTOMERS' DIGITAL</p><p>LIVES</p><p>AND WE GIVE THEMGENUINEREASONS TO STAY LOYAL</p></li><li><p>LUX: Logged User eXperienceMobile QoE</p><p> Produce Mobile QoE indicators from massive </p><p>network equipments event logs (4 Billions/day).</p><p> Goals:</p><p> QoE (User) instead of QoS (Machine).</p><p> Real-time Diagnostic (</p></li><li><p>LUX: Logged User eXperienceMobile QoE</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>8</p><p>Challenges</p></li><li><p>Challenges</p><p>1. Data movement </p><p>9</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p><p>log</p><p>s</p></li><li><p>Challenges</p><p>1. Data movement </p><p>2. Data Processing</p><p> Data is generally too raw to be used directly.</p><p> How can we transform it ?</p><p> How can we make the results available as soon as possible ?</p><p>10</p></li><li><p>Typical Data Flow pipeline on Hadoop</p><p>HADOOP</p><p>FTP</p><p>HTTP</p><p>SCOOP</p><p>FLUME</p><p>FS PUT</p><p>Client </p><p>1</p><p>Client </p><p>2</p><p>Client</p><p>...</p><p>Client </p><p>x</p><p>Impala</p><p>Hive</p><p>Hbase</p><p>TO</p><p>System 1</p><p>System 2</p><p>System ...</p><p>System x</p><p>Raw </p><p>Data</p><p>DFS</p><p>Enriched </p><p>Data</p><p>V1</p><p>DFS</p><p>Enriched </p><p>Data</p><p>V...</p><p>DFS</p><p>Enriched </p><p>Data</p><p>Vx</p><p>DFS</p><p>Data </p><p>source 1</p><p>Data </p><p>Source 2</p><p>Data </p><p>Source x</p><p>Data </p><p>Source ...</p><p>Batch Batch Batch</p><p>11</p></li><li><p>12</p><p>THE WORLD MOVES FAST</p><p>DATA MUST MOVE FASTER</p></li><li><p>Real-time Data Integration</p><p> Inspired by Linkedins Kafka Data </p><p>Integration design pattern.</p><p> Take all the data and it into a central log </p><p>repository for real-time subscription.</p><p>What if the data is too raw to </p><p>used, even binary encoded with </p><p>no visible business logic </p><p>information?13</p></li><li><p>Real-time Data Integration</p><p>Solution 01: Process Data before pushing it to Kafka</p><p> Not viable:</p><p> Data sources have limited computation resources dedicated to </p><p>log collection.</p><p> Not scalable.</p><p> Too hard to maintain.</p><p> Weve to push it RAW.</p><p>14</p></li><li><p>Real-time Data Integration</p><p>Solution 01: Process Data before pushing it to Kafka</p><p> Solution 02: The consumers/Data subscribers have to </p><p>process Data before using it.</p><p> Drawbacks:</p><p> All the consumers must implement the same business logic, run it </p><p>against the same Data.</p><p> Any changes/updates in the processing logic will require an update of </p><p>all the consumers. </p><p>Provide a useable Data format to each consumer/Data </p><p>subscriber.</p><p>15</p></li><li><p>Real-time Data Integration</p><p>Solution 01: Process Data before pushing it to Kafka</p><p>Solution 02: The consumers/Data subscribers have to process Data before using it</p><p>Solution 03: Process Kafkas raw Data and push it back in </p><p>decoded/enriched format for subscribers. </p><p> Benefits:</p><p> The business logic will be implemented in one place.</p><p> Resource efficient.</p><p> Data subscribers can focus more on their own business logic.</p><p> Simple handling of sources/clients evolution. </p><p> Challenges:</p><p> Keep the Data moving real time.</p><p> The Data processing pipeline must be very fast</p><p>16</p></li><li><p>Real-time Data Integration</p><p>Client 1</p><p>Client 2</p><p>Client...</p><p>Client x</p><p>HADOOP</p><p>Impala</p><p>Hive</p><p>Hbase</p><p>Kafka TOPIC RAW</p><p>Kafka TOPIC V1</p><p>Kafka TOPIC V.</p><p>Kafka TOPIC Vx</p><p>SYS </p><p>1TO</p><p>SYS </p><p>2</p><p>SYS </p><p>...</p><p>TO</p><p>Collector </p><p>App</p><p>Kafka </p><p>Producer</p><p>ARC</p><p>HIV</p><p>E</p><p>SYS </p><p>x</p><p>Enriched </p><p>Data</p><p>Vx</p><p>DFS</p><p>Raw </p><p>Data</p><p>DFS</p><p>Enriched </p><p>Data</p><p>V1</p><p>DFS</p><p>Enriched </p><p>Data</p><p>V...</p><p>DFS</p><p>Data </p><p>source 1</p><p>Data </p><p>Source 2</p><p>Data </p><p>Source x</p><p>Data </p><p>Source ...</p><p>Streaming</p><p>Streaming</p><p>Streaming</p><p>17</p></li><li><p>Real-time Data Integration with </p><p>Apache Kafka and Spark?</p><p> Started a POC on Spark Streaming.</p><p> Didnt answer our needs:</p><p> Poor back pressure handling</p><p>Jobs kept failing OOM on busy hours.</p><p> Micro-batching &amp; Latency.</p><p> Many configuration parameters:</p><p> The tested version used an HDFS WAL as fault tolerance </p><p>mechanism but this should be handled by Kafka.</p><p>18</p></li><li><p>Apache Flink : The Elegant</p><p> True streaming, no more micro-batching !</p><p> Nice back pressure handling.</p><p> Fault tolerant, exactly once processing.</p><p> High-Throughput.</p><p> Scalable.</p><p>An open source platform for distributed stream and batch data processing.</p><p> Rich functional APIs.</p><p> (Almost) no constraints on serialization.</p><p> Control of parallelism at all execution levels.</p><p> Flexibility and ease of extension.</p><p> And more nice stuffs.</p><p>19</p></li><li><p>IoT / Mobile Applications</p><p>20</p><p>Events occur on devices</p><p>Queue / Log</p><p>Events analyzed in a</p><p>data streaming </p><p>system</p><p>Stream Analysis</p><p>Events stored in a log</p></li><li><p>LUX: Logged User eXperience</p></li><li><p>LUX: Logged User eXperience</p><p>4 Billions</p><p>raw events/day</p><p>700 GB/day(Raw Data / snappy </p><p>compressed)</p><p>100 Data Sources</p><p>6 Main Data </p><p>Types</p><p>26 Kafka Topics</p><p> 6 raw </p><p> 20 enriched</p><p>CDH5 Cluster</p><p>2 Brokers Kafka </p><p>Cluster</p><p>20 DataNodes</p><p>750TB</p><p>22</p></li><li><p>23</p><p>Mobile CDR use case</p><p>Client 1</p><p>Client 2</p><p>Client...</p><p>Client x</p><p>HADOOP</p><p>CDR_BIN</p><p>CDR_DECODED</p><p>CDR_ENRICHED</p><p>CDR_ENRICHED_ELASTIC</p><p>Planck-</p><p>Collector </p><p>DECODED</p><p>CDR</p><p>HDFS</p><p>Elast icsearch</p><p>2 Weeks retention</p><p>Other IT Systems ( </p><p>commercial, ...)</p><p>KPI </p><p>Views</p><p>HDFS</p><p>Each Machine </p><p>generates a binary file </p><p>every 5min or 2MB</p><p>Binary </p><p>Decoder</p><p>Common </p><p>CDR </p><p>Enricher</p><p>Lookup Live </p><p>Reference Data</p><p>Alarms/Live </p><p>Counters</p><p>Zabbix</p><p>Elast icsearch </p><p>Formater</p><p>15 min </p><p>Window </p><p>Counters</p><p>Historical </p><p>Data</p><p>ENRICHED </p><p>CDR</p><p>HDFS</p><p>BINARY</p><p>CDR</p><p>HDFS</p><p>REFERENCE </p><p>DATA</p><p>HDFS</p><p>K2ES</p><p>LOGSTASH</p><p>Kafka </p><p>Mirroring</p><p>Lookup Live </p><p>Reference Data</p><p>Network </p><p>equipment</p><p>Network </p><p>equipment</p><p>Network </p><p>equipment</p><p>Network </p><p>equipment</p><p>Network </p><p>equipment</p><p>Network </p><p>equipment</p></li><li><p>And it Rocks !</p><p> We ran stress tests on our biggest raw Kafka topic:</p><p> A day of Data.</p><p> 2 Billions events (480Gib compressed).</p><p> 10 Kafka partitions</p><p> 10 Flink TaskManagers (Only 1GB Memory each) </p><p>Enrich Rate (in Tickets/second)Total Processing time (ms)</p><p>Kafka I/O Duration (ms)</p><p>24</p></li><li><p>And it Rocks !</p><p> We ran stress tests on our biggest raw Kafka topic:</p><p> A day of Data.</p><p> 2 Billions events (480Gib compressed).</p><p> 10 Kafka partitions</p><p> 10 Flink TaskManagers (Only 1GB Memory each) </p><p>Total Processing time (ms)</p><p>Kafka I/O Duration (ms)</p><p>25</p><p>500 000 events/sec</p><p>1 day data </p><p>processed in 1h.</p><p>Enrich Rate (in Tickets/second)</p><p>Less than 200 ms</p><p>Processing Time</p></li><li><p>What we loved using </p><p>Flink/Notable features</p><p> Development cost.</p><p> Ease of testing &amp; development.</p><p>Works exactly the way you expect it to work.</p><p>Local Execution mode.</p><p> No more OOM.</p><p> Efficient resource management.</p><p> Excellent performance even with limited </p><p>resources.</p><p>26</p></li><li><p>What we loved using </p><p>Flink/Notable features</p><p>Viele danke Data-Artisans ! </p><p>Merci Beaucoup</p><p> True streaming from different sources including </p><p>Kafka</p><p>Exactly-once, low-latency, High-throughput </p><p>stream processing.</p><p> Yarn mode features:</p><p>Yarn yarn.maximum-failed-containers</p><p>Yarn detached-mode</p><p>27</p></li><li><p>Whats Next?</p><p> Connect LUX to new sources.</p><p> Use of JobManager High Availability</p><p> Archive Data on HDFS using the new </p><p>filesystem sink.</p><p> Index Elasticsearch Data using the new </p><p>Elasticsearch sink.</p><p> Flink ML.</p><p> Contributions to the Flink Project.</p><p>28</p></li><li><p>Questions</p></li></ul>