10 big data technologies you didn't know about
TRANSCRIPT
![Page 1: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/1.jpg)
Big Data Technologies You Didn’t Know About
![Page 2: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/2.jpg)
About Us
• Emerging technology firm focused on helping enterprises build breakthrough software solutions
• Building software solutions powered by disruptive enterprise software trends
  - Machine learning and data science
  - Cyber-security
  - Enterprise IoT
  - Powered by Cloud and Mobile
• Bringing innovation from startups and academic institutions to the enterprise
• Award winning agencies: Inc 500, American Business Awards, International Business Awards
![Page 3: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/3.jpg)
Agenda
• Big data technologies you didn’t know about
  - Apache Flink
  - Apache Samza
  - Google Cloud Dataflow
  - StreamSets
  - TensorFlow
  - Apache NiFi
  - Druid
  - LinkedIn WhereHows
  - Microsoft Cognitive Services
![Page 4: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/4.jpg)
Two Goals…
![Page 5: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/5.jpg)
Think Beyond Traditional Big Data Stacks
![Page 6: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/6.jpg)
Learn from Companies Building Big Data Pipelines at Scale
![Page 7: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/7.jpg)
Big Data pipelines in the enterprise
![Page 8: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/8.jpg)
Areas of a Big Data Pipeline
• Data Processing
• Stream Data Ingestion
• Data Transformations
• Cognitive Computing
• Machine Learning
• High Performance Data Access
![Page 9: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/9.jpg)
Data Processing
![Page 10: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/10.jpg)
Technology Stacks You Know
![Page 11: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/11.jpg)
But You Probably Didn’t Know About….
![Page 12: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/12.jpg)
Apache Flink
• Apache Flink, like Apache Hadoop and Apache Spark, is a community-driven open source framework for distributed Big Data analytics.
• The Apache Flink engine uses data streaming, in-memory processing, and iteration operators to improve performance.
• Apache Flink originated in a research project called Stratosphere, conceived in 2008 by Professor Volker Markl of the Technische Universität Berlin in Germany.
• In German, "flink" means agile or swift. Flink joined the Apache Incubator in April 2014 and graduated as an Apache Top-Level Project (TLP) in December 2014.
![Page 13: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/13.jpg)
Apache Flink
• Draws on concepts from MPP database technology
  - Declarativity
  - Query optimization
  - Efficient parallel in-memory and out-of-core algorithms
• Draws on concepts from Hadoop MapReduce technology
  - Massive scale-out
  - User Defined Functions
  - Complex data types
  - Schema on read
• Add
  - Streaming
  - Iterations
  - Advanced dataflows
  - General APIs
![Page 14: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/14.jpg)
Apache Flink
![Page 15: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/15.jpg)
Apache Flink: An Example
    // Record type shared by both examples
    case class Word(word: String, frequency: Int)

DataSet API (batch):

    val lines: DataSet[String] = env.readTextFile(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()

DataStream API (streaming):

    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS)) // 5-second window, evaluated every second
      .groupBy("word").sum("frequency")
      .print()
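The streaming variant counts words over a 5-second window that is re-evaluated every second. A rough way to see what that means, sketched in plain Python rather than Flink: the function name `sliding_word_counts` and the `(timestamp_seconds, line)` event shape are assumptions made for this illustration only.

```python
from collections import Counter

def sliding_word_counts(events, size=5, slide=1):
    """Yield (window_end, Counter) for windows of `size` seconds, every `slide` seconds.

    `events` is a list of (timestamp_seconds, line) pairs. A window ending at
    time t covers timestamps in the half-open range [t - size, t).
    """
    if not events:
        return
    last = max(ts for ts, _ in events)
    end = slide
    while end - size <= last:
        counts = Counter(
            word
            for ts, line in events
            if end - size <= ts < end
            for word in line.split()
        )
        if counts:  # skip windows that contain no events
            yield end, counts
        end += slide

events = [(0, "to be"), (2, "or not"), (6, "to be")]
windows = dict(sliding_word_counts(events))
```

Because the windows overlap, a single event contributes to several consecutive results, which is the main behavioral difference from the batch version.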
![Page 16: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/16.jpg)
Stream Data Processing
![Page 17: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/17.jpg)
Technology Stacks You Know
![Page 18: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/18.jpg)
But You Probably Didn’t Know About….
![Page 19: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/19.jpg)
Apache Samza
• Created by LinkedIn to extend the capabilities of Apache Kafka
• Simple API
• Managed state
• Fault tolerant
• Durable messaging
• Scalable
• Extensible
• Processor isolation
![Page 20: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/20.jpg)
Apache Samza: Overview
• Samza code runs as a YARN job
• You implement the StreamTask interface, which defines a process() call.
• A StreamTask runs inside a task instance, which itself runs inside a YARN container.
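The shape of that contract can be sketched in a few lines. This is a Python stand-in, not Samza's Java API: `ListCollector`, `UppercaseTask`, and `run_task` are hypothetical names, and the envelope is modelled as a plain dict.

```python
class ListCollector:
    """Stand-in for a message collector: records outgoing messages."""
    def __init__(self):
        self.sent = []

    def send(self, stream, message):
        self.sent.append((stream, message))


class UppercaseTask:
    """Hypothetical task: process() is called once per incoming message."""
    def process(self, envelope, collector, coordinator=None):
        # The envelope is modelled as a dict with "stream" and "message" keys.
        collector.send("uppercased", envelope["message"].upper())


def run_task(task, messages):
    """Toy container loop: feed each message to the task in arrival order."""
    collector = ListCollector()
    for msg in messages:
        task.process({"stream": "input", "message": msg}, collector)
    return collector.sent

sent = run_task(UppercaseTask(), ["a", "b"])
```

The point of the design is that the task only sees one message at a time; scheduling, partitioning, and restarts are left to the surrounding container.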
![Page 21: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/21.jpg)
Apache Samza: Operators
• Filter records matching a condition
• Map record ⇒ func(record)
• Join two or more datasets by key
• Group records with the same value in a field
• Aggregate records within the same group
• Pipe job 1’s output ⇒ job 2’s input
• MapReduce assumes a fixed dataset. Can we adapt this to unbounded streams?
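One way to think about the question above is to process records one at a time and keep running state instead of waiting for the dataset to end. The sketch below uses Python generators to illustrate that idea; it is not Samza code, and `page_views` and `running_count_by_key` are hypothetical names.

```python
from collections import defaultdict
from itertools import count, islice

def page_views():
    """Unbounded source: an endless stream of (user, page) records."""
    users = ["ann", "bob", "ann"]
    for i in count():
        yield (users[i % 3], "/home" if i % 2 == 0 else "/jobs")

def running_count_by_key(records, key):
    """Group + aggregate without a fixed dataset.

    Keeps per-key state and emits the updated count after every record.
    """
    counts = defaultdict(int)
    for record in records:
        k = key(record)
        counts[k] += 1
        yield k, counts[k]

home_views = (r for r in page_views() if r[1] == "/home")         # filter
per_user = running_count_by_key(home_views, key=lambda r: r[0])   # group + aggregate
first_three = list(islice(per_user, 3))                           # take a finite prefix
```

Nothing is computed until a consumer pulls from `per_user`, so the pipeline works even though the source never terminates.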
![Page 22: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/22.jpg)
Apache Samza: Sample Code
![Page 23: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/23.jpg)
Data Transformation
![Page 24: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/24.jpg)
Technology Stacks You Know
![Page 25: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/25.jpg)
But You Probably Didn’t Know About….
![Page 26: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/26.jpg)
Google Cloud Dataflow
• Native Google Cloud data processing service
• Simple programming model for batch and streaming data processing tasks
• Provides a managed service to control the execution of data processing jobs
• Data processing jobs can be authored using the Dataflow SDKs (Apache Beam)
![Page 27: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/27.jpg)
Google Cloud Dataflow: Details
• A pipeline encapsulates an entire series of computations that accepts input data from external sources, transforms that data, and produces output data.
• A PCollection represents a collection of data (bounded or unbounded) flowing through a pipeline
• Sources and sinks abstract read and write operations in a pipeline
• Google Cloud Dataflow provides management, monitoring and security capabilities for data pipelines
![Page 28: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/28.jpg)
Google Cloud Dataflow is Based on Apache Beam
• 1. Portable - You can use the same code with different runners (abstraction) and backends on premises, in the cloud, or locally
• 2. Unified - Same unified model for batch and stream processing
• 3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.
• 4. Extensible model and SDK - Extensible API; can define custom sources and sinks to read and write in parallel
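The windowing, watermark, and lateness terms are easier to follow with a small worked case. The following is a plain-Python sketch, not the Beam SDK: `fixed_windows` is a hypothetical helper, and the watermark is simplified to "the largest event time seen so far", which real runners estimate differently.

```python
from collections import defaultdict

def fixed_windows(events, size, allowed_lateness=0):
    """Sum (event_time, value) pairs into fixed event-time windows.

    An event is dropped as late once the watermark has passed the end of
    its window by more than `allowed_lateness`.
    """
    sums = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for event_time, value in events:  # iteration order = arrival order
        window_start = (event_time // size) * size
        window_end = window_start + size
        if watermark > window_end + allowed_lateness:
            dropped.append((event_time, value))
        else:
            sums[window_start] += value
        watermark = max(watermark, event_time)
    return dict(sums), dropped

# Events arrive out of order: time 8 and 3 show up after time 12.
arrivals = [(1, 10), (12, 5), (8, 1), (3, 7), (25, 2), (4, 9)]
sums, dropped = fixed_windows(arrivals, size=10, allowed_lateness=5)
```

Here the events at times 8 and 3 are still accepted into window [0, 10) because the watermark (12) is within the allowed lateness, while the event at time 4 arrives after the watermark has reached 25 and is dropped.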
![Page 29: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/29.jpg)
But You Probably Didn’t Know About….
![Page 30: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/30.jpg)
StreamSets Data Collector
• Data processing platform optimized for data in motion
• Visual data flow authoring model
• Open source distribution model
• On-premise and cloud distributions
• Rich monitoring and management interfaces
![Page 31: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/31.jpg)
StreamSets Data Collector: Details
• Data collectors stream and process data in real time using data pipelines
• A pipeline describes a data flow from origin to destination
• A pipeline is composed of origins, destinations and processors
• Extensibility model based on JavaScript and Jython
• The lifecycle of a data collector can be controlled via the administration console
![Page 32: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/32.jpg)
Machine Learning
![Page 33: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/33.jpg)
Technology Stacks You Know
![Page 34: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/34.jpg)
But You Probably Didn’t Know About….
![Page 35: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/35.jpg)
TensorFlow
• Second-generation machine learning system, the successor to DistBelief
• TensorFlow grew out of a project at Google, called Google Brain, aimed at applying various kinds of neural network machine learning to products and services across the company.
• An open source software library for numerical computation using data flow graphs
• Used in the following projects at Google
  1. DeepDream
  2. RankBrain
  3. Smart Reply
  And many more
![Page 36: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/36.jpg)
TensorFlow: Details
• Data flow graphs describe mathematical computation with a directed graph of nodes & edges
• Nodes in the graph represent mathematical operations.
• Edges represent the multidimensional data arrays (tensors) communicated between them.
• Edges describe the input/output relationships between nodes.
• The flow of tensors through the graph is where TensorFlow gets its name.
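The node-and-edge model can be illustrated without TensorFlow itself. The sketch below is a minimal pure-Python dataflow graph, with lists standing in for 1-D tensors; `Node`, `placeholder`, `elementwise`, and `run` are hypothetical names for this illustration, not TensorFlow's API.

```python
import operator

class Node:
    """A graph node: an operation plus the nodes whose outputs feed it."""
    def __init__(self, op=None, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name

def placeholder(name):
    """A node with no operation; its value is supplied at run time."""
    return Node(name=name)

def elementwise(fn):
    """Lift a scalar function to equal-length lists (stand-ins for tensors)."""
    return lambda a, b: [fn(x, y) for x, y in zip(a, b)]

def run(node, feed):
    """Evaluate `node` by first evaluating the nodes on its input edges."""
    if node.op is None:
        return feed[node.name]
    return node.op(*(run(i, feed) for i in node.inputs))

# Build the graph for y = x * w + b; nothing is computed yet.
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
product = Node(elementwise(operator.mul), (x, w))
y = Node(elementwise(operator.add), (product, b))

# Values flow along the edges only when the graph is run with concrete inputs.
result = run(y, {"x": [1, 2, 3], "w": [2, 2, 2], "b": [1, 1, 1]})
```

Separating graph construction from execution is what lets a real system optimize the graph or place parts of it on different devices before any data flows.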
![Page 37: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/37.jpg)
TensorFlow
• Tensor
• Variable
• Operation
• Session
• Placeholder
• TensorBoard
![Page 38: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/38.jpg)
Fast Data Access
![Page 39: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/39.jpg)
Technology Stacks You Know
![Page 40: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/40.jpg)
But You Probably Didn’t Know About….
![Page 41: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/41.jpg)
Druid
• Druid was started in 2011
  ‣ Power interactive data applications
  ‣ Multi-tenancy: lots of concurrent users
  ‣ Scalability: trillions of events/day, sub-second queries
  ‣ Real-time analysis
• Key Features
  ‣ Low latency ingestion
  ‣ Fast aggregations
  ‣ Arbitrary slice-n-dice capabilities
  ‣ Highly available
  ‣ Approximate & exact calculations
![Page 42: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/42.jpg)
Druid: Details
• Realtime Node
• Historical Node
• Broker Node
• Coordinator Node
• Indexing Service
![Page 43: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/43.jpg)
Druid: Details
• Realtime Node
• Historical Node
• Broker Node
• Coordinator Node
• Indexing Service
• JSON based query language
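To give a feel for the JSON query language, here is a sketch that builds a timeseries query as a Python dictionary and serializes it. The field names follow Druid's native query format as commonly documented; the data source `page_events`, the metric `views`, and the helper `timeseries_query` are made up for the example.

```python
import json

def timeseries_query(data_source, interval, metric, granularity="hour"):
    """Build a native timeseries query that sums one metric per time bucket."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": [interval],  # ISO-8601 interval, start/end
        "aggregations": [
            {"type": "longSum", "name": f"total_{metric}", "fieldName": metric}
        ],
    }

query = timeseries_query("page_events", "2016-01-01/2016-01-02", "views")
body = json.dumps(query, indent=2)  # request body for an HTTP POST
```

A body like this would typically be sent over HTTP to a broker node, which fans the query out to the realtime and historical nodes and merges their results.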
![Page 44: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/44.jpg)
Low Latency Data Flows
![Page 45: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/45.jpg)
Technology Stacks You Know
![Page 46: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/46.jpg)
But You Probably Didn’t Know About….
![Page 47: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/47.jpg)
Apache NiFi
• Powerful and reliable system to process and distribute data
• Directed graphs of data routing and transformation
• Web-based User Interface for creating, monitoring, & controlling data flows
• Highly configurable - modify data flow at runtime, dynamically prioritize data
• Data provenance tracks data through the entire system
• Easily extensible through development of custom components
![Page 48: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/48.jpg)
Apache NiFi: Architecture
![Page 49: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/49.jpg)
Apache NiFi: Concepts
• FlowFile
  - Unit of data moving through the system
  - Content + Attributes (key/value pairs)
• Processor
  - Performs the work, can access FlowFiles
• Connection
  - Links between processors
  - Queues that can be dynamically prioritized
• Process Group
  - Set of processors and their connections
  - Receive data via input ports, send data via output ports
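These concepts map onto a few small data structures. The sketch below models them in plain Python to show how a prioritized connection changes delivery order; it is not NiFi's Java API, and the class and function names (`FlowFile`, `Connection`, `tag_size`) are stand-ins chosen for this illustration.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass
class FlowFile:
    """Unit of data: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)

class Connection:
    """Queue between two processors; the lowest priority value is delivered first."""
    def __init__(self, prioritizer):
        self._heap, self._seq, self.prioritizer = [], count(), prioritizer

    def put(self, flowfile):
        # The sequence number breaks ties, so FlowFiles themselves are never compared.
        heapq.heappush(self._heap, (self.prioritizer(flowfile), next(self._seq), flowfile))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

def tag_size(flowfile):
    """Processor: reads the content and records its size as an attribute."""
    flowfile.attributes["size"] = len(flowfile.content)
    return flowfile

queue = Connection(prioritizer=lambda ff: ff.attributes.get("priority", 9))
queue.put(FlowFile(b"bulk log line", {"priority": 5}))
queue.put(FlowFile(b"alert", {"priority": 1}))
processed = [tag_size(queue.get()) for _ in range(len(queue))]
```

Although the alert was enqueued second, it is processed first because the connection orders by the `priority` attribute rather than by arrival time.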
![Page 50: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/50.jpg)
Data Discovery
![Page 51: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/51.jpg)
Technology Stacks You Know
![Page 52: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/52.jpg)
But You Probably Didn’t Know About….
WhereHows
![Page 53: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/53.jpg)
LinkedIn WhereHows
• Where is my data? How did it get there?
• Enterprise data catalog
• Metadata search
• Collaboration
• Data lineage analysis
• Connectivity to many data sources and ETL tools
• Powers LinkedIn's data discovery layer
![Page 54: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/54.jpg)
LinkedIn WhereHows: Architecture
• Web interface for data discovery
• API enabled
• Backend server that controls the metadata crawling and integration with other systems
![Page 55: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/55.jpg)
LinkedIn WhereHows: Data Lineage
• Collects metadata from ETL platforms and scripts
• Sources include
  - Pig
  - MapReduce
  - Informatica
  - Teradata
• Visualizes the lineage information associated with a data source
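Conceptually, lineage is a graph walk over job metadata. The following Python sketch shows that idea under simple assumptions; it does not reflect WhereHows' actual schema or API, and the job records and the `upstream` helper are invented for the example.

```python
from collections import deque

# Hypothetical lineage metadata: each ETL job maps input datasets to one output.
jobs = [
    {"job": "pig_clean_events", "inputs": ["raw_events"], "output": "clean_events"},
    {"job": "mr_sessionize", "inputs": ["clean_events", "members"], "output": "sessions"},
    {"job": "informatica_load", "inputs": ["sessions"], "output": "dw.sessions"},
]

def upstream(dataset, jobs):
    """Return every dataset that `dataset` was derived from, nearest first."""
    produced_by = {j["output"]: j for j in jobs}
    seen, order, todo = set(), [], deque([dataset])
    while todo:
        current = todo.popleft()
        job = produced_by.get(current)
        for parent in (job["inputs"] if job else []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                todo.append(parent)
    return order

lineage = upstream("dw.sessions", jobs)
```

The breadth-first order answers "how did it get there?" from the closest producer outward, which is also a natural order for drawing the lineage graph.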
![Page 56: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/56.jpg)
Cognitive Computing
![Page 57: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/57.jpg)
Technology Stacks You Know
![Page 58: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/58.jpg)
But You Probably Didn’t Know About….
![Page 59: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/59.jpg)
Microsoft Cognitive Services
• Based on Project Oxford and Bing
• Offers 22 cognitive computing APIs
• Main categories include:
  - Vision
  - Speech
  - Language
  - Knowledge
  - Search
• Integrated with Cortana Intelligence Suite
![Page 60: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/60.jpg)
Microsoft Cognitive Services
![Page 61: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/61.jpg)
Microsoft Cognitive Services: Developer Experience
• 22 different REST APIs that abstract cognitive capabilities
• SDKs for Windows, iOS, Android and Python
• Open source
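Because the capabilities are exposed as REST APIs, a call is just an authenticated HTTP request. The sketch below prepares such a request with Python's standard library without sending it. The endpoint URL and payload are placeholders rather than a real service address, and `build_request` is a hypothetical helper; the `Ocp-Apim-Subscription-Key` header is the one these APIs commonly use for the subscription key.

```python
import json
import urllib.request

def build_request(endpoint, subscription_key, payload):
    """Prepare (but do not send) a JSON POST to a cognitive REST endpoint."""
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": subscription_key,
        },
        method="POST",
    )

request = build_request(
    "https://example.invalid/vision/analyze",  # placeholder URL, not a real endpoint
    "YOUR_KEY",                                # placeholder subscription key
    {"url": "https://example.invalid/photo.jpg"},
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

The SDKs listed above wrap this same request-and-response pattern in language-specific clients.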
![Page 62: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/62.jpg)
Summary
• The big data ecosystem is constantly evolving
• There are a lot of relevant new technologies beyond the traditional Hadoop-Spark stacks
• Big internet companies are leading innovation in the space
![Page 63: 10 Big Data Technologies you Didn't Know About](https://reader031.vdocuments.us/reader031/viewer/2022030217/58873baa1a28abc0748b6ad7/html5/thumbnails/63.jpg)
Thanks