Is Spark the right choice for data analysis?

Ahmed Kamal, Big Data Engineer, http://ahmedkamal.me

Resources

●“Advanced Analytics with Spark”, a practical book!

●“The thing I like most about this book is its focus on examples, which are all drawn from real applications on real-world data sets.” - Matei Zaharia, CTO at Databricks.

●It is all about developing data applications using Spark

Data applications, like what?

●Build a model to detect credit card fraud using thousands of features and billions of transactions.

●Intelligently recommend millions of products to millions of users.

●Estimate financial risk through simulations of portfolios including millions of instruments.

●Easily manipulate data from thousands of human genomes to detect genetic associations with disease.

Doing something useful with data

●Often, “doing something useful” = Placing a schema over it and using SQL to answer questions like

●“of the gazillion users who made it to the third page in our registration process, how many are over 25?”

●The field of how to structure a data warehouse and organize information to make answering these kinds of questions easy is a rich one.
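The kind of question quoted above can be sketched with a schema and one SQL query. This is a minimal illustration using Python's built-in sqlite3 module; the table name, column names, and rows are invented for the example and are not from the original slides.

```python
import sqlite3

# Hypothetical schema and data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registration_events (user_id INTEGER, age INTEGER, page INTEGER)"
)
rows = [
    (1, 31, 1), (1, 31, 2), (1, 31, 3),  # user 1 reached page 3, over 25
    (2, 22, 1), (2, 22, 2), (2, 22, 3),  # user 2 reached page 3, not over 25
    (3, 40, 1), (3, 40, 2),              # user 3 stopped at page 2
    (4, 27, 1), (4, 27, 3),              # user 4 reached page 3, over 25
]
conn.executemany("INSERT INTO registration_events VALUES (?, ?, ?)", rows)

# "Of the users who made it to the third page, how many are over 25?"
query = """
    SELECT COUNT(DISTINCT user_id)
    FROM registration_events
    WHERE page = 3 AND age > 25
"""
over_25_on_page_3 = conn.execute(query).fetchone()[0]
print(over_25_on_page_3)
```

Once the schema exists the question is a one-liner, which is why this style of analysis is so common.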

A new superpower !

●When people say that we live in an age of “big data,” they mean that we have tools for collecting, storing, and processing information at a scale previously unheard of.

●There is a gap between having access to these tools and all this data, and doing something useful with it.

Doing extra useful things

●Requirements :

a- Flexible programming model

b- Rich functionality in machine learning and statistics

●Existing Tools :

R, Python (PyData stack) and Octave

Pros: little effort, easy to use.

Cons: viable only for small data sets, and too complex to redesign so that they work over clusters of computers.

Why is it difficult?

●Some algorithms (like machine learning algorithms) have wide data dependencies.

–Data is partitioned across nodes.

–Network transfer is much slower than memory access.

●What about the probability of failures?

●Summary: we need a programming paradigm that is sensitive to the characteristics of the underlying system, that encourages good choices, and that makes it easy to write parallel code.

High-Performance Computing

●Use case: processing a large file full of DNA sequencing reads in parallel.

●1- Manually split the file into smaller files.

●2- Submit a job for each file split to the scheduler.

●3- Monitor the jobs continuously and resubmit any that fail.

●All-to-all operations, like sorting the full data set, would require streaming it through a single node or resorting to MPI.

●The level of abstraction is relatively low and the approach is difficult to use, in addition to its high cost.
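The three manual steps above can be sketched on a single machine to show how much bookkeeping lands on the analyst. This is only an illustration: a thread pool stands in for the cluster scheduler, and the helper names (split_into_chunks, count_gc, run_with_resubmission) and the sample reads are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(reads, chunk_size):
    """Step 1: manually split the input into smaller pieces."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def count_gc(chunk):
    """The per-split job: count G and C bases in a chunk of reads."""
    return sum(read.count("G") + read.count("C") for read in chunk)


def run_with_resubmission(job, chunks, max_attempts=3):
    """Steps 2 and 3: submit one job per split, then resubmit any that fail."""
    results = [None] * len(chunks)
    pending = list(range(len(chunks)))
    with ThreadPoolExecutor(max_workers=4) as pool:  # stand-in for a scheduler
        for _ in range(max_attempts):
            if not pending:
                break
            futures = {i: pool.submit(job, chunks[i]) for i in pending}
            failed = []
            for i, future in futures.items():
                try:
                    results[i] = future.result()
                except Exception:
                    failed.append(i)  # remember the split so it is resubmitted
            pending = failed
    if pending:
        raise RuntimeError(f"splits {pending} failed after {max_attempts} attempts")
    return results


reads = ["GATTACA", "CCGG", "ATAT", "GGGC", "TTAA"]
chunks = split_into_chunks(reads, 2)
total_gc = sum(run_with_resubmission(count_gc, chunks))
print(total_gc)
```

Splitting, scheduling, and failure handling are all written by hand here, and there is still no support for an all-to-all operation such as a global sort.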

The 3 truths about data science

●Successful data preprocessing is a must for successful analysis.

–Large data sets require special treatment.

–Feature engineering should be given more time than the algorithms themselves. (A model for fraud detection can use IP location info, login times, click logs.)

–How would you convert features into vectors suitable for ML algorithms?
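One common answer to the last question is to scale numeric features and one-hot encode categorical ones. The sketch below assumes a made-up event layout (ip_country, login_hour, clicks) loosely based on the fraud-detection features mentioned above; it shows one possible encoding, not a recommended feature set.

```python
def build_vocabulary(values):
    """Map each distinct categorical value to a fixed position."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def to_vector(event, country_index):
    """Turn one event into a flat list of floats."""
    one_hot = [0.0] * len(country_index)
    one_hot[country_index[event["ip_country"]]] = 1.0
    return [
        event["login_hour"] / 23.0,  # scale the hour of day into [0, 1]
        float(event["clicks"]),      # raw click count
    ] + one_hot                      # one slot per known country


# Hypothetical events, invented for this sketch.
events = [
    {"ip_country": "EG", "login_hour": 23, "clicks": 4},
    {"ip_country": "US", "login_hour": 0, "clicks": 10},
]
country_index = build_vocabulary(e["ip_country"] for e in events)
vectors = [to_vector(e, country_index) for e in events]
print(vectors)
```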

The 3 truths about data science

●Iteration is the key.

–Popular optimization techniques like gradient descent require repeated scans over the input until convergence.

–You can't get it right the first time. (Features/Algo/Test)

●If the model is for production, it will be rebuilt periodically.
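To make "repeated scans over the input" concrete, here is a small gradient descent sketch that fits a line to four points. The data, learning rate, and tolerance are arbitrary choices for illustration. Every iteration reads the whole data set again, which is why keeping the input in memory pays off.

```python
def gradient_descent(points, learning_rate=0.05, tolerance=1e-9, max_iterations=10000):
    """Fit y = w * x + b by minimising mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for iteration in range(1, max_iterations + 1):
        # Each of these sums is one full scan over the input.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        if abs(grad_w) < tolerance and abs(grad_b) < tolerance:
            break  # converged
    return w, b, iteration


points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # lies on y = 2x + 1
w, b, scans = gradient_descent(points)
print(round(w, 4), round(b, 4), scans)
```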

Analytics between lab and factory

A framework that makes modeling easy but is also a good fit for production systems is a huge win.

Apache Spark In Points

●Spark builds on what Hadoop shines at (linear scalability, fault tolerance).

●Spark supports a DAG (directed acyclic graph) of operators.

●It complements these capabilities with a rich set of transformations.

●In-memory processing (suitable for iterative algorithms).

Apache Spark In Points

●The most important bottleneck that Spark addresses is analyst productivity (no more juggling R, HDFS, MapReduce, etc.).

●Spark is better at being an operational system than most exploratory systems, and better for data exploration than the technologies commonly used in operational systems.

●It stands on top of the JVM and integrates well with the Hadoop ecosystem.

Spark from the other side!

●Still young compared to MapReduce.

●Its main components (stream processing, SQL, machine learning, and graph processing) need a lot of work to become mature.

–MLlib’s pipelines and transformer API model are still in progress.

–Its statistics and modeling functionality comes nowhere near that of single-machine languages like R.

–Its SQL functionality is rich, but still lags far behind that of Hive.

Spark Programming Model

●It starts with one or more data sets residing in distributed persistent storage (like HDFS).

●Writing a Spark program typically consists of a few related steps:

–Defining a set of transformations on input data sets.

–Invoking actions that output the transformed data sets to persistent storage or return results to the driver’s local memory.

–Running local computations that operate on the results computed in a distributed fashion. These can help you decide what transformations and actions to undertake next.
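The three steps above can be mimicked on a single machine to show the shape of the model. The ToyDataset class below is a hypothetical stand-in written for this sketch and is not Spark's API. It illustrates only that transformations are recorded lazily and that nothing runs until an action is invoked.

```python
class ToyDataset:
    """Single-machine stand-in for a distributed data set (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: record the step and return a new data set; nothing runs.
    def map(self, fn):
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # Actions: run the recorded steps and hand results back to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def parse(line):
    calls.append(line)  # record that the function actually ran
    return int(line)


lines = ["3", "8", "15", "4"]
large = ToyDataset(lines).map(parse).filter(lambda n: n > 3)  # 1: transformations
ran_before_action = len(calls)                                # still 0: lazy
result = large.collect()                                      # 2: action
largest = max(result)                                         # 3: local computation
print(ran_before_action, result, largest)
```

In Spark the recorded steps would form the DAG of operators, and the action would trigger distributed execution on the cluster.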

Why should you consider Scala ?

●Spark already has wrappers for other languages (Java, Python).

●It reduces performance overhead. (Running code written in another language on top of the JVM means passing code and data between environments.)

●It gives you access to the latest and greatest.

●It will help you understand the Spark philosophy.

–If you know how to use Spark in Scala, even if you primarily use it from other languages, you’ll have a better understanding of the system and will be in a better position to “think in Spark.”

“If you are immune to boredom, there is literally nothing you cannot accomplish.” - David Foster Wallace

Data Science's First Step

●Data cleansing is the first step in any data science project.

●Many clever analyses have been undone because the data analyzed had fundamental quality problems or bias problems.

●It is dull work that you have to do before you can get to the really cool machine learning algorithm that you’ve been dying to apply to a new problem.

Our First Real Problem !

●Name : Record Linkage

●Description :

–We have a large collection of records from one or more source systems.

–It is likely that some of the records refer to the same underlying entity, such as a customer or a patient.

–Each of the entities has a number of attributes, such as a name or address.

–We will need to use these attributes to find the records that refer to the same entity.

The Challenge

●Challenge :

–The values of these attributes aren’t perfect

–Values might have different formatting, or typos, or missing information.

–These differences are easy for a human to spot at a glance, but difficult for a computer to learn.
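A rough way to let a program tolerate formatting differences, typos, and missing values is to normalize each attribute and score string similarity. The sketch below uses difflib.SequenceMatcher from Python's standard library; the sample records, the normalization rules, and the averaging scheme are assumptions made for illustration, not the method used later in the talk.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase, drop simple punctuation, and collapse whitespace."""
    if value is None:
        return ""
    return " ".join(value.lower().replace(".", " ").replace(",", " ").split())


def field_similarity(a, b):
    """Return a score in [0, 1], or None when either value is missing."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return None  # a missing value gives no evidence either way
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1, r2, fields):
    """Average the similarity over the fields present in both records."""
    scores = [field_similarity(r1.get(f), r2.get(f)) for f in fields]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical records, invented for this sketch.
a = {"name": "Jon A. Smith", "address": "12 Main St."}
b = {"name": "john a smith", "address": None}  # typo, new format, missing field
c = {"name": "Maria Garcia", "address": "98 Oak Ave"}

score_ab = record_similarity(a, b, ["name", "address"])
score_ac = record_similarity(a, c, ["name", "address"])
print(round(score_ab, 3), round(score_ac, 3))
```

Records a and b score much higher than a and c, but choosing the threshold that separates "same entity" from "different entity" is exactly the part that is hard to get right.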

Steps we are going to take

●Bringing Data from the Cluster to the Client

●Shipping Code from the Client to the Cluster

●Structuring Data with Tuples and Case Classes

●Getting some numbers regarding our data.
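The last two steps can be previewed in plain Python, with a dataclass playing the role a Scala case class would play. The file layout below (two ids, per-field match scores, a match flag, and "?" for missing scores) is an assumed format invented for this sketch, not a description of the talk's actual data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)
class MatchData:
    """Structured view of one line, similar in spirit to a Scala case class."""
    id1: int
    id2: int
    scores: List[Optional[float]]
    matched: bool


def to_score(text):
    # Assumption: a missing score is written as "?" in the raw file.
    return None if text == "?" else float(text)


def parse(line):
    pieces = line.split(",")
    return MatchData(
        id1=int(pieces[0]),
        id2=int(pieces[1]),
        scores=[to_score(p) for p in pieces[2:-1]],
        matched=pieces[-1].strip().lower() == "true",
    )


# Hypothetical raw lines, invented for this sketch.
raw_lines = [
    "id_1,id_2,name_score,address_score,is_match",
    "1,2,0.95,?,TRUE",
    "3,4,0.20,0.10,FALSE",
    "5,6,0.85,1.0,TRUE",
]
records = [parse(line) for line in raw_lines if not line.startswith("id_1")]

# Getting some numbers about the data.
match_count = sum(r.matched for r in records)
missing_addresses = sum(1 for r in records if r.scores[1] is None)
mean_name_score = mean(r.scores[0] for r in records if r.scores[0] is not None)
print(match_count, missing_addresses, round(mean_name_score, 3))
```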

The End

Thanks a lot :)
