analysis of big data and other sources - circabc.europa.eu allegrograph neo4j 7. ... full-text...

Analysis of Big Data and other sources

Outline

1. Introduction to big data
2. A survey on tools
3. Data storage in depth
4. Data processing
5. Practice with R:
   a. Word count with Spark
   b. Graph analysis with Neo4J


Introduction to Big Data


There are different working areas in big data:

● Data storage

● Data processing

● Data mining

● Data visualisation

● Business Intelligence Systems


A Survey on Tools- Data storage

DOCUMENTS: MongoDB, CouchDB, Riak

KEY/VALUE: Riak, Voldemort, Redis, Memcached, Membase, DynamoDB

COLUMNS: Google Bigtable, HBase, Cassandra, Sybase IQ, Hypertable

GRAPHS: FlockDB, OrientDB, AllegroGraph, Neo4J

A Survey on Tools- Data processing

BATCH
● Acquisition: HDFS commands, Sqoop, Flume
● Storage: HDFS, HBase
● Analysis: MapReduce, Spark, Spark SQL, Hive, Pig, Cascading

STREAMING
● Acquisition: Flume
● Storage: Kafka, Kestrel, RabbitMQ, AWS SQS
● Analysis: Storm, Trident, Spark Streaming, Samza

HYBRID
● Lambda, Kappa, Summingbird, Lambdoop, Apache Flink

A Survey on Tools- Data mining

OPEN: SPSS, Weka, RapidMiner, Mahout, Gate, NLTK, KNIME, OpenNN, Scikit-learn, Carrot2, R, Torch

PROPRIETARY: RapidMiner, IBM Watson, SAS Enterprise Miner, Statistica Data Miner, Oracle Data Miner, Microsoft Analysis Services, LIONSolver, ClaraBridge

A Survey on Tools- Data visualisation

Vis.js, D3.js, CartoDB, Plot.ly, Tableau, QlikView, R, HighCharts

A Survey on Tools- Business Intelligence

Pentaho, Actuate, SpagoBI, JasperReports, Tableau, QlikView, Palo, Tactic, IBM Cognos, MicroStrategy, Microsoft PowerBI, Plot.ly


Data Storage in Depth- SQL vs. NoSQL

SQL database limitations:
● Fixed structure and integrity restrictions
● Inefficiency with large numbers of insertions, modifications and deletions
● High complexity when modelling real-life relationships

NoSQL databases:
● NoSQL = Not only SQL
● Store large volumes of data in short periods of time

Data Storage in Depth- NoSQL types

There are basically four types of NoSQL databases, although some of them share characteristics of more than one type:
● Document oriented: The basic unit is the document (e.g. XML, JSON, …)
● Key/value: Any object is identified by a key and described by a set of attributes (values). Also known as hash stores
● Column oriented: Data are stored in tables with families of predefined columns, which facilitates OLAP operations
● Graph databases: They store not only objects but also the relationships among them, shaping graphs of information

Data Storage in Depth- Document oriented

● The basic unit is the document
● A document can have an arbitrary number of fields
● Each field can be of a different type and size
● Each field can store multiple values
● Examples of documents are XML, JSON, or similar
● Document databases do not need a fixed document schema
● Each document can have different fields from the other documents in the database
● Security is assigned at document level
● Full-text search capabilities with high performance


● JSON document example
● Unlike the key/value model, the id is part of the document
● Full-text search is provided over the whole document
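As a minimal sketch of these points, the following Python snippet builds a JSON document whose identifier is one of its own fields and runs a naive full-text search over the whole document. The field names (`_id`, `title`, `tags`, `body`) and the `full_text_search` helper are hypothetical and chosen only for illustration; a real document database would rely on an index rather than a scan.

```python
import json

# A hypothetical document: the identifier is stored inside the document itself.
document = {
    "_id": "doc-001",
    "title": "Analysis of big data",
    "tags": ["storage", "nosql"],  # one field holding multiple values
    "body": "Document databases do not need a fixed schema.",
}


def full_text_search(doc, term):
    """Return True if `term` occurs anywhere in the serialised document."""
    return term.lower() in json.dumps(doc).lower()


# Round trip through JSON text, as a document store would persist it.
serialised = json.dumps(document, indent=2)
restored = json.loads(serialised)
```

Because the search runs over the serialised text, it also matches field names, which is acceptable for a sketch but not for production use.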


Data Storage in Depth- Key/value stores

● Stores that can hold information of any kind and type
● Objects are identified by a unique key
● Objects are defined by an arbitrary set of attributes
● There is neither structure nor restrictions
● They are also known as hash stores
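The idea can be sketched with a toy in-memory store backed by a Python dictionary, which is itself a hash table. The class name `KeyValueStore` and the sample keys are assumptions made for illustration; they do not correspond to the API of any product listed in the survey.

```python
class KeyValueStore:
    """A toy in-memory key/value store backed by a dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Any value is accepted: there is no schema and no type restriction.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is silently ignored.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1", {"name": "Alice", "age": 30})  # a set of attributes
store.put("session:42", b"\x00\x01")               # raw bytes
store.put("counter", 7)                            # a plain integer
```

The store only understands keys: finding every user older than 25 would require reading each value, which is the price of having no structure.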


Data Storage in Depth- Column oriented

● Unlike SQL databases organised as rows, column-oriented databases are organised around columns
● Tables are defined as families of columns
● It is easy to implement OLAP operations
  ○ Drill, roll, slice & dice, pivot
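A small sketch, assuming a hypothetical sales table with `region`, `year` and `amount` columns, shows the same data in row and column layout and two OLAP-style operations over the columns. The function names `slice_total` and `roll_up` are illustrative only.

```python
from collections import defaultdict

# Row-oriented layout: one record per row.
rows = [
    {"region": "North", "year": 2014, "amount": 10},
    {"region": "North", "year": 2015, "amount": 20},
    {"region": "South", "year": 2014, "amount": 5},
    {"region": "South", "year": 2015, "amount": 15},
]

# Column-oriented layout: one list per column; position i across the
# lists corresponds to row i.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def slice_total(cols, dimension, value, measure="amount"):
    """Slice: fix one dimension to a value and total the measure."""
    return sum(m for d, m in zip(cols[dimension], cols[measure]) if d == value)


def roll_up(cols, dimension, measure="amount"):
    """Roll-up: aggregate the measure for each distinct dimension value."""
    totals = defaultdict(int)
    for d, m in zip(cols[dimension], cols[measure]):
        totals[d] += m
    return dict(totals)
```

Each aggregate reads only the columns it needs, which is the reason this layout suits analytical queries.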


Data Storage in Depth- Graph databases

Relational databases lack relationships

Bob’s friends

Alice’s friends-of-friends

What about big data?


NoSQL databases also lack relationships

Relationships can be emulated by aggregated fields, but:
- They must be maintained (updated and deleted) programmatically.
- Aggregated links are not reflexive: there is no pointer backward (e.g. to find out who bought a product).
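Both drawbacks can be seen in a short sketch. The `users` collection, its `purchases` field and the three helper functions are hypothetical; they only mimic documents that keep product ids in an aggregated field.

```python
# Hypothetical user documents: purchases are kept as an aggregated field of ids.
users = {
    "alice": {"purchases": ["p1", "p2"]},
    "bob": {"purchases": ["p2"]},
    "carol": {"purchases": []},
}


def products_bought_by(user_id):
    """Forward direction: follow the aggregated field directly."""
    return users[user_id]["purchases"]


def buyers_of(product_id):
    """Backward direction: no link exists, so every user must be scanned."""
    return sorted(u for u, doc in users.items() if product_id in doc["purchases"])


def delete_product(product_id):
    """A deletion must be propagated by hand to every aggregated field."""
    for doc in users.values():
        doc["purchases"] = [p for p in doc["purchases"] if p != product_id]
```

The forward lookup is a single access, whereas the backward lookup and the deletion both touch every document in the collection.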



A graph is a collection of vertices representing entities and edges representing the relationships among them.

In a property graph, both nodes and relationships can have properties.

A graph data model means that data are modelled as such a graph.

A (property) graph database is an online database management system with Create, Read, Update and Delete methods that expose a (property) graph data model.
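These definitions can be sketched as a toy in-memory property graph with create, read, update and delete methods. The class `PropertyGraph`, the `FRIEND` relationship type and the sample people are assumptions for illustration and are unrelated to the internals of Neo4J or any other product.

```python
class PropertyGraph:
    """Toy property graph: nodes and relationships both carry properties."""

    def __init__(self):
        self.nodes = {}          # node id -> properties
        self.relationships = []  # (source id, type, target id, properties)

    def create_node(self, node_id, **properties):
        self.nodes[node_id] = dict(properties)

    def create_relationship(self, source, rel_type, target, **properties):
        self.relationships.append((source, rel_type, target, dict(properties)))

    def neighbours(self, node_id, rel_type):
        """Read: ids reached over outgoing relationships of the given type."""
        return {t for s, r, t, _ in self.relationships
                if s == node_id and r == rel_type}

    def update_node(self, node_id, **properties):
        self.nodes[node_id].update(properties)

    def delete_node(self, node_id):
        """Delete a node together with every relationship touching it."""
        del self.nodes[node_id]
        self.relationships = [
            rel for rel in self.relationships if node_id not in (rel[0], rel[2])
        ]

    def friends_of_friends(self, node_id, rel_type="FRIEND"):
        """Nodes two hops away that are neither direct friends nor the start."""
        direct = self.neighbours(node_id, rel_type)
        second = set()
        for friend in direct:
            second |= self.neighbours(friend, rel_type)
        return second - direct - {node_id}


graph = PropertyGraph()
for name in ("Alice", "Bob", "Zach"):
    graph.create_node(name.lower(), name=name)
graph.create_relationship("alice", "FRIEND", "bob", since=2010)
graph.create_relationship("bob", "FRIEND", "alice", since=2010)
graph.create_relationship("bob", "FRIEND", "zach", since=2012)
```

Here the friends-of-friends question is answered by following relationships from the starting node. This sketch still scans a list on each hop, whereas a graph database stores the relationships so that they can be traversed directly.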


Node with a property whose value is “Harry”

Relationship with a property whose value is “Follows”

Property graph


Cypher is an expressive graph database query language.

Cypher is designed to be easily read and understood by developers, database professionals and business stakeholders.

The key to Cypher is that it lets us find data that matches a specific pattern, following our intuition of describing graphs with diagrams.


Relation type and direction

Nodes

Separation among subgraphs


The simplest query:

- a START clause followed by MATCH and RETURN clauses


- START: specifies the starting point(s) in the graph (e.g. nodes or relationships).

- MATCH: specifies the pattern by example, using characters to represent nodes and relationships, in order to draw the data we are interested in.

- RETURN: defines the nodes, relationships and/or attributes that should be returned.
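The three clauses can be illustrated with query text held in Python strings; nothing here connects to a database. The index name `user`, the label `Person`, the relationship type `FOLLOWS` and the helper `followers_query` are hypothetical. Note also that START belongs to older Neo4J releases: it was deprecated and later removed, and in recent versions the starting point is expressed inside MATCH, as in the second query.

```python
# Legacy form described above: START picks the starting node from an index
# (the index name `user` is hypothetical).
legacy_query = (
    "START a=node:user(name='Harry') "
    "MATCH (a)-[:FOLLOWS]->(b) "
    "RETURN b"
)


def followers_query(name_parameter="name"):
    """Build the MATCH/RETURN form; the name is passed as a query parameter."""
    return (
        "MATCH (a:Person {name: $" + name_parameter + "})"
        "-[:FOLLOWS]->(b:Person) "
        "RETURN b.name"
    )


modern_query = followers_query()
parameters = {"name": "Harry"}  # supplied separately from the query text
```

In both queries the parentheses draw nodes and the arrow with square brackets draws a directed relationship, which is the "specification by example" that MATCH refers to.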



OTHER CYPHER CLAUSES

- WHERE: provides criteria for filtering.

- CREATE (UNIQUE): creates nodes and relationships.

- DELETE: removes nodes, relationships and properties.

- SET: sets property values on nodes and relationships.

- FOREACH: performs an updating action for each element of a list.

- UNION: merges results from different queries.

- WITH: pipes results from one query part to the next.



Data Processing- Types

BATCH (VOLUME), STREAMING (VELOCITY), HYBRID (both)

● Batch processing for large volumes of information (e.g. DNA sequencing)
● Streaming processing for rapidly generated data (e.g. Twitter)
● Hybrid processing for large volumes of rapidly generated data (e.g. in-depth analysis of Twitter tweets)
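The difference between the first two types can be sketched by counting hashtags in both styles. The sample tweets and the names `batch_count` and `StreamingCounter` are made up for illustration; real systems would distribute this work across many nodes.

```python
from collections import Counter

tweets = ["#bigdata is here", "learning #spark and #bigdata", "#spark streaming"]


def hashtags(text):
    """Extract the whitespace-separated tokens that start with '#'."""
    return [word for word in text.split() if word.startswith("#")]


def batch_count(all_tweets):
    """Batch: the whole stored collection is processed in one pass."""
    counts = Counter()
    for text in all_tweets:
        counts.update(hashtags(text))
    return counts


class StreamingCounter:
    """Streaming: state is updated as each tweet arrives; nothing is re-read."""

    def __init__(self):
        self.counts = Counter()

    def on_tweet(self, text):
        self.counts.update(hashtags(text))
        # The current top hashtag is available after every single event.
        return self.counts.most_common(1)[0] if self.counts else None
```

Both approaches reach the same totals; the batch version answers once at the end, while the streaming version can answer after each tweet.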

Data Processing- Processing steps

DATA ACQUISITION

DATA STORAGE

DATA ANALYSIS

In-depth analysis of a Twitter stream

tweets/second, tweets/minute, tweets/hour, tweets/day

- Retrieve and store
- Evolution
- Words and topics
- Labelling
- Hashtags
- People
- Locations
- Brands
- Polarity, stance
- Users, relationships
- Gender, age
- Author profile
- ...

https://www.youtube.com/watch?v=YrqMEn-5Pi8

Data Processing- Batch processing

Map/Reduce paradigm:

● Map: The Map process divides the data into subsets and sends them to each processing node in key-value format <K, V>
● Reduce: Each node returns its result in key-list of values format <K, L(V)>, and these results are combined to produce the final result

Example of counting words in a text:

● Map: A line of text is sent to each node, where the key K is the line number and the value V is the line of text <nline, text>. The result of the task is a list of pairs <word, 1>, one for each word in the text.
● Reduce: It collects all the outputs of the Map processes as pairs <key, value>, i.e. <word, 1>, and it is responsible for grouping them into pairs <word, occurrences> by adding up the ones of each word




function Map (key, values) {
    // key: line number, values: the line of text
    for each word w in values {
        emit (w, 1)
    }
}

function Reduce (word, list_of_values) {
    total = 0
    for each value v in list_of_values {
        total += v
    }
    return (word, total)
}
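The Map and Reduce functions above can be written as a runnable single-process sketch in Python. The grouping of pairs by word, which the slides fold into Reduce, is shown here as a separate `shuffle` function; that name, like `map_line`, `reduce_word` and `word_count`, is an assumption for illustration. In a real cluster the map calls would run on different nodes.

```python
from collections import defaultdict


def map_line(line_number, text):
    """Map: emit a <word, 1> pair for every word in one line of text."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group the emitted values by key: <word, [1, 1, ...]>."""
    grouped = defaultdict(list)
    for word, value in pairs:
        grouped[word].append(value)
    return grouped


def reduce_word(word, values):
    """Reduce: add up the ones collected for a word."""
    return word, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = []
    for line_number, text in enumerate(lines, start=1):
        pairs.extend(map_line(line_number, text))
    return dict(reduce_word(w, vs) for w, vs in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
```

The line number plays the role of the input key K, although word counting itself never uses it.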


ACQUISITION, STORAGE, PROCESSING

Data Processing- Stream processing

ACQUISITION, STORAGE, PROCESSING

Kestrel, Trident

Data Processing- Hybrid processing

Summingbird


