corso di comunicazione digitale...


Post on 09-Oct-2020


Page 1: Corso di Comunicazione Digitale Multimedialecsu.unipv.it/wp-content/uploads/2020/01/11_beyond.pdfExemples: Neo4j (Vicknair, 2010), Infogrid (Giannadakis, 2010) Pluto Fabriano Paperino

Beyond databases

The increasing convergence between databases and search engines

Data models and physical storage

Safety, performance, cost, and usability

SBA (Search based applications)

The case of social media

Architecture of the infrastructure and data

Classification of user-generated content


Convergence Database - Search Engines

What is changing in computer-based information management?

The boundary between databases and search engines is disappearing

A new class of software: Search Based Applications (SBA)

Databases: creation, storage and maintenance of, and access to, structured data (discrete units of information and their relationships)

Search engines: searching documents of unstructured data for information (usually text)


Definition of SBA

We define a Search-Based Application as software built on top of a search engine or a database whose purpose is not simple information retrieval but mission-oriented access to, analysis of and discovery of information.

Examples:

Customer services

Logistics

Targeted advertising

Decision support

e-Discovery

Source: Susan Feldman, IDC LINK, June 2010


The purpose of SBA

The main purpose of SBA is to overcome the common shortcomings of information systems (IS), whether databases or search engines:

Obsolete data

Low usability

Packaging of user requirements

Rigidity of the system

Limited scalability

Example of SBA: Sourcier by Exalead


Sourcier by Exalead


Goals of SBA

Data volumes are exploding thanks to new data-capture technologies: bar code and QR code scanners, RFID, GPS, cloud, mobile computing, virtualization, 3D, ...

This has increased the importance of having tools to:

Search huge amounts of distributed data

Integrate data from different sources (physical and virtual)

Improve ease of access for non-experts

Digital information: 0.8 zettabytes in 2009, 35 zettabytes in 2020

The information must be made understandable and useful to as many users as possible.


History: Origins and Evolution (1/2)

Search engines: Google and its ancestors: Cleverdon (1950), Masterman (1958), Salton's SMART (1971). Model: from an approximate model to full-text search

Databases: from the first examples of databases (OLTP, Online Transaction Processing) to OLAP (Online Analytical Processing) and DIS data warehouses. Main problem: consistency

Features       Search engines          Databases
Origin         Web                     Enterprise
Main use       Information retrieval   Transaction processing
Target         Unstructured text       Structured data
No. of users   Unlimited               Limited
Data volume    Billions of records     Millions of records


History: Origins and Evolution (2/2)

Factors that have facilitated the convergence and the emergence of SBA:

Search engines enter the enterprise (Verity Topics Engine, 1988) http://www.answerw.com/topic/verity-inc

Databases must be online (e-commerce)

The main structural and conceptual changes:

Changes in data models

Logical storage structures

Procedures for data collection / population

New methods of data processing and data retrieval


Data models and Storage (1/2)

In the early days of the Web, the document was the Web page (with keywords and descriptive information such as title and author: the metadata). Today a document is, more generally, a Word document, a presentation, an e-mail.

Features                    Search engine    Database
Semantic model              Document         Relational model
Logical storage structure   Index            Relation
Representation              Not normalized   Normalized
Storage architecture        Distributed      Centralized


Data models and Storage (2/2)

The index of a search engine is its main storage and contains the information required to locate each document: location in the file system, URL.

Copies can also be kept in a cache (a compressed copy holding selected key information and the associated metadata)



Example of a full-text search engine

First, a direct index (forward index) is built on the documents to be searched

Then an inverted index is built from it, recording frequency and position

Document   Words
Doc1       Dog black jumps gate dog bites bone
Doc2       Small children play sled Christmas
Doc3       Horses gallop dog plays

ID   Term       Documents (document: frequency: positions)
1    Dog        Doc1: 2: 1, 5; Doc3: 1: 3
2    Horses     Doc3: 1: 1
3    Children   Doc2: 1: 2
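The two-step construction above can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular engine does it: the function names are made up for the example, the three documents are the slide's keyword lists rendered in English, and positions are counted from 1 to match the slide.

```python
from collections import defaultdict

# Keyword lists from the slide (English rendering); stop words already removed.
docs = {
    "Doc1": "dog black jumps gate dog bites bone",
    "Doc2": "small children play sled christmas",
    "Doc3": "horses gallop dog plays",
}

def build_forward_index(documents):
    """Forward index: document id -> ordered list of its terms."""
    return {doc_id: text.lower().split() for doc_id, text in documents.items()}

def build_inverted_index(forward_index):
    """Inverted index: term -> {document id: [1-based positions]}."""
    inverted = defaultdict(dict)
    for doc_id, terms in forward_index.items():
        for position, term in enumerate(terms, start=1):
            inverted[term].setdefault(doc_id, []).append(position)
    return dict(inverted)

forward = build_forward_index(docs)
inverted = build_inverted_index(forward)

# Frequency is simply the number of recorded positions.
for doc_id, positions in inverted["dog"].items():
    print(doc_id, len(positions), positions)
```

Looking up a term is then a single dictionary access, which is why the inverted index, and not the forward one, is what queries run against.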


Database

Normalized relational model (Codd, 1990): the relational schema of a normalized database serves both as a model and as a conceptual framework for storage

Product ID   Producer   Product name   Price   Color
123          Produtt1   Pen            2.20    Blue
124          Produtt2   Pencil         1.20    Red
125          Produtt3   Rubber         0.80    Yellow

ID_producer   Name       Address
Produtt1      Paperino   Fabriano, via mascagni 35, tel 0…
Produtt2      Pluto      Milano, Viale Zara 302…
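As a rough illustration of how the two normalized tables work together, the sketch below loads them into an in-memory SQLite database and joins them on the producer ID. The table and column names are choices made for this example. The truncated addresses are left out, and product 125 is omitted because the slide shows no row for its producer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE producer (
        id_producer TEXT PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        producer     TEXT REFERENCES producer(id_producer),
        product_name TEXT NOT NULL,
        price        REAL,
        color        TEXT
    );
""")
conn.executemany(
    "INSERT INTO producer VALUES (?, ?)",
    [("Produtt1", "Paperino"), ("Produtt2", "Pluto")],
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?, ?)",
    [(123, "Produtt1", "Pen", 2.20, "Blue"),
     (124, "Produtt2", "Pencil", 1.20, "Red")],
)

# Producer details are stored once and reached through the foreign key.
rows = conn.execute("""
    SELECT p.product_name, p.price, pr.name
    FROM product AS p
    JOIN producer AS pr ON pr.id_producer = p.producer
    ORDER BY p.product_id
""").fetchall()
print(rows)
```

Normalization keeps each producer's data in one place; the price paid is that every query needing both product and producer information has to perform a join.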


What has changed recently: Engines

The concept of document evolves: not only text but much more. This creates a more complex version of the document.

Text document, document attributes:

Date

Title

Author

Content

Security

Language

File size

Figurative document, document attributes:

DateModified

Source

Content(product_ID, Product_name, Price, Colour)

Security

Language…


What has changed recently: Database

The relational model, based on normalized tables of rows, is abandoned. Both the data model and the storage change.

NoSQL databases (Leavitt 2010), or VLDB (Very Large Databases), are born:

Key-value

Document database

Wide-column stores

Graph database


Key-value Database

The information is represented by key-value pairs organized by column indices.

Features: distributed architectures, technologies derived from the Web, relaxed theoretical consistency and integrity constraints.

Key       Value
Prod123   Paper
Prod124   Pencil
Prod125   Rubber
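A toy sketch of the idea, assuming a hash-based placement of keys on a fixed number of shards to stand in for the distributed architecture mentioned above. The class name and the sharding rule are illustrative only and are not taken from any specific product.

```python
import zlib

class ShardedKeyValueStore:
    """Toy key-value store: each key is hashed onto one of several shards."""

    def __init__(self, shard_count=3):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)

store = ShardedKeyValueStore()
for key, value in [("Prod123", "Paper"), ("Prod124", "Pencil"), ("Prod125", "Rubber")]:
    store.put(key, value)

print(store.get("Prod124"))
```

The store knows nothing about the value: it can only be fetched by its key, which is what makes this model easy to distribute and hard to query.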


Document Database

They use the same structure as key-value databases, but the values can be semi-structured data that can in turn be searched.

Examples: XML-based databases that store data as XML documents.

Key       Value
Prod123   Product: Pen, Price: 2.20, Colour: Blue
Prod124   Product: Pencil, Colour: Red
Prod125   Product: Rubber, Price: 0.80
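The difference from a plain key-value store can be shown with a small sketch in which each value is a dictionary of fields that need not be the same from one document to the next. The helper names `find` and `with_field` are invented for this example.

```python
# Each value is a semi-structured document; fields may be missing.
documents = {
    "Prod123": {"Product": "Pen", "Price": 2.20, "Colour": "Blue"},
    "Prod124": {"Product": "Pencil", "Colour": "Red"},
    "Prod125": {"Product": "Rubber", "Price": 0.80},
}

def find(store, **criteria):
    """Keys of the documents whose fields match every given value."""
    return [
        key for key, doc in store.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

def with_field(store, field):
    """Keys of the documents that define the given field at all."""
    return [key for key, doc in store.items() if field in doc]

print(find(documents, Colour="Blue"))
print(with_field(documents, "Price"))
```

A real document database would index the fields instead of scanning every document, but the query model is the same: the search reaches inside the value.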


Wide Column Database

These databases are based on structures often called Big Table (derived from Bigtable, originally used by Google in 2006)

ROW_ID    Column family   Title    Time   Value
Prod123   Product         Name     23     Pen
          Product         Price    15     1.20
          Product         Price    16     2.80
          Product         Colour   47     Blue
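A minimal sketch of the structure in the table, assuming that the Time column is a version timestamp and that the two price rows are successive versions of the same cell (the slide does not say so explicitly). The class and method names are hypothetical.

```python
from collections import defaultdict

class WideColumnTable:
    """Toy wide-column table: row -> (family, column) -> {timestamp: value}."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_id, family, column, timestamp, value):
        self.rows[row_id][(family, column)][timestamp] = value

    def get(self, row_id, family, column, at=None):
        """Newest value of a cell, or the newest one not later than `at`."""
        versions = self.rows.get(row_id, {}).get((family, column), {})
        candidates = [t for t in versions if at is None or t <= at]
        return versions[max(candidates)] if candidates else None

table = WideColumnTable()
table.put("Prod123", "Product", "Name", 23, "Pen")
table.put("Prod123", "Product", "Price", 15, 1.20)   # assumed older price
table.put("Prod123", "Product", "Price", 16, 2.80)   # assumed newer price
table.put("Prod123", "Product", "Colour", 47, "Blue")

print(table.get("Prod123", "Product", "Price"))
print(table.get("Prod123", "Product", "Price", at=15))
```

Keeping every version under its timestamp means a write never overwrites anything, and a read can ask for the state of a cell at a given time.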


Graph Database

Graph databases (Angles and Gutierrez 2008) replace relational tables with graphs that highlight the one-to-many relationships between keys and values.

Examples: Neo4j (Vicknair, 2010), Infogrid (Giannadakis, 2010)

[Figure: example graph. Nodes: Pluto, Paperino, Fabriano, Prod123, Notepad, 1.20, Blue. Edge labels: city (twice), produces, name, price, colour]
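A small sketch of a labelled graph in Python. The node and edge labels come from the figure, but the figure's exact wiring is not recoverable from the transcript, so the specific edges below (which node points to which) are assumptions made for illustration, as are the class and method names.

```python
from collections import defaultdict

class Graph:
    """Toy labelled graph: node -> list of (edge label, target node)."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbours(self, node, label=None):
        """Targets reachable from `node`, optionally filtered by edge label."""
        return [
            target for edge_label, target in self.edges.get(node, [])
            if label is None or edge_label == label
        ]

g = Graph()
# Assumed wiring: the slide only shows the labels, not the connections.
g.add_edge("Paperino", "city", "Fabriano")
g.add_edge("Paperino", "produces", "Prod123")
g.add_edge("Prod123", "name", "Notepad")
g.add_edge("Prod123", "price", 1.20)
g.add_edge("Prod123", "colour", "Blue")

# Two-hop traversal: the names of everything Paperino produces.
names = [
    name
    for product in g.neighbours("Paperino", "produces")
    for name in g.neighbours(product, "name")
]
print(names)
```

Where the relational model would need a join between two tables, the graph model simply follows edges from one node to the next.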


Data collection and population: search engines

Databases and search engines have two different ways of collecting and entering information.

Search engine: crawler (Heydon and Najork, 1999), with different update strategies (biweekly, monthly, or at a rate based on heuristics or on the quality of the sites) and data collection modes (incremental, differential).

EVOLUTION: API support for interfacing with SBA

Feature          Search engine      Database
Main             Crawler            Direct write
Pre-processing   Not necessary      Necessary
Updated data?    Almost real time   > 24 h


Data collection and population: databases

Database: writes through atomic operations, or through batch ETL (Extract, Transform and Load) software.

EVOLUTION: NoSQL databases adopt methods typical of search engines (HTTP protocol, crawlers)



Data processing (1/2)

Database: data are mapped, formatted and loaded according to rigidly defined structures and procedures. The data must be validated when they are loaded into the database.

EVOLUTION: some databases are beginning to use natural language to increase the flexibility of ETL applications.

Feature      Search engine      Database
Main         Natural language   Data processing (arithmetic and Boolean operations)
Technology   Semantic           Data mapping


Data processing (2/2)

Search engines: use natural language processing (NLP) through distinct steps:

Language identification, tokenization, stemming / lemmatization, tagging, chunking.
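Two of these steps, tokenization and stemming, can be illustrated with a deliberately naive sketch. The suffix list is a toy rule set chosen for the example, not a real stemming algorithm such as Porter's, and language identification, tagging and chunking are left out.

```python
import re

# Toy suffix rules for English; a real stemmer is considerably more careful.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The horses gallop; the dog plays.")
stems = [stem(token) for token in tokens]
print(tokens)
print(stems)
```

A stem does not have to be a dictionary word ("horses" becomes "hors" here); it only has to be the same for all the forms that should match each other in the index.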

EVOLUTION: expansion of statistical analysis, the ability to extract semantic information and to search for semantic relationships between people, objects, concepts, ...



Security, usability, cost and performance

EVOLUTION:

Security in search engines: confidentiality, integrity, availability

Birth of SBA

Feature             Search engine         Database
Security            Low                   High
Usability           Medium/high           Low
Data volume         Billions of records   Millions of records
Response time       < 1 sec.              Seconds, minutes, hours
No. of users        Unlimited             Limited
Cost for the user   Low                   High


The result of the convergence: SBA

SBA are used by businesses and by end users: they combine the advantages of various types of infrastructure:

The ability of relational databases to identify data in detail

The scalability, performance and flexible data models of NoSQL databases

The usability and ease of use of Web search engines

A framework that uses advanced technologies (e.g. semantic) to combine structured and unstructured data in such a way that everything is easier, more relevant and more understandable to the user.


When are SBA a good choice?

If it is necessary to aggregate heterogeneous content

If the number of target users is potentially very high

If data volumes are high

If real-time information is a must

If highly customized reports are needed without SQL programming

For the democratization of information


Applicability of SBA

Customer service & support

Sales support (CRM; telemarketing)

Logistics

Business Intelligence

Web

Database offloading

Media and social media


Anatomy of an SBA (1/2)

SBA can be grouped according to the type of content that they process:

Structured data

• To increase the scalability and usability of RDBMS interfaces

Unstructured data

• To aggregate content from multiple sources

Mix of structured and unstructured data

• For the static or dynamic aggregation of data from multiple sources, including at least one structured source


Anatomy of an SBA (2/2)

Some characteristics:

SBA on structured data: the output of a database query is ordered by a relevance index that is not stored directly in the database but calculated by the SBA

Example: list of products ordered by order volume, price, delivery time
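The idea of a ranking computed by the application and not stored in the database can be sketched as follows. Everything numeric here is hypothetical: the order volumes, delivery times and weights are invented for the example, and a linear score is only one of many possible choices.

```python
# Rows as they might come back from a database query (values are made up).
products = [
    {"id": "Prod123", "order_volume": 120, "price": 2.20, "delivery_days": 3},
    {"id": "Prod124", "order_volume": 300, "price": 1.20, "delivery_days": 5},
    {"id": "Prod125", "order_volume": 80,  "price": 0.80, "delivery_days": 1},
]

# Hypothetical weights: popular products rank higher,
# expensive or slow-to-deliver ones rank lower.
WEIGHTS = {"order_volume": 0.01, "price": -0.5, "delivery_days": -0.2}

def relevance(product):
    """Score computed at query time; never written back to the database."""
    return sum(weight * product[field] for field, weight in WEIGHTS.items())

ranked = sorted(products, key=relevance, reverse=True)
ranked_ids = [product["id"] for product in ranked]
print(ranked_ids)
```

Because the score lives in the application, the weights can be changed per user or per task without touching the database schema.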

SBA on unstructured or mixed data:

Summarization, event and fact extraction, sentiment analysis, multimedia analysis


Case study: Urbanizer

www.urbanizer.com

It takes different data sources and combines them using semantic analysis and sentiment analysis techniques

Database: yellow pages with a list of restaurants

Web and search engines: additional information, maps

UGC: comments from customers

It also analyzes the texts to try to extract information about the user's mood and the "atmosphere" of the venue.

Semantic Analytics for Mood-based local search