Digital communication course...
TRANSCRIPT
Beyond databases
The increasing convergence between databases and search
engines.
Data models and physical storage
Security, performance, cost, and usability
SBA (Search-Based Applications)
The case of social media
Architecture of the infrastructure and data
Classification of user-generated content
Convergence Database - Search Engines
What is changing in computer-based information management?
The boundary between databases and search engines is
disappearing
A new class of software: Search Based Applications (SBA)
Databases: creation, storage, maintenance of and access to structured
data (discrete units of information and their relationships)
Search engines: retrieval of information (usually text) from
documents, i.e. unstructured data
Definition of SBA
We define a Search-Based Application as
software built on top of a search engine or a database whose
purpose is not simple information retrieval but mission-oriented
access to, analysis of and discovery of information
Examples: Customer services
Logistics
Targeted advertising
Decision support
e-Discovery
Source: Susan Feldman, IDC LINK, June 2010
The purpose of SBA
The main purpose of SBAs is to overcome the common shortcomings
of information systems (databases or search engines):
Obsolete data
Low usability
Packaging of user requirements
Rigidity of the system
Limited scalability
Example of SBA: Sourcier by Exalead
Goals of SBA
Data volumes are exploding thanks to new data-capture technologies:
barcode and QR-code scanners, RFID, GPS, cloud, mobile
computing, virtualization, 3D...
This has increased the importance of having tools to:
Search huge amounts of distributed data
Integrate data from different sources (physical and virtual)
Improve ease of access for non-experts
Digital information: 0.8 zettabytes (2009), 35 zettabytes (2020)
The information must be made understandable and useful to as
many users as possible.
History: Origins and Evolution (1/2)
Search engines: Google and its ancestors: Cleverdon (1950),
Masterman (1958), Salton's SMART (1971). Model: from model to
approximate full-text search
Databases: from the first databases (OLTP, Online
Transaction Processing) to OLAP (Online Analytical Processing) and
DIS data warehouses. Main problem: consistency
Feature       Search engines         Databases
Origin        Web                    Enterprise
Main use      Information retrieval  Transaction processing
Target        Unstructured text      Structured data
No. of users  Unlimited              Limited
Data volume   Billions of records    Millions of records
History: Origins and Evolution (2/2)
Factors that have facilitated the convergence and the emergence of
SBA:
Search engines enter the enterprise (Verity Topics Engine,
1988) http://www.answerw.com/topic/verity-inc
Databases must go online (e-commerce)
The main structural and conceptual changes:
Changes in data models
Logical storage structures
Procedures for data collection / population
New methods of data processing and data retrieval
Data models and Storage (1/2)
In the early days of the Web, the document was the Web page
(with keywords and descriptive information such as title, author ...
metadata). Today it is, more generally, a Word document, a
presentation or an e-mail.
Feature                    Search engine   Database
Semantic model             Document        Relational model
Logical storage structure  Index           Relation
Representation             Not normalized  Normalized
Storage architecture       Distributed     Centralized
Data models and Storage (2/2)
The index of a search engine is its main storage structure and
contains what is needed to locate a document: its location in the
file system or its URL.
Cached copies can also be created (a compressed copy that keeps key
information and associated metadata)
Example of a full-text search engine
First, a forward index is built over the documents to be searched
Then an inverted index is built, recording for each term its
frequency and positions in each document
Document  Words
Doc1      Dog black jumps gate dog bites bone
Doc2      Small children play sled Christmas
Doc3      Horses gallop dog plays
ID  Term      Document: frequency: positions
1   Dog       Doc1: 2: 1, 5   Doc3: 1: 3
2   Horses    Doc3: 1: 1
3   Children  Doc2: 1: 2
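The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular engine. The function names are hypothetical, and the three sample documents are written in English (cane = dog, cavalli = horses, bambini = children) with the original word order kept, so the positions match the table.

```python
from collections import defaultdict

# The three sample documents, already reduced to content words.
docs = {
    "Doc1": "dog black jumps gate dog bites bone",
    "Doc2": "small children play sled christmas",
    "Doc3": "horses gallop dog plays",
}

def build_forward_index(documents):
    """Forward index: document ID -> ordered list of lowercase tokens."""
    return {doc_id: text.lower().split() for doc_id, text in documents.items()}

def build_inverted_index(forward_index):
    """Inverted index: term -> {document ID: [1-based positions]}."""
    inverted = defaultdict(dict)
    for doc_id, tokens in forward_index.items():
        for position, term in enumerate(tokens, start=1):
            inverted[term].setdefault(doc_id, []).append(position)
    return dict(inverted)

def postings(inverted, term):
    """Return {document ID: (frequency, positions)} for a term."""
    found = inverted.get(term.lower(), {})
    return {doc_id: (len(pos), pos) for doc_id, pos in found.items()}

forward = build_forward_index(docs)
inverted = build_inverted_index(forward)
print(postings(inverted, "dog"))
```

Storing positions as well as frequencies is what makes phrase and proximity queries possible later; a frequency-only index would be smaller but could only answer "which documents contain this term".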
Database
Normalized relational model (Codd, 1990): the relational schema
of a normalized database serves both as a model and as a
conceptual framework for storage
Product ID  Producer  Product name  Price  Color
123         Produtt1  Pen           2.20   Blue
124         Produtt2  Pencil        1.20   Red
125         Produtt3  Rubber        0.80   Yellow
ID_producer  Name      Address
Produtt1     Paperino  Fabriano, via mascagni 35, tel 0…
Produtt2     Pluto     Milano, Viale Zara 302…
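As a rough illustration of how the two normalized tables relate, the sketch below recreates them in an in-memory SQLite database and joins them on the producer key. The table and column names are assumptions chosen to mirror the slide. Addresses are shortened to the city because they are truncated in the source, and product 125 is left out because its producer has no row in the producer table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE producer (
    id_producer TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    address     TEXT
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    producer     TEXT NOT NULL REFERENCES producer(id_producer),
    product_name TEXT NOT NULL,
    price        REAL,
    color        TEXT
);
""")

# Producer details are stored once, not repeated on every product row.
conn.executemany(
    "INSERT INTO producer VALUES (?, ?, ?)",
    [("Produtt1", "Paperino", "Fabriano"), ("Produtt2", "Pluto", "Milano")],
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?, ?)",
    [(123, "Produtt1", "Pen", 2.20, "Blue"), (124, "Produtt2", "Pencil", 1.20, "Red")],
)

# A join rebuilds the combined view from the two relations.
rows = conn.execute("""
    SELECT p.product_name, p.price, pr.name
    FROM product AS p
    JOIN producer AS pr ON pr.id_producer = p.producer
    ORDER BY p.product_id
""").fetchall()
print(rows)
```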
What has changed recently: Engines
The concept of document evolves: not only text but much more.
This leads to a more complex notion of document
Text Document
Document attributes
Date
Title
Author
Content
Security
Language
File size
Figurative Document
Document attributes
DateModified
Source
Content(product_ID, Product_name, Price, Colour)
Security
Language…
What has changed recently: Database
Databases move away from the relational model based on normalized
tables of rows. Both the data model and the storage change.
NoSQL databases (Leavitt 2010), or VLDB (Very Large
Databases), are born:
Key-value
Document database
Wide-column stores
Graph database
Key-value Database
Information is represented as key-value pairs
organized by column indices
Features: distributed architectures, technologies derived from
the Web, relaxed consistency and integrity constraints.
Key Value
Prod123 Paper
Prod124 Pencil
Prod125 Rubber
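A key-value store can be pictured as a dictionary whose keys are spread over several machines. The sketch below is a toy under stated assumptions: the class name, the three in-memory "nodes" and the hash-based placement are all illustrative choices, not the design of any specific product. The data are the pairs from the table above.

```python
import hashlib

class KeyValueStore:
    """Toy key-value store that spreads keys over several in-memory nodes."""

    def __init__(self, node_count=3):
        # Each dict stands in for one machine of a distributed cluster.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash of the key decides which node holds it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:4], "big") % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

store = KeyValueStore()
for key, value in [("Prod123", "Paper"), ("Prod124", "Pencil"), ("Prod125", "Rubber")]:
    store.put(key, value)

print(store.get("Prod124"))
```

The store only ever looks values up by key; it knows nothing about what a value contains, which is what keeps it simple to distribute.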
Document Database
They use the same structure as key-value databases, but the values
can be semi-structured data that are themselves searchable
Examples: XML-based databases that store data as XML
documents.
Key      Value
Prod123  Product: Pen, Price: 2.20, Colour: Blue
Prod124  Product: Pencil, Colour: red
Prod125  Product: Rubber, Price: 0.80
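The difference from a plain key-value store is that the value has fields that can be queried. A minimal sketch, using nested dictionaries in place of XML documents; the helper names `find` and `find_with_field` are hypothetical, and the second key of the table is read as Prod124.

```python
# Each value is a semi-structured document: fields may be missing.
documents = {
    "Prod123": {"Product": "Pen", "Price": 2.20, "Colour": "Blue"},
    "Prod124": {"Product": "Pencil", "Colour": "red"},
    "Prod125": {"Product": "Rubber", "Price": 0.80},
}

def find(store, **criteria):
    """Return the keys of documents whose fields match all criteria."""
    return sorted(
        key for key, doc in store.items()
        if all(doc.get(field) == wanted for field, wanted in criteria.items())
    )

def find_with_field(store, field):
    """Return the keys of documents that define the given field at all."""
    return sorted(key for key, doc in store.items() if field in doc)

print(find(documents, Colour="Blue"))
print(find_with_field(documents, "Price"))
```

No schema is enforced: Prod124 has no price and Prod125 has no colour, and both are still valid entries.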
Wide Column Database
These databases are based on structures often called Bigtable
(derived from the Bigtable system described by Google in 2006)
ROW_ID   Column Family  Title   Time  Value
Prod123  Product        Name    23    Pen
         Product        Price   15    1.20
         Product        Price   16    2.80
         Product        Colour  47    Blue
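A wide-column row can be sketched as a nested map: row ID, then (column family, column title), then a list of timestamped versions. The class below is an illustrative toy, not Bigtable's API. It assumes the table above is read as one row, Prod123, holding two versions of the price (1.20 at time 15 and 2.80 at time 16).

```python
from collections import defaultdict

class WideColumnTable:
    """Toy wide-column table: row -> (family, column) -> timestamped versions."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_id, family, column, timestamp, value):
        self.rows[row_id][(family, column)].append((timestamp, value))

    def get(self, row_id, family, column, at=None):
        """Return the newest value, or the newest one with timestamp <= at."""
        versions = self.rows.get(row_id, {}).get((family, column), [])
        candidates = [(t, v) for t, v in versions if at is None or t <= at]
        if not candidates:
            return None
        return max(candidates, key=lambda tv: tv[0])[1]

table = WideColumnTable()
table.put("Prod123", "Product", "Name", 23, "Pen")
table.put("Prod123", "Product", "Price", 15, 1.20)
table.put("Prod123", "Product", "Price", 16, 2.80)
table.put("Prod123", "Product", "Colour", 47, "Blue")

print(table.get("Prod123", "Product", "Price"))         # newest version
print(table.get("Prod123", "Product", "Price", at=15))  # version as of time 15
```

Keeping every version under a timestamp, instead of overwriting, is what lets a reader ask for the value "as of" an earlier time.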
Graph Database
Graph databases (Angles and Gutierrez 2008) replace
relational tables with graphs that highlight the one-to-many
relationships between keys and values.
Examples: Neo4j (Vicknair, 2010), Infogrid (Giannadakis, 2010)
[Figure: graph of producers and products. Nodes: Pluto, Paperino,
Fabriano, Prod123. Edge labels: city, produces, name, price, colour.
Attribute values: Blocco (notepad), 1.20, Blue.]
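The idea of replacing tables with labelled edges can be sketched as follows. The `Graph` class and its method names are hypothetical, and the edges are an assumption: they are taken from the producer and product tables shown earlier, not read off the figure.

```python
from collections import defaultdict

class Graph:
    """Toy property graph: node -> list of (edge label, target node)."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbours(self, node, label):
        """Follow outgoing edges with the given label."""
        return [target for lbl, target in self.edges.get(node, []) if lbl == label]

    def sources(self, label, target):
        """Find nodes that point at target through the given label."""
        return sorted(s for s, out in self.edges.items() if (label, target) in out)

g = Graph()
g.add_edge("Paperino", "city", "Fabriano")
g.add_edge("Pluto", "city", "Milano")
g.add_edge("Paperino", "produces", "Prod123")
g.add_edge("Prod123", "name", "Pen")
g.add_edge("Prod123", "price", 2.20)
g.add_edge("Prod123", "colour", "Blue")

# Where is product Prod123 made? Walk produces backwards, then city forwards.
cities = [c for p in g.sources("produces", "Prod123") for c in g.neighbours(p, "city")]
print(cities)
```

The query is answered by walking edges rather than by joining tables, which is the main practical difference from the relational layout.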
Data collection and population: search engines
Databases and search engines collect and enter information in two
different ways.
Search engines: crawler (Heydon and Najork, 1999), different update
strategies (biweekly, monthly, rate based on heuristics or on the
quality of the sites), data collection (incremental, differential).
EVOLUTION: API support for interfacing with SBA
Feature         Search engine     Database
Main            Crawler           Direct write
Pre-processing  Not necessary     Necessary
Updated data?   Almost real time  > 24 h
Data collection and population: databases
Databases: writes through atomic operations; bulk loading through
batch ETL (Extract, Transform and Load) software.
EVOLUTION: NoSQL databases adopt methods typical of
search engines (HTTP protocol, crawlers)
Data processing (1/2)
Databases: data are mapped, formatted and loaded according to
rigidly defined structures and procedures. Data must be validated
when they are loaded into the database.
EVOLUTION: some databases are beginning to use natural
language processing to make ETL applications more flexible.
Feature     Search engine     Database
Main        Natural language  Data processing (arith. & boolean op)
Technology  Semantic          Data mapping
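The map, format, validate and load steps described for databases can be sketched as a small batch ETL job. Everything here is illustrative: the function names, the semicolon-separated input and the target table are assumptions, and the third input row carries a deliberately invalid price (not from the slides) to show a record being rejected.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the price "abc" on the last row is invalid on purpose.
RAW = """product_id;producer;product_name;price;color
123;Produtt1;Pen;2,20;Blue
124;Produtt2;Pencil;1,20;red
125;Produtt3;Rubber;abc;yellow
"""

def extract(text):
    """Extract: parse the raw text into one dict per record."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))

def transform(row):
    """Transform: map strings to typed columns; raise ValueError if invalid."""
    price = float(row["price"].replace(",", "."))
    if price < 0:
        raise ValueError("negative price")
    return (int(row["product_id"]), row["producer"],
            row["product_name"], price, row["color"].lower())

def load(conn, rows):
    """Load: write all valid rows in a single transaction."""
    with conn:
        conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?, ?)", rows)

def run_etl(conn, text):
    valid, rejected = [], []
    for row in extract(text):
        try:
            valid.append(transform(row))
        except (ValueError, KeyError):
            rejected.append(row)
    load(conn, valid)
    return len(valid), rejected

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, producer TEXT, "
             "product_name TEXT, price REAL, color TEXT)")
loaded, rejected = run_etl(conn, RAW)
print(loaded, [r["product_id"] for r in rejected])
```

Validation happens before anything is written, so a bad record never reaches the table. This is the rigidity the slide refers to, and also why database data tend to be less fresh than a search engine index.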
Data processing (2/2)
Search engines: use natural language processing (NLP) in
distinct steps:
Language identification, tokenization, stemming /
lemmatization, tagging, chunking.
EVOLUTION: broader statistical analysis, the ability to
extract semantic information, search for semantic relationships
between people, objects, concepts...
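The first three NLP steps listed above can be sketched with the standard library alone. This is a deliberately naive illustration: the stopword lists, the suffix-stripping rule and the function names are assumptions, and real engines use proper stemmers or lemmatizers. Tagging and chunking are left out because they normally need trained models.

```python
import re

# Tiny stopword lists, used both to guess the language and to filter tokens.
STOPWORDS = {
    "en": {"the", "a", "and", "is", "of", "over"},
    "it": {"il", "la", "e", "di", "un", "che"},
}
SUFFIXES = ("ing", "es", "ed", "s")

def tokenize(text):
    """Split lowercase text into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def identify_language(tokens):
    """Pick the language whose stopwords occur most often in the tokens."""
    return max(STOPWORDS, key=lambda lang: sum(t in STOPWORDS[lang] for t in tokens))

def stem(token):
    """Naive suffix stripping that keeps at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = tokenize(text)
    language = identify_language(tokens)
    content = [t for t in tokens if t not in STOPWORDS[language]]
    return language, [stem(t) for t in content]

language, stems = analyze("The black dog jumps over the gate and the dog is biting a bone")
print(language, stems)
```

Stemming maps "jumps" and "biting" to shorter index terms, so a query for "jump" can match a document that says "jumps". The price is the occasional over-stripped stem.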
Security, usability, cost and performance
EVOLUTION:
Security in search engines: confidentiality, integrity,
availability
Birth of SBA
Feature            Search engine        Database
Security           Low                  High
Usability          Medium/high          Low
Data volume        Billions of records  Millions of records
Response time      < 1 s                Seconds, minutes, hours
No. of users       Unlimited            Limited
Cost for the user  Low                  High
The result of the convergence: SBA
SBAs serve both businesses and end users: they combine the
advantages of several types of infrastructure:
The fine-grained data identification of relational databases
The scalability, performance and flexible data models of
NoSQL databases
The usability and ease of use of Web search engines
A framework that uses advanced technologies (e.g. semantic ones) to
combine structured and unstructured data so that
everything is easier, more relevant and more understandable to the user.
When are SBAs a good choice?
If it is necessary to aggregate heterogeneous content
If the number of target users is potentially very high
If the data have high volumes
If real-time information is a must
If highly customized reports are needed without SQL programming
For the democratization of information
Applicability of SBA
Customer service & support
Sales support (CRM; telemarketing)
Logistics
Business Intelligence
Web
Database offloading
Media and social media
Anatomy of an SBA (1/2)
SBAs can be grouped according to the type of content they
process:
Structured data
• To increase the scalability and usability of RDBMS interfaces
Unstructured data
• To aggregate content from multiple sources.
Mix of structured and unstructured data.
• For the static or dynamic aggregation of data from multiple sources,
including at least one structured source
Anatomy of an SBA (2/2)
Some characteristics:
SBA on structured data: the output of a database query is ranked
by a relevance index that is not stored in the
database but computed by the SBA
Example: product records ranked by order volume,
price and delivery time
SBA on unstructured data or mix:
Summarization, event & fact extraction, sentiment analysis,
multimedia analysis
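The structured-data case above, where results are ranked by a relevance index that is computed rather than stored, can be sketched as follows. The weights, the min-max normalization and the order volumes and delivery times are all hypothetical values chosen for illustration; only the product names and prices come from the earlier product table.

```python
from dataclasses import dataclass

@dataclass
class ProductRecord:
    name: str
    order_volume: int    # hypothetical
    price: float
    delivery_days: int   # hypothetical

# Assumed weights: the relevance index lives in the application, not in the database.
WEIGHTS = {"order_volume": 0.5, "price": 0.3, "delivery_days": 0.2}

def normalise(values, higher_is_better):
    """Scale values to [0, 1]; flip the scale when lower values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

def rank(records):
    """Return (record, score) pairs sorted by computed relevance, best first."""
    volume = normalise([r.order_volume for r in records], True)
    price = normalise([r.price for r in records], False)
    delivery = normalise([r.delivery_days for r in records], False)
    scores = [
        WEIGHTS["order_volume"] * v + WEIGHTS["price"] * p + WEIGHTS["delivery_days"] * d
        for v, p, d in zip(volume, price, delivery)
    ]
    return sorted(zip(records, scores), key=lambda pair: pair[1], reverse=True)

records = [
    ProductRecord("Pen", 500, 2.20, 2),
    ProductRecord("Pencil", 300, 1.20, 5),
    ProductRecord("Rubber", 100, 0.80, 1),
]
ranking = [record.name for record, _ in rank(records)]
print(ranking)
```

Because the score is computed at query time, the weights can be changed per user or per task without touching the stored records.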
Case study: Urbanizer
www.urbanizer.com
It takes different data sources and combines them using techniques
of semantic analysis and sentiment analysis
Database: yellow pages with a list of restaurants
Web and search engines: additional information, maps
UGC: comments from customers
It also analyzes the texts to extract information about the
user's mood and the atmosphere of the venue.
Semantic Analytics for Mood-based local search