why google fusion tables is not a data integration tool
TRANSCRIPT
Data Management, Integration and Collaboration in the cloud
Gianni Fiore
from Google Labs
3
An experimental Web Service for data management First seen on the web in 2009 Related scientific paper in 2010 Default Google Application from 2011
Main goal: providing seamless data integration over the web
Technology Google search engine for data tables over the web Google Documents: a collaborative, community-based documental base
Realtime collaboration Comment versioning
Google Drive: a cloud storage service Abstract how the data is accessed/stored
Google Visualization API Google Maps
Resources Very large amount of public and private data over the web
Fusion Tables Model Relational Model
Relation as a collection of homogeneous data records Spreadsheet-like data arrangement
Conjunctive queries over data: Select, to filter out a subset of the records Project, to show only properties of interest Join, to enrich data with other relations
Virtually creates new view of the same data Auto-parsing of data types (text, number, datetime or location) Advanced location parsing
Compare
View
Sharing Model Advanced permissions model Invite any people to work with you
As viewers, can see and comment As contributors, can also merge, but not edit As collaborators, can merge and edit as you As owners, can also invite others
Compare
View
Sharing Model Realtime data changes
Cell-based data versioning Cell, row and column based comments
Compare
View
Let’s have an example Three ways to start playing with Fusion Tables:
From your local tables, uploading them From your Google Spreadsheets, converting them in fusion tables From scratch, i.e. creating a brand new table online From web, searching for an existing data table
Pay-as-you-go effort
Compare
View
An example: World cars production in 2011 Search with keyword: cars.
Compare
View
An example: World cars production in 2011 A public fusion table about cars produced in the world.
Compare
View
An example: World cars production in 2011 Overview of the user interface (stylesheets)
Compare
View
An example: World cars production in 2011 Advanced Geocoding
Compare
View
An example: World cars production in 2011 Cards view mode
Compare
View
An example: World cars production in 2011 Pie view mode, a matter of graph configuration
Compare
View
An example: World cars production in 2011 Intensity map view mode
Compare
View
An example: World cars production in 2011 Intensity map view mode
Compare
View
An example: World cars production in 2011 Intensity map view mode
Compare
View
An example: World cars production in 2011 Chart view mode
Compare
View
A join example: World cars production and World population in 2011 Let’s compare with other data around the web: world population
Compare
View
A join example: World cars production and World population in 2011 You can virtually merge together: mapping source column to target
column.
Compare
View
A join example: World cars production and World population in 2011 You choose which columns to project out after the merge. The merged table will be computed dinamically from the two base tables,
using the rows from the first table. Changes to the base tables will be reflected in the merged table and vice versa
Compare
View
Comparing world population and world cars production Compare
View
Comparing world population and world cars production Compare
View
China and India? Compare
View
China? Compare
View
Under the hood
Requests from Google Fusion Tables website Standalone Applications via API Charts and maps from widgets
Formalizing requests into queries
Scheduling a query plan
Triggering query plans
Asking source tables
Under the hood
Column-based Data values versioning Table level transaction management
Under the hood
rows table No PK specified: generated
as <table, row> Indexed and non-indexed
properties
schemas table PK: <table> Columns: table columns, with
their preferred type Access Control Lists, or public
marker
comments table PK: <table, row, col> Comments
property index table PK: <table, row,
prop_name, prop_value> Only keys
Under the hood
Operations Key lookup Prefix scan Range scan
Example 1: retrieving all rows in a table1. Suppose a SQL-like query
select country, rank from cars_prod2. It is transformed into a prefix scan over rows table
Prefix: <cars_prod, ANY>3. Projections are applied to show only columns country and rank
Example 2: filtering out rows from a table1. Suppose a SQL-like query
select country, rank from cars_prod where rank <= 102. It uses the properties index to retrieve which rows to show
Prefix: <cars_prod, ?, rank, 10>3. Projections are applied to show only columns country and rank among
rows selected in rows table by id
Peer-to-peer data exchange and integration No central global schema Data exchange only at import time
No wrappers Same underlying infrastructure
Virtualization mechanism based on query Single [big] table Prefix-based mappings
Autonomy of data sources No logical transparency support No pre-aggregated data, only realtime declarative queries Virtual data federation
Any way to build an ontology on top of Google Fusion Tables? Data inaccuracy Data incompleteness No way to define integrity constraints
No primary keys No inclusion dependencies between tables No exclusion dependencies between tables No foreign keys
Can use as RDBMS? Need a layer on top to guarantee constraints Manual mappings
It’s NoSQL, you can use it at your risk...
Thank You