why google fusion tables is not a data integration tool

Data Management, Integration and Collaboration in the cloud

Gianni Fiore

from Google Labs

3

An experimental Web Service for data management First seen on the web in 2009 Related scientific paper in 2010 Default Google Application from 2011

Main goal: providing seamless data integration over the web

Technology Google search engine for data tables over the web Google Documents: a collaborative, community-based documental base

Realtime collaboration Comment versioning

Google Drive: a cloud storage service Abstract how the data is accessed/stored

Google Visualization API Google Maps

Resources Very large amount of public and private data over the web

Fusion Tables Model Relational Model

Relation as a collection of homogeneous data records Spreadsheet-like data arrangement

Conjunctive queries over data: Select, to filter out a subset of the records Project, to show only properties of interest Join, to enrich data with other relations

Virtually creates new view of the same data Auto-parsing of data types (text, number, datetime or location) Advanced location parsing

Compare

View

Sharing Model Advanced permissions model Invite any people to work with you

As viewers, can see and comment As contributors, can also merge, but not edit As collaborators, can merge and edit as you As owners, can also invite others

Compare

View

Sharing Model Realtime data changes

Cell-based data versioning Cell, row and column based comments

Compare

View

Let’s have an example Three ways to start playing with Fusion Tables:

From your local tables, uploading them From your Google Spreadsheets, converting them in fusion tables From scratch, i.e. creating a brand new table online From web, searching for an existing data table

Pay-as-you-go effort

Compare

View

An example: World cars production in 2011 Search with keyword: cars.

Compare

View

An example: World cars production in 2011 A public fusion table about cars produced in the world.

Compare

View

An example: World cars production in 2011 Overview of the user interface (stylesheets)

Compare

View

An example: World cars production in 2011 Advanced Geocoding

Compare

View

An example: World cars production in 2011 Cards view mode

Compare

View

An example: World cars production in 2011 Pie view mode, a matter of graph configuration

Compare

View

An example: World cars production in 2011 Intensity map view mode

Compare

View

An example: World cars production in 2011 Chart view mode

Compare

View

A join example: World cars production and World population in 2011 Let’s compare with other data around the web: world population

Compare

View

A join example: World cars production and World population in 2011 You can virtually merge together: mapping source column to target

column.

Compare

View

A join example: World cars production and World population in 2011 You choose which columns to project out after the merge. The merged table will be computed dinamically from the two base tables,

using the rows from the first table. Changes to the base tables will be reflected in the merged table and vice versa

Compare

View

Comparing world population and world cars production Compare

View

China and India? Compare

View

China? Compare

View

Under the hood

Requests from Google Fusion Tables website Standalone Applications via API Charts and maps from widgets

Formalizing requests into queries

Scheduling a query plan

Triggering query plans

Asking source tables

Under the hood

Column-based Data values versioning Table level transaction management

Under the hood

rows table No PK specified: generated

as <table, row> Indexed and non-indexed

properties

schemas table PK: <table> Columns: table columns, with

their preferred type Access Control Lists, or public

marker

comments table PK: <table, row, col> Comments

property index table PK: <table, row,

prop_name, prop_value> Only keys

Under the hood

Operations Key lookup Prefix scan Range scan

Example 1: retrieving all rows in a table1. Suppose a SQL-like query

select country, rank from cars_prod2. It is transformed into a prefix scan over rows table

Prefix: <cars_prod, ANY>3. Projections are applied to show only columns country and rank

Example 2: filtering out rows from a table1. Suppose a SQL-like query

select country, rank from cars_prod where rank <= 102. It uses the properties index to retrieve which rows to show

Prefix: <cars_prod, ?, rank, 10>3. Projections are applied to show only columns country and rank among

rows selected in rows table by id

Peer-to-peer data exchange and integration No central global schema Data exchange only at import time

No wrappers Same underlying infrastructure

Virtualization mechanism based on query Single [big] table Prefix-based mappings

Autonomy of data sources No logical transparency support No pre-aggregated data, only realtime declarative queries Virtual data federation

Any way to build an ontology on top of Google Fusion Tables? Data inaccuracy Data incompleteness No way to define integrity constraints

No primary keys No inclusion dependencies between tables No exclusion dependencies between tables No foreign keys

Can use as RDBMS? Need a layer on top to guarantee constraints Manual mappings

It’s NoSQL, you can use it at your risk...

Thank You

why google fusion tables is not a data integration tool

Data & Analytics