
Putting Freebase in a Star Schema

What's Freebase?

Freebase is an open database of things that exist in the world: things like people, places, songs and television shows. As of the January 2009 dump, Freebase contained about 241 million facts, and it's growing all the time. You can browse it via the web and even edit it, much like Wikipedia. Freebase also has an API that lets programs add data and make queries using a language called MQL. Freebase is complementary to DBpedia and other sources of information. Although it takes a different approach to the semantic web than systems based on RDF standards, it interoperates with them via linked data.

The January 2009 Freebase dump is about 500 MB in size. Inside a bzip-compressed file, you'll find something that's similar in spirit to a Turtle RDF file, but is in a simpler format and represents facts as a collection of four values rather than just three.

Your Own Personal Freebase

To start exploring and extracting from Freebase, I wanted to load the database into a star schema in a mysql database — an architecture similar to some RDF stores, such as ARC. The project took about a week of time on a modern x86 server with 4 cores and 4 GB of RAM and resulted in an 18 GB collection of database files and indexes.

This is sufficient for my immediate purposes, but future versions of Freebase promise to be much larger: this article examines the means that could be used to improve performance and scalability using parallelism as well as improved data structures and algorithms.

I'm interested in using generic databases such as Freebase and DBpedia as a data source for building web sites. It's possible to access generic databases through APIs, but there are advantages to having your own copy: you don't need to worry about API limits and network latency, and you can ask questions that cover the entire universe of discourse.

Many RDF stores use variations of a format known as a Star Schema for representing RDF triples; the Star Schema is commonly used in data warehousing applications because it can efficiently represent repetitive data. Freebase is similar to, but not quite, an RDF system. Although there are multiple ways of thinking about Freebase, the quarterly dump files provided by Metaweb are presented as quads: groups of four related terms in tab-delimited form. To have a platform for exploring Freebase, I began a project of loading Freebase into a Star Schema in a relational database.

A Disclaimer

Timings reported in this article are approximate. This work was done on a server that was doing other things; little effort was made to control sources of variation such as foreign workload, software configuration and hardware characteristics. I think it's orders of magnitude that matter here: with much larger data sets becoming available, we need tools that can handle databases 10-100 times as big, and quibbling about 20% here or there isn't so important. I've gotten similar results with the ARC triple store. Some products do about an order of magnitude better: the Virtuoso server can load DBpedia, a larger database, in about 16 to 22 hours on a 16 GB computer; several papers on RDF store performance are available [1] [2] [3]. Although the system described in this paper isn't quite an RDF store, its performance is comparable to a relatively untuned RDF store.



It took about a week of calendar time to load the 241 million quads in the January 2009 Freebase into a Star Schema using a modern 4-core web server with 4 GB of RAM; this time could certainly be improved with micro-optimizations, but it's in the same range that people are observing that it takes to load 10^8 triples into other RDF stores. (One product is claimed to load DBpedia, which contains about 100 million triples, in about 16 hours with "heavy-duty hardware".) Data sets exceeding 10^9 triples are rapidly becoming available — these will soon exceed what can be handled with simple hardware and software and will require new techniques: both the use of parallel computers and optimized data structures.

The Star Schema

In a star schema, data is represented in separate fact and dimension tables: all of the rows in the fact table (quad) contain integer keys, and the values associated with the keys are defined in dimension tables (cN_value). This efficiently compresses the data and indexes for the fact table, particularly when the values are highly repetitive.

I loaded data into the following schema:

create table c1_value (
    id integer primary key auto_increment,
    value text,
    key(value(255))
) type=myisam;

... identical c2_value, c3_value and c4_value tables ...

create table quad (
    id integer primary key auto_increment,
    c1 integer not null,
    c2 integer not null,
    c3 integer not null,
    c4 integer not null
) type=myisam;

Although I later created indexes on c1, c2, c3, and c4 in the quad table, I left unnecessary indexes off of the tables during the loading process because it's more efficient to create indexes after loading data in a table. The keys on the value fields of the dimension tables are important, because the loading process does frequent queries to see if values already exist in the dimension table. The sequentially assigned id in the quad table isn't necessary for many applications, but it gives each fact a unique identity and makes the system aware of the order of facts in the dump file.

The Loading Process

The loading script was written in PHP and used a naive method to build the index incrementally. In pseudo code it looked something like this:

function insert_quad($q1,$q2,$q3,$q4) {
    // look up (or create) the dimension key for each of the four values
    $c1=get_dimension_key(1,$q1);
    $c2=get_dimension_key(2,$q2);
    $c3=get_dimension_key(3,$q3);
    $c4=get_dimension_key(4,$q4);
    $conn->insert_row("quad",null,$c1,$c2,$c3,$c4);
}

function get_dimension_key($index,$value) {
    // in-memory cache of recently seen values, cleared every 500,000 facts
    $cached_value=check_cache($value);
    if ($cached_value) return $cached_value;

    $table="c{$index}_value";
    $row=$conn->fetch_row_by_value($table,$value);
    if ($row) return $row->id;

    $conn->insert_row($table,$value);
    return $conn->last_insert_id;
}


Caching frequently used dimension values improves performance by a factor of five or so, at least in the early stages of loading. A simple cache management algorithm, clearing the cache every 500,000 facts, controls memory use. Timing data shows that a larger cache or better replacement algorithm would make at most an incremental improvement in performance. (Unless a complete dimension table index can be held in RAM, in which case all read queries can be eliminated.)

I performed two steps after the initial load:

1. Created indexes on quad(c1), quad(c2), quad(c3) and quad(c4)
2. Used myisam table compression to reduce database size and improve performance

Loading Performance

It took about 140 hours (nearly 6 days) to do the initial load. Here's a graph of facts loaded vs. elapsed time:

The important thing about this graph is that it's convex upward: the loading process slows down as the number of facts increases. The first 50 million quads are loaded at a rate of about 6 million per hour; the last 50 million are loaded at a rate of about 1 million per hour. An explanation of the details of the curve would be complex, but the log N search performance of B-tree indexes and the ability of the database to answer queries out of the computer's RAM cache would be significant. Generically, all databases will perform the same way, becoming progressively slower as the size of the database increases: you'll eventually reach a database size where the time to load the database becomes unacceptable.

The process of constructing b-tree indexes on the mysql tables took most of a day. On average it took about four hours to construct a b-tree index on one column of quad:

mysql> create index quad_c4 on quad(c4);
Query OK, 243098077 rows affected (3 hours 40 min 50.03 sec)
Records: 243098077  Duplicates: 0  Warnings: 0

It took about an hour to compress the tables and rebuild indexes, at which point the data directory looks like:

-rw-r----- 1 mysql root        8588 Feb 22 18:42 c1_value.frm
-rw-r----- 1 mysql root   713598307 Feb 22 18:48 c1_value.MYD
-rw-r----- 1 mysql root   557990912 Feb 24 10:48 c1_value.MYI
-rw-r----- 1 mysql root        8588 Feb 22 18:56 c2_value.frm
-rw-r----- 1 mysql root      485254 Feb 22 18:46 c2_value.MYD
-rw-r----- 1 mysql root      961536 Feb 24 10:48 c2_value.MYI
-rw-r----- 1 mysql root        8588 Feb 22 18:56 c3_value.frm
-rw-r----- 1 mysql root   472636380 Feb 22 18:51 c3_value.MYD
-rw-r----- 1 mysql root   370497536 Feb 24 10:51 c3_value.MYI
-rw-r----- 1 mysql root        8588 Feb 22 18:56 c4_value.frm
-rw-r----- 1 mysql root  1365899624 Feb 22 18:44 c4_value.MYD
-rw-r----- 1 mysql root  1849223168 Feb 24 11:01 c4_value.MYI
-rw-r----- 1 mysql root          65 Feb 22 18:42 db.opt
-rw-rw---- 1 mysql mysql       8660 Feb 23 17:16 quad.frm


-rw-rw---- 1 mysql mysql  3378855902 Feb 23 20:08 quad.MYD
-rw-rw---- 1 mysql mysql  9927788544 Feb 24 11:42 quad.MYI

At this point it's clear that the indexes are larger than the actual databases: note that c2_value is much smaller than the other tables because it holds a relatively small number of predicate types:

mysql> select count(*) from c2_value;
+----------+
| count(*) |
+----------+
|    14771 |
+----------+
1 row in set (0.04 sec)

mysql> select * from c2_value limit 10;
+----+-------------------------------------------------------+
| id | value                                                 |
+----+-------------------------------------------------------+
|  1 | /type/type/expected_by                                |
|  2 | reverse_of:/community/discussion_thread/topic         |
|  3 | reverse_of:/freebase/user_profile/watched_discussions |
|  4 | reverse_of:/freebase/type_hints/included_types        |
|  5 | /type/object/name                                     |
|  6 | /freebase/documented_object/tip                       |
|  7 | /type/type/default_property                           |
|  8 | /type/type/extends                                    |
|  9 | /type/type/domain                                     |
| 10 | /type/object/type                                     |
+----+-------------------------------------------------------+
10 rows in set (0.00 sec)

The total size of the mysql tablespace comes to about 18 GB, an expansion of about 40 times relative to the bzip2 compressed dump file.

Query Performance

After all of this trouble, how does it perform? Not too bad if we're asking a simple question, such as pulling up the facts associated with a particular object:

mysql> select * from quad where c1=34493;
+---------+-------+------+---------+--------+
| id      | c1    | c2   | c3      | c4     |
+---------+-------+------+---------+--------+
| 2125876 | 34493 |   11 |      69 | 148106 |
| 2125877 | 34493 |   12 | 1821399 |      1 |
| 2125878 | 34493 |   13 | 1176303 | 148107 |
| 2125879 | 34493 | 1577 |      69 | 148108 |
| 2125880 | 34493 |   13 | 1176301 | 148109 |
| 2125881 | 34493 |   10 | 1713782 |      1 |
| 2125882 | 34493 |    5 | 1174826 | 148110 |
| 2125883 | 34493 | 1369 | 1826183 |      1 |
| 2125884 | 34493 | 1578 | 1826184 |      1 |
| 2125885 | 34493 |    5 |      66 | 148110 |
| 2125886 | 34493 | 1579 | 1826185 |      1 |
+---------+-------+------+---------+--------+
11 rows in set (0.05 sec)

Certain sorts of aggregate queries are reasonably efficient, if you don't need to do them too often: we can look up the most common 20 predicates in about a minute:

select
    (select value from c2_value as v where v.id=q.c2) as predicate,
    count(*)
from quad as q
group by c2
order by count(*) desc
limit 20;

+-----------------------------------------+----------+
| predicate                               | count(*) |
+-----------------------------------------+----------+
| /type/object/type                       | 27911090 |
| /type/type/instance                     | 27911090 |
| /type/object/key                        | 23540311 |
| /type/object/timestamp                  | 19462011 |
| /type/object/creator                    | 19462011 |
| /type/permission/controls               | 19462010 |
| /type/object/name                       | 14200072 |
| master:9202a8c04000641f800000000000012e |  5541319 |
| master:9202a8c04000641f800000000000012b |  4732113 |
| /music/release/track                    |  4260825 |
| reverse_of:/music/release/track         |  4260825 |
| /music/track/length                     |  4104120 |
| /music/album/track                      |  4056938 |
| /music/track/album                      |  4056938 |
| /common/document/source_uri             |  3411482 |
| /common/topic/article                   |  3369110 |
| reverse_of:/common/topic/article        |  3369110 |
| /type/content/blob_id                   |  1174046 |
| /type/content/media_type                |  1174044 |
| reverse_of:/type/content/media_type     |  1174044 |
+-----------------------------------------+----------+
20 rows in set (43.47 sec)

You've got to be careful how you write your queries: the above query with the subselect is efficient, but I found it took 5 hours to run when I joined c2_value with quad and grouped on value. A person who wishes to do frequent aggregate queries would find it most efficient to create materialized views of the aggregates.
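MySQL has no built-in materialized views, so one way to get the effect is a summary table rebuilt after each load. A minimal sketch along those lines, using the fast subselect form of the aggregate; the predicate_count table name is my own, not something from the post:

create table predicate_count type=myisam
select c2,
       (select value from c2_value as v where v.id=q.c2) as predicate,
       count(*) as n
from quad as q
group by c2;

create index predicate_count_n on predicate_count(n);

-- the top-20 predicate query now reads from a 14,771-row table
-- instead of scanning 243 million quads
select predicate, n from predicate_count order by n desc limit 20;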

Faster And Larger

It's obvious that the Jan 2009 Freebase is pretty big to handle with the techniques I'm using. One thing I'm sure of is that Freebase will be much bigger next quarter — I'm not going to do it the same way again. What can I do to speed the process up?

Don’t Screw Up

This kind of process involves a number of lengthy steps. Mistakes, particularly if repeated, can waste days or weeks. Although services such as EC2 are a good way to provision servers to do this kind of work, the use of automation and careful procedures is key to saving time and money.

Partition it

Remember how the loading rate of a data set decreases as the size of the set increases? If I could split the data set into 5 partitions of 50 M quads each, I could increase the loading rate by a factor of 3 or so. If I can build those 5 partitions in parallel (which is trivial), I can reduce wallclock time by a factor of 15.
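One way to keep a single logical fact table while loading partitions separately is MySQL's MyISAM MERGE engine; a sketch, with partition table names of my own invention. The dimension tables would need to be shared, or built first, so the same value maps to the same key in every partition:

create table quad_p0 (
    id integer not null,
    c1 integer not null,
    c2 integer not null,
    c3 integer not null,
    c4 integer not null,
    key(c1), key(c2), key(c3), key(c4)
) type=myisam;
-- ... quad_p1 through quad_p4 are defined identically and loaded in parallel;
-- in practice the keys would be added after each partition is loaded ...

-- a MERGE table exposes the five partitions as one table for queries
create table quad_all (
    id integer not null,
    c1 integer not null,
    c2 integer not null,
    c3 integer not null,
    c4 integer not null,
    key(c1), key(c2), key(c3), key(c4)
) type=merge union=(quad_p0, quad_p1, quad_p2, quad_p3, quad_p4);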

Eliminate Random Access I/O

This loading process is slow because of the involvement of random access disk I/O. All of Freebase can be loaded into mysql with the following statement,

LOAD DATA INFILE '/tmp/freebase.dat' INTO TABLE q FIELDS TERMINATED BY '\t';

which took me about 40 minutes to run. Processes that do a "full table scan" on the raw Freebase table with a grep or awk-type pipeline take about 20-30 minutes to complete. Dimension tables can be built quickly if they can be indexed by a RAM hash table. The process that builds the dimension table can emit a list of key values for the associated quads: this output can be sequentially merged to produce the fact table.
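In SQL terms, a bulk, set-oriented version of the load might look something like this. I'm assuming the raw table q filled by the LOAD DATA statement above has four text columns, named t1..t4 here; the joins still probe the value indexes rather than doing a true sequential merge, but the millions of per-quad client round trips disappear:

-- assumed definition of the staging table filled by LOAD DATA INFILE
create table q (
    t1 text, t2 text, t3 text, t4 text
) type=myisam;

-- build each dimension table in one pass over the raw data (shown for c1)
insert into c1_value (value) select distinct t1 from q;

-- rebuild the fact table by joining the raw table back to the dimensions
insert into quad (c1, c2, c3, c4)
select d1.id, d2.id, d3.id, d4.id
from q
join c1_value d1 on d1.value = q.t1
join c2_value d2 on d2.value = q.t2
join c3_value d3 on d3.value = q.t3
join c4_value d4 on d4.value = q.t4;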

Bottle It

Once a data source has been loaded into a database, a physical copy of the database can be made and copied to another machine. Copies can be made in a fraction of the time that it takes to construct the database. A good example is the Amazon EC2 AMI that contains a preinstalled and preloaded Virtuoso database loaded with billions of triples from DBpedia, MusicBrainz, NeuroCommons and a number of other databases. Although the process of creating the image is complex, a new instance can be provisioned in 1.5 hours at the click of a button.

Compress Data Values

Unique object identifiers in Freebase are coded in an inefficient ASCII representation:

mysql> select * from c1_value limit 10;
+----+----------------------------------------+
| id | value                                  |
+----+----------------------------------------+
|  1 | /guid/9202a8c04000641f800000000000003b |
|  2 | /guid/9202a8c04000641f80000000000000ba |
|  3 | /guid/9202a8c04000641f8000000000000528 |
|  4 | /guid/9202a8c04000641f8000000000000836 |
|  5 | /guid/9202a8c04000641f8000000000000df3 |
|  6 | /guid/9202a8c04000641f800000000000116f |
|  7 | /guid/9202a8c04000641f8000000000001207 |
|  8 | /guid/9202a8c04000641f80000000000015f0 |
|  9 | /guid/9202a8c04000641f80000000000017dc |
| 10 | /guid/9202a8c04000641f80000000000018a2 |
+----+----------------------------------------+
10 rows in set (0.00 sec)

These are 38 bytes apiece. The hexadecimal part of the guid could be represented in 16 bytes in a binary format, and it appears that about half of the guid is a constant prefix that could be further excised.

A similar efficiency can be gained in the construction of in-memory dimension tables: md5 or sha1 hashes could be used as proxies for values.
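A sketch of both ideas in SQL; the c1_value_bin table is hypothetical, and the 16-byte packing assumes the 32 hex digits that follow the constant '/guid/' prefix:

-- pack the hex portion of each guid into a fixed 16-byte binary column
create table c1_value_bin (
    id integer primary key auto_increment,
    guid binary(16) not null,
    unique key(guid)
) type=myisam;

insert into c1_value_bin (guid)
select unhex(substring(value, 7))   -- strip '/guid/' and pack the 32 hex digits
from c1_value
where value like '/guid/%';

-- an in-memory dimension builder could likewise key its hash table on a
-- 16-byte md5 of the value instead of the full text
select unhex(md5('/guid/9202a8c04000641f800000000000003b'));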

The Freebase dump is littered with "reverse_of:" properties, which are superfluous if the correct index structures exist to do forward and backward searches.
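For example, with the indexes on c1 and c3 in place, both directions of /music/release/track can be answered from the same forward facts. Here @release_id and @track_id stand in for c1_value/c3_value ids, and I'm assuming each predicate string appears only once in c2_value:

-- forward: the tracks on a release, via the index on c1
select c3 from quad
where c1 = @release_id
  and c2 = (select id from c2_value where value = '/music/release/track');

-- backward: the release a track belongs to, via the index on c3,
-- which makes the reverse_of:/music/release/track facts redundant
select c1 from quad
where c3 = @track_id
  and c2 = (select id from c2_value where value = '/music/release/track');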

Parallelize it


Loading can be parallelized in many ways: for instance, the four dimension tables can be built in parallel. Dimension tables can also be built by a sorting process that can be performed on a computer cluster using map/reduce techniques. A cluster of computers can also store a knowledge base in RAM, trading sequential disk I/O for communication costs. Since the availability of data is going to grow faster than the speed of storage systems, parallelism is going to become essential for handling large knowledge bases — an issue identified by Japanese AI workers in the early 1980s.

Cube it?

Some queries benefit from indexes built on combinations of columns, such as

CREATE INDEX quad_c1_c2 ON quad(c1,c2);

There are 40 combinations of columns on which an index could be useful — however, the cost in time and storage involved in creating those indexes would be excessive. If such indexes were indeed necessary, a multidimensional database can create a cube index that is less expensive than a complete set of B-tree indexes.

Break it up into separate tables?

It might be anathema to many semweb enthusiasts, but I think that Freebase (and parts of Freebase) could be efficiently mapped to conventional relational tables. That's because facts in Freebase are associated with types: see, for instance, Composer from the Music Commons. It seems reasonable to map types to relational tables and to create satellite tables to represent many-to-many relationships between types. This scheme would automatically partition Freebase in a reasonable way and provide an efficient representation where many obvious questions (ex. "Find Female Composers Born In 1963 Who Are More Than 65 inches tall") can be answered with a minimum number of joins.
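A hedged sketch of what that mapping could look like; the table and column names below are invented for illustration and are not taken from the actual Freebase schema:

-- one table per type, plus satellite tables for many-to-many relationships
create table person (
    guid          binary(16) primary key,
    name          varchar(255),
    gender        char(1),
    birth_year    integer,
    height_inches decimal(5,1),
    key(birth_year),
    key(height_inches)
) type=myisam;

create table person_profession (
    person_guid binary(16) not null,
    profession  varchar(64) not null,
    key(person_guid),
    key(profession)
) type=myisam;

-- "Find Female Composers Born In 1963 Who Are More Than 65 inches tall"
-- becomes a single join instead of a chain of self-joins on quad
select p.name
from person p
join person_profession pp on pp.person_guid = p.guid
where pp.profession = 'Composer'
  and p.gender = 'F'
  and p.birth_year = 1963
  and p.height_inches > 65;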

Conclusion

Large knowledge bases are becoming available that cover large areas of human concern: we're finding many applications for them. It's possible to handle databases such as Freebase and DBpedia on a single computer of moderate size; however, the size of generic databases and the hardware to store them on are going to grow larger than the ability of a single computer to process them. Fact stores that (i) use efficient data structures, (ii) take advantage of parallelism, and (iii) can be tuned to the requirements of particular applications are going to be essential for further progress in the Semantic Web.

Credits

Metaweb Technologies, Freebase Data Dumps, January 13, 2009
Kingsley Idehen, for several links about RDF store performance.
Stewart Butterfield, for the encyclopedia photo.

Paul Houle on February 25th 2009 in Freebase, SQL, Semantic Web

Comments (8)

L.G.:
Just use Sphinx: sphinxsearch.com

paul_houle:
Sphinx looks like a nice product, but it looks like it addresses a different question: full-text search. Freebase is a semantic system where, instead of using an imprecise word like "jaguar", you can reference "jaguar the cat" or "jaguar the game console" by a guid. That said, full-text search can be a useful complement to this kind of system.

Years ago I worked on a project called the Global Performing Arts Database (http://www.glopad.org/) where text about a media item was distributed in literally hundreds of different tables, since the system coded statements like

"Picture A was taken during a production of Hamlet"
"Hamlet was written by Shakespeare"

in machine-readable (RDBMS) form. Of course we wanted Picture A to show up in a search for Shakespeare, so we had to do a graph traversal of the RDBMS tables to collect any text that might be relevant to an item (careful not to follow paths that would lead to irrelevant results). This way we'd build up a document vector which we'd index in a conventional full-text system. Can Sphinx support that kind of thing easily?

John Sichi:
If you're going to the effort to transform to a star, it's worth looking into a column store for the DB as well. Besides query acceleration and automatic compression, you can also get much faster load times using bulk load facilities which avoid row-at-a-time index updates. I won't mention the names of any of the contenders here :)

paul_houle:
Got a specific one in mind?

John Sichi:
Well, yeah, LucidDB (http://www.luciddb.org), but since I work on that, I'm biased. :)

Last OSCON, I actually loaded up a WEX dump into LucidDB in between sessions, but the bulk of it was semi-/un-structured text, which wasn't very interesting since LucidDB doesn't support lobs and fulltext/XML search yet.

Also in the open source camp: Infobright is the lowest learning curve for MySQL users; MonetDB has XQuery support and is very fast as long as everything fits in memory.

Martin Gajdos:
It would be nice if you could post the scripts you have used for this. I'd love to take a look at those. Thanks!

Josh Ribakoff:
Nice work, this will be very interesting to anyone coding knowledge based agents for reasoning about the real world.

madtiger:
I've been parsing through this data for a few days and I stumbled across this page. I agree, do you have a copy of these scripts, or is it something proprietary to your business?