Factual presentation for PG West 2010

Factual Eric Lui Software Engineer, Data Storage [email protected]

Upload: ericlui

Post on 18-Jul-2015

TRANSCRIPT

Page 1: Factual presentation for pg west 2010

Factual

 

Eric Lui
Software Engineer, Data Storage
[email protected]

 

Page 2: Factual presentation for pg west 2010

What is Factual.com?

Factual is a platform for sharing, mashing, and publishing open data.

Page 3: Factual presentation for pg west 2010

Crowd-Sourced Data

… is terrific!

• Verifiable
• Vote-driven
• Customizable

Page 4: Factual presentation for pg west 2010

Demo

Page 5: Factual presentation for pg west 2010

Data Storage

Goal:
• 10M tables
• 1B rows (summarized)
• 10B inputs (or "votes")

Raw storage
• 1TB per input server
• 100MB+ per dataset

Page 6: Factual presentation for pg west 2010

What does all this "scale" mean?

Map-Reduce is the right architecture for us:
• High volume storage
• Scales (with the right design)
• Shards and partitions in-place
• Minimal downtime
• Throwaway intermediary stages

Page 7: Factual presentation for pg west 2010

What does all this "scale" mean?

• Hard to profile
• Hard to predict what table will get "hot"
• Performance tuning has to be general, unless we're on a Service Level Agreement and can devote DBA resources (not our core strength)
• Map-Reduce is not real time

Page 8: Factual presentation for pg west 2010

Data Storage

Challenges
• Summarization operations are memory-intensive
• N-way merging is expensive (i.e., slow)
• Streaming is necessary to serve back full summaries
• Common use case is just the first N rows
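The merge and streaming points above can be sketched in Python. This is a minimal illustration, assuming each input source yields rows already sorted by a row key; the names `merge_sorted_shards` and `first_n_rows` and the `(row_key, value)` row shape are hypothetical and are not taken from Factual's code. Because `heapq.merge` is lazy, serving only the first N rows does not materialize the full summary.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, str]  # (row_key, value); hypothetical row shape


def merge_sorted_shards(shards: Iterable[Iterable[Row]]) -> Iterator[Row]:
    """Lazily N-way merge shards that are each already sorted by row_key."""
    return heapq.merge(*shards, key=lambda row: row[0])


def first_n_rows(shards: Iterable[Iterable[Row]], n: int) -> List[Row]:
    """Serve only the first n rows without consuming the rest of the stream."""
    return list(islice(merge_sorted_shards(shards), n))


if __name__ == "__main__":
    shard_a = [("a", "1"), ("d", "4")]
    shard_b = [("b", "2"), ("e", "5")]
    shard_c = [("c", "3")]
    print(first_n_rows([shard_a, shard_b, shard_c], 3))
```

A caller that needs the full summary can iterate over `merge_sorted_shards` directly and stream rows out as they arrive.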

Page 9: Factual presentation for pg west 2010

Emerging Patterns

• Many Reads
• (Relatively) Few New rows
• (Very) Few row Updates
• Infrequent (< 1 per day) table-wide re-summarizations

Page 10: Factual presentation for pg west 2010

High Availability

Votestore
• 3x Redundancy

Page 11: Factual presentation for pg west 2010

High Availability

Problem: Summarization is slow.

Page 12: Factual presentation for pg west 2010

High Availability

Problem: Summarization is slow.
Solution: Build a caching layer.

Page 13: Factual presentation for pg west 2010

High Availability

Problem: Summarization is slow.
Solution: Build a caching layer.

Cache
• 3x Replication
• "Dumb" load balancing
• Server Affinity (via Zookeeper)
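One way to picture server affinity with 3x replication is a deterministic mapping from a table to three cache servers. The slide says affinity is coordinated via Zookeeper; the sketch below does not use Zookeeper and instead substitutes rendezvous (highest-random-weight) hashing purely for illustration. The function name `replicas_for` and the server names are hypothetical.

```python
import hashlib
from typing import List, Sequence


def replicas_for(table_id: str, servers: Sequence[str], copies: int = 3) -> List[str]:
    """Pick `copies` servers for a table using rendezvous hashing.

    Stand-in for the Zookeeper-based affinity mentioned on the slide.
    """
    def score(server: str) -> int:
        digest = hashlib.sha256(f"{server}:{table_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring servers own the table; the result does not depend
    # on the order in which servers are listed.
    return sorted(servers, key=score, reverse=True)[:copies]


if __name__ == "__main__":
    servers = ["cache1", "cache2", "cache3", "cache4", "cache5"]
    print(replicas_for("table-42", servers))
```

A "dumb" load balancer could then forward a request for a table to any of the three servers returned.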

Page 14: Factual presentation for pg west 2010

Metaphor Shear

Why PostgreSQL?

Pros
• End-user expectations map to RDBMS world
• Indexing on common operations
  o (ORDER BY, WHERE)
• Full-text search
• Latitude/longitude/geo functions with PostGIS
• Aggregation on summarized results
• Built-in persistence

Page 15: Factual presentation for pg west 2010

Metaphor Shear

Why PostgreSQL?

Cons
• No built-in "versioning"
• Re-summarization, though infrequent, is expensive
• Need to map Lisp-based query language to SQL
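The last point, mapping a Lisp-style query to SQL, can be sketched as a small recursive translator. The slides do not show the actual query language, so the prefix-notation tuples below are a hypothetical stand-in, as is the function name `to_sql`. The `%s` placeholders assume a DB-API driver that uses that parameter style, such as psycopg2.

```python
from typing import Any, List, Tuple

COMPARISONS = {"=", "<", ">", "<=", ">=", "!="}
CONNECTIVES = {"and": " AND ", "or": " OR "}


def to_sql(expr: tuple) -> Tuple[str, List[Any]]:
    """Translate a prefix-notation filter into (where_clause, params)."""
    op, *args = expr
    if op in COMPARISONS:
        column, value = args
        if not column.isidentifier():
            raise ValueError(f"bad column name: {column!r}")
        # Values travel as bind parameters, never as SQL text.
        return f'"{column}" {op} %s', [value]
    if op in CONNECTIVES:
        if not args:
            raise ValueError(f"{op!r} needs at least one sub-expression")
        clauses: List[str] = []
        params: List[Any] = []
        for sub in args:
            clause, sub_params = to_sql(sub)
            clauses.append(f"({clause})")
            params.extend(sub_params)
        return CONNECTIVES[op].join(clauses), params
    raise ValueError(f"unsupported operator: {op!r}")


if __name__ == "__main__":
    query = ("and", ("=", "city", "Los Angeles"), (">", "rating", 3))
    print(to_sql(query))
```

Keeping values out of the generated SQL text means the translator only has to validate operators and column names.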

Page 16: Factual presentation for pg west 2010

High Availability

Why PostgreSQL?

Other considerations
• Must proactively store attributes
• Schema changes are expensive
• Handling "upsert" operations is awkward
• Deletes are difficult (but infrequent)
• (related) No concept of row merge

Page 17: Factual presentation for pg west 2010
Page 18: Factual presentation for pg west 2010

Demo

Page 19: Factual presentation for pg west 2010

Cache Consistency

ACID? Not really...

High-concurrency favored over database-style transactions

Page 20: Factual presentation for pg west 2010

Cache Consistency

ACID? Not really...

Eventually Consistent

Page 21: Factual presentation for pg west 2010

Consistency Challenges

Cache Invalidation
• How do I handle new inputs?

Page 22: Factual presentation for pg west 2010

Consistency Challenges

Cache Invalidation
• How do I handle new inputs?
  o Shield the Input Store
    ▪ Low-priority - shield the input store
    ▪ Row-level invalidations
  o Lazy
    ▪ Fetch updated rows on summary request
    ▪ Leverage postgres to track invalidations
  o Decouple
    ▪ From Input API call
    ▪ Async notification
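The row-level, lazy invalidation idea above can be sketched as follows. This is an illustration under assumptions: the class `LazySummaryCache` and the `fetch_row` callback are hypothetical, and an in-memory set stands in for the invalidation tracking that the slide says is kept in postgres.

```python
from typing import Callable, Dict, Set


class LazySummaryCache:
    """Row-level invalidation with lazy refetch (illustrative only)."""

    def __init__(self, fetch_row: Callable[[str], str]) -> None:
        self._fetch_row = fetch_row  # hypothetical call into the input store
        self._rows: Dict[str, str] = {}
        self._stale: Set[str] = set()  # stand-in for tracking in postgres

    def invalidate(self, row_id: str) -> None:
        """Record that a new input arrived; does not touch the input store."""
        self._stale.add(row_id)

    def get(self, row_id: str) -> str:
        """Refetch a row only when it is requested and is stale or missing."""
        if row_id in self._stale or row_id not in self._rows:
            self._rows[row_id] = self._fetch_row(row_id)
            self._stale.discard(row_id)
        return self._rows[row_id]


if __name__ == "__main__":
    cache = LazySummaryCache(lambda row_id: f"summary of {row_id}")
    print(cache.get("row-1"))
    cache.invalidate("row-1")  # cheap: nothing is fetched yet
    print(cache.get("row-1"))  # the refetch happens here
```

Because `invalidate` only marks a row, a burst of new inputs for one row costs a single refetch when the row is next read.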

Page 23: Factual presentation for pg west 2010

Consistency Challenges

Cache Instance Management
• How do we handle query changes?
  o filtering out spam inputs
  o change the aggregation function
  o give more weight to table owner's votes
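The three kinds of query change listed above can be pictured as parameters of a summarization function. This is a minimal sketch, assuming a summary is the value with the highest total vote weight; the `Vote` type, the `summarize` function, and its `weight` and `is_spam` parameters are hypothetical and not Factual's actual aggregation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class Vote:
    user: str
    value: str


def summarize(
    votes: Iterable[Vote],
    weight: Callable[[Vote], float] = lambda vote: 1.0,
    is_spam: Callable[[Vote], bool] = lambda vote: False,
) -> Optional[str]:
    """Return the value with the highest total weight, ignoring spam votes."""
    totals: Dict[str, float] = defaultdict(float)
    for vote in votes:
        if not is_spam(vote):
            totals[vote.value] += weight(vote)
    return max(totals, key=totals.get) if totals else None


if __name__ == "__main__":
    votes = [Vote("alice", "90210"), Vote("bob", "90211"), Vote("carol", "90211")]
    print(summarize(votes))
    # Give the (hypothetical) table owner "alice" triple weight.
    print(summarize(votes, weight=lambda v: 3.0 if v.user == "alice" else 1.0))
```

Changing any of these parameters changes the summary for every row, which is why a query change implies rebuilding the cached table.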

Page 24: Factual presentation for pg west 2010

Consistency Challenges

Cache Instance Management
• Simple Re-cache
  o Dump the current cached copy, and re-cache.
  o Slow
  o Poor user experience

Page 25: Factual presentation for pg west 2010

Consistency Challenges

Cache Instance Management
• Better solution: Double Buffering
  o Reload new version in background
  o Continue to serve current table
    ▪ "closest match" warning
  o Allow switch-back
    ▪ Continue to accept invalidations against old table
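Double buffering as described above can be sketched with two table references and a swap. This is an illustration only: the class `DoubleBufferedTable` and its methods are hypothetical, tables are modeled as plain dictionaries, and the sketch does not model the "closest match" warning or invalidations against the old table.

```python
import threading
from typing import Callable, Dict, Optional

Table = Dict[str, str]  # row_id -> summarized value; hypothetical shape


class DoubleBufferedTable:
    """Serve the current table while a new version is built, then swap."""

    def __init__(self, initial: Table) -> None:
        self._lock = threading.Lock()
        self._current: Table = initial
        self._previous: Optional[Table] = None

    def read(self, row_id: str) -> Optional[str]:
        with self._lock:
            return self._current.get(row_id)

    def reload(self, build: Callable[[], Table]) -> None:
        """Build the new version outside the lock, then swap it in."""
        new_table = build()  # slow step; readers keep using the current table
        with self._lock:
            self._previous, self._current = self._current, new_table

    def switch_back(self) -> bool:
        """Return to the previously served version, if one is retained."""
        with self._lock:
            if self._previous is None:
                return False
            self._current, self._previous = self._previous, self._current
            return True


if __name__ == "__main__":
    table = DoubleBufferedTable({"row-1": "old summary"})
    worker = threading.Thread(target=table.reload, args=(lambda: {"row-1": "new summary"},))
    worker.start()
    worker.join()
    print(table.read("row-1"))
    table.switch_back()
    print(table.read("row-1"))
```

Running `reload` on a background thread keeps reads available during the rebuild, and retaining the previous table is what makes switch-back cheap.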

Page 26: Factual presentation for pg west 2010

Performance

Encoding-compliant tablespaces
• Support UTF-8, non-Latin sort orders

Select Tables get SSD-based PostgreSQL caching
• See Jignesh Shah's terrific slides from PgEast 2009
• http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
• 20x improvement in random reads (IO pattern for unclustered index reads)
• 2x improvement on sequential writes (generally pretty smooth)

Page 27: Factual presentation for pg west 2010

What's next?

Encoding-compliant tablespaces
• Support UTF-8, non-Latin sort orders

Select Tables get SSD-based PostgreSQL caching
• See Jignesh Shah's terrific slides from PgEast 2009
• http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
• 20x improvement in random reads (IO pattern for unclustered index reads)
• 2x improvement on sequential writes (generally pretty smooth)

Page 28: Factual presentation for pg west 2010
Page 29: Factual presentation for pg west 2010

How can I use Factual?

Web UI
• Dataset Creation
• Workbench
http://www.factual.com/

APIs
• Server API
http://wiki.developer.factual.com/FrontPage
• Visualizations
http://wiki.developer.factual.com/Factual-Visualization-Documentation

Page 30: Factual presentation for pg west 2010

Questions

Page 31: Factual presentation for pg west 2010

[email protected]

Twitter: @factualinc

http://blog.factual.com