Manta: a new internet-facing object storage facility that features compute, by Bryan Cantrill

ZFS & Zones: Your Compute fell into My Data!

Bryan Cantrill

SVP, Engineering

[email protected]

@bcantrill


The filesystem: Some prehistory

• When they were originally developed in the 1970s, filesystems were designed as an abstraction over a disk

• Over time, it became increasingly expensive to make bigger disks — and reliability suffered

• In the 1980s, both problems were solved by using many hard drives instead of ever-larger drives: a redundant array of inexpensive disks (RAID)

• Even though filesystems were still relatively young at the time, it was deemed too complicated to rewrite them to accommodate the (new) notion of many disks

• This software problem was solved by introducing a new layer of software: the volume manager

The volume management divide

• Volume management abstracts many physical devices into single logical volumes, allowing filesystems to retain a one-to-one mapping with a device (a logical one)

• This gave rise to a problematic divide:

• The volume manager understands multiple disks, but nothing of the higher level semantics of the filesystem

• The filesystem understands the higher-level semantics of the data, but has no understanding of the physical devices

• This divide became entrenched over the 1990s, and had devastating ramifications for reliability, performance and manageability

Volume management deficiencies

• Because the volume management layer had no notion of the transactional semantics of the filesystem, system failure induced excruciating file system checks

• Worse, the system was left with no protection against many variants of device-level data corruption:

• The only failure the volume manager can reasonably detect is media failure that results in incorrect data on disk

• This doesn’t account for phantom reads (i.e., the wrong disk block is read from), phantom writes (i.e., the wrong disk block is written to) or driver pathologies (e.g. memory errors)

• And because filesystems did not understand more than one device, device failure often meant filesystem failure

Volume management deficiencies

• Lacking visibility into the hardware layer, the filesystem could not effectively use the parallelism inherent in multiple disks — and could not effectively schedule I/O

• Spindles were underutilized (leaving bandwidth and/or IOPS on the table) or overutilized (thrashing the device and yielding pathological performance)

• Management was a nightmare: filesystems could not be expanded or shrunk — requiring every filesystem to know in advance its intended capacity

The ZFS revolution

• Starting in 2001, Sun began a revolutionary new software effort: to unify storage and eliminate the divide

• In this model, filesystems would lose their one-to-one association with devices: many filesystems would be multiplexed on many devices

• By starting with a clean sheet of paper, ZFS opened up vistas of innovation — and by its architecture was able to solve many otherwise intractable problems

• Sun shipped ZFS in 2005, and used it as the foundation of its enterprise storage products starting in 2008

• ZFS was open sourced in 2005; it remains the only open source enterprise-grade filesystem

ZFS advantages

• Copy-on-write design allows on-disk consistency to be always assured (eliminating the file system check)

• Copy-on-write design allows constant-time snapshots in unlimited quantity — and writable clones!

• Filesystem architecture allows filesystems to be created instantly and expanded — or shrunk! — on-the-fly

• Integrated volume management allows for intelligent device behavior with respect to disk failure and recovery

• Adaptive replacement cache (ARC) allows for optimal use of DRAM, especially on high-DRAM systems

• Support for dedicated log and cache devices allows for optimal use of flash-based SSDs

ZFS at Joyent

• Joyent was the earliest ZFS adopter, becoming (in 2005) the first production user of ZFS outside of Sun

• ZFS is one of the four foundational technologies of Joyent’s SmartOS, our illumos derivative

• The other three foundational technologies in SmartOS are DTrace, Zones and KVM

• Search “fork yeah illumos” for the (uncensored) history of OpenSolaris, illumos, SmartOS and derivatives

• Joyent has extended ZFS to provide better support for multi-tenant operation with I/O throttling

ZFS as the basis for object storage?

• We view ZFS as our most foundational differentiator...

• As we began to think about building our own internet-facing object store in the fall of 2011, we naturally gravitated to ZFS...

• Could we extend ZFS in some important way that would offer something interesting and compelling?

• Short answer: meh

Aside: Virtualization in the cloud

• Operating a public cloud has significant technological and business challenges:

• From a technological perspective, must deliver highly elastic infrastructure with acceptable quality of service across a broad class of users and applications

• From a business perspective, must drive utilization as high as possible while still satisfying customer expectations for quality of service

• These aspirations are in tension: multi-tenancy can significantly degrade quality of service

• The key enabling technology for multi-tenancy is virtualization, but where in the stack to virtualize?

Hardware-level virtualization?

• The historical answer, since the 1960s, has been to virtualize at the level of the hardware:

• A virtual machine is presented upon which each tenant runs an operating system of their choosing

• There are as many operating systems as tenants

• The historical motivation for hardware virtualization remains its advantage today: it can run entire legacy stacks unmodified

• However, hardware virtualization exacts a heavy toll: operating systems are not designed to share resources like DRAM, CPU, I/O devices or the network

• Hardware virtualization limits tenancy and inhibits performance!

Platform-level virtualization?

• Virtualizing at the application platform layer addresses the tenancy challenges of hardware virtualization…

• ...but at the cost of dictating abstraction to the developer

• This creates the “Google App Engine problem”: developers are in a straitjacket where toy programs are easy but sophisticated apps are impossible

• Virtualizing at the application platform layer poses many other challenges:

• Security, resource containment, language specificity, environment-specific engineering costs

Joyent’s solution: OS-level virtualization

• Virtualizing at the OS level hits the sweet spot:

• Single OS (single kernel) allows for efficient use of hardware resources, and therefore allows load factors to be high

• Disjoint instances are securely compartmentalized by the operating system

• Gives customers what appears to be a virtual machine (albeit a very fast one) on which to run higher-level software

• Gives customers PaaS when the abstractions work for them, IaaS when they need more generality

• OS-level virtualization allows for high levels of tenancy without dictating abstraction or sacrificing efficiency

• Zones is a bullet-proof implementation of OS-level virtualization, and is the core abstraction in Joyent’s SmartOS

Idea: ZFS + Zones?

Manta: ZFS + Zones!

• Building a sophisticated distributed system on top of ZFS and zones, we have built Manta, an internet-facing object storage system offering in situ compute

• That is, the description of compute can be brought to where objects reside instead of having to backhaul objects to transient compute

• The abstractions made available for computation are anything that can run on the OS...

• ...and as a reminder, the OS (Unix) was built around the notion of ad hoc unstructured data processing, and allows for remarkably terse expressions of computation

Aside: Unix

• When Unix appeared in the early 1970s, it was not just a new system, but a new way of thinking about systems

• Instead of a sealed monolith, the operating system was a collection of small, easily understood programs

• First Edition Unix (1971) contained many programs that we still use today (ls, rm, cat, mv)

• Its very name conveyed this minimalist aesthetic: Unix is a homophone of “eunuchs” — a castrated Multics

We were a bit oppressed by the big system mentality. Ken wanted to do something simple. — Dennis Ritchie

Unix: Let there be light

• In 1969, Doug McIlroy had the idea of connecting different components:

At the same time that Thompson and Ritchie were sketching out a file system, I was sketching out how to do data processing on the blackboard by connecting together cascades of processes

• This was the primordial pipe, but it took three years to persuade Thompson to adopt it:

And one day I came up with a syntax for the shell that went along with the piping, and Ken said, “I’m going to do it!”

Unix: ...and there was light

And the next morning we had this orgy of one-liners. — Doug McIlroy

The Unix philosophy

• The pipe — coupled with the small-system aesthetic — gave rise to the Unix philosophy, as articulated by Doug McIlroy:

• Write programs that do one thing and do it well

• Write programs to work together

• Write programs that handle text streams, because that is a universal interface

• Four decades later, this philosophy remains the single most important revolution in software systems thinking!

Doug McIlroy v. Don Knuth: FIGHT!

• In 1986, Jon Bentley posed the challenge that became the Epic Rap Battle of computer science history:

Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

• Don Knuth’s solution: an elaborate program in WEB, a Pascal-like literate programming system of his own invention, using a purpose-built algorithm

• Doug McIlroy’s solution shows the power of the Unix philosophy:

    tr -cs A-Za-z '\n' | tr A-Z a-z | \
    sort | uniq -c | sort -rn | sed ${1}q
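McIlroy's pipeline reads its text on standard input and takes n as the script's first argument. As a minimal sketch, it can be wrapped in a shell function and tried on a one-line sample; the function name `top_words` and the sample sentence are illustrative and not part of the original solution.

```shell
#!/bin/sh
# top_words N: read text on stdin, print the N most frequent words with counts.
# (top_words is a hypothetical wrapper name; the pipeline itself is McIlroy's.)
top_words() {
    tr -cs 'A-Za-z' '\n' |  # turn each run of non-letters into one newline
        tr 'A-Z' 'a-z' |    # fold everything to lower case
        sort |              # bring identical words together
        uniq -c |           # collapse each run into "count word"
        sort -rn |          # highest count first
        sed "${1}q"         # stop after N lines
}

# Usage: the two most frequent words in a short sample.
printf 'the cat and the dog and the bird\n' | top_words 2
```

For this sample the two lines printed are the counts for "the" (3) and "and" (2); the amount of leading whitespace that `uniq -c` puts before each count varies between implementations.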

Big Data: History repeats itself?

• The original Google MapReduce paper (Dean et al., OSDI ’04) poses a problem disturbingly similar to Bentley’s challenge nearly two decades prior:

Count of URL Access Frequency: The map function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair

• But the solutions do not adhere to the Unix philosophy...

• ...nor do they make use of the substantial Unix foundation for data processing

• e.g., Appendix A of the OSDI ’04 paper has a 71-line word count in C++, with nary a wc in sight
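For comparison, the URL access frequency problem quoted above can be sketched with the same Unix tools. This is only an illustration on a single machine: the function name `url_counts` and the log format (request method, URL, status code, with the URL in the second field) are assumptions, not something taken from the paper.

```shell
#!/bin/sh
# url_counts: read request lines on stdin, print "count URL" pairs.
# Assumes a hypothetical log format with the URL in the second field.
url_counts() {
    awk '{ print $2 }' |  # "map": emit the URL of each request
        sort |            # group identical URLs together
        uniq -c |         # "reduce": count each group
        sort -rn          # most requested first
}

# Usage: three sample requests, two of them for the same URL.
printf '%s\n' \
    'GET /index.html 200' \
    'GET /about.html 200' \
    'GET /index.html 304' | url_counts
```

Here `sort` plays the role of the shuffle between the two phases, which is why no explicit ⟨URL, 1⟩ pairs need to be emitted.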

Manta: Unix for Big Data

• Manta allows for an arbitrarily scalable variant of McIlroy’s solution to Bentley’s challenge:

    mfind -t o /bcantrill/public/v7/usr/man | \
        mjob create -o -m "tr -cs A-Za-z '\n' | \
        tr A-Z a-z | sort | uniq -c" -r \
        "awk '{ x[\$2] += \$1 } END { for (w in x) { print x[w] \" \" w } }' | \
        sort -rn | sed ${1}q"

• This description is not only terse, it is high performing: data is left at rest, with the “map” phase doing heavy reduction of the data stream

• As such, Manta, like Unix, is not merely syntactic sugar; it converges compute and data in a new way
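The structure of that job can be tried without a Manta account. The sketch below is a local stand-in, not Manta itself: it runs the same map command once per input file and feeds the combined output to the same reduce command. The function names (`map_phase`, `reduce_phase`, `word_freq`) and the two sample files are hypothetical, and a plain `for` loop replaces the per-object scheduling that Manta would do.

```shell
#!/bin/sh
# map_phase: runs once per input object; emits "count word" lines for it.
map_phase() {
    tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c
}

# reduce_phase N: sum the per-object counts for each word, keep the top N.
reduce_phase() {
    awk '{ x[$2] += $1 } END { for (w in x) print x[w] " " w }' |
        sort -rn | sed "${1}q"
}

# word_freq N FILE...: map each file independently, then reduce everything.
# The loop is a local substitute for running the map phase where objects live.
word_freq() {
    n=$1
    shift
    for f in "$@"; do
        map_phase < "$f"
    done | reduce_phase "$n"
}

# Usage: two small files standing in for a directory of objects.
dir=$(mktemp -d)
printf 'the cat the dog\n' > "$dir/a.txt"
printf 'the bird and the cat\n' > "$dir/b.txt"
word_freq 2 "$dir/a.txt" "$dir/b.txt"
rm -rf "$dir"
```

With these two files the sketch prints "4 the" followed by "2 cat". Because each map invocation has already collapsed its input to one line per distinct word, the reduce step only sees the much smaller per-file counts.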

Manta: CAP tradeoffs

• Eventual consistency represents the wrong CAP tradeoffs for most; we prefer consistency over availability for writes (but still availability for reads)

• Many more details: http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/

• Celebrity endorsement:


Manta: Other design principles

• Hierarchical storage is an excellent idea (ht: Multics); Manta implements proper directories, delimited with a forward slash

• Manta implements a snapshot/link hybrid dubbed a snaplink; it can be used to effect versioning

• Manta has full support for CORS headers

• Manta uses SSH-based HTTP auth for client-side tooling (IETF draft-cavage-http-signatures-00)

• Manta SDKs exist for node.js, Java, Ruby, Python

• “npm install manta” for command line interface

Manta and the future of big data

• We believe compute/data convergence to be the future of big data: stores of record must support computation as a first-class, in situ operation

• We believe that Unix is a natural way of expressing this computation, and that the OS is the right level at which to virtualize to support this securely

• We believe that ZFS is the only sane storage underpinning for such a system

• Manta will surely not be the only system to represent the confluence of these, but it is the first

• We are actively retooling our software stack in terms of Manta; Manta is changing the way we develop software!

Manta: More information

• Product page: http://joyent.com/products/manta

• node.js module: https://github.com/joyent/node-manta

• Manta documentation: http://apidocs.joyent.com/manta/

• IRC, e-mail, Twitter, etc.: #manta on freenode, [email protected], @mcavage, @dapsays, @yunongx, @joyent








