
Parallelizing the Web Browser
Christopher Grant Jones, Rose Liu, Leo Meyerovich, Krste Asanovic, Rastislav Bodík

Department of Computer Science, University of California, Berkeley
{cgjones,rfl,lmeyerov,krste,bodik}@cs.berkeley.edu

Abstract

We argue that the transition from laptops to handheld computers will happen only if we rethink the design of web browsers. Web browsers are an indispensable part of the end-user software stack, but they are too inefficient for handhelds. While the laptop reused the software stack of its desktop ancestor, solid-state device trends suggest that today's browser designs will not become sufficiently (1) responsive and (2) energy-efficient. We argue that browser improvements must go beyond JavaScript JIT compilation, and discuss how parallelism may help achieve these two goals. Motivated by a future browser-based application, we describe the preliminary design of our parallel browser, its work-efficient parallel algorithms, and an actor-based scripting language.

1 Browsers on Handheld Computers

Bell's Law of Computer Classes predicts that handheld computers will soon replace and exceed laptops. Indeed, internet-enabled phones have already eclipsed laptops in number, and may soon do so in input-output capability as well. The first phone with a pico-projector (Epoq) launched in 2008; wearable and foldable displays are in prototyping phases. Touch-screen keyboards and speech recognition have become widely adopted. Continuing Bell's prediction, phone sensors may enable applications not feasible on laptops.

Further evolution of handheld devices is limited mainly by their computing power. Constrained by the battery and heat dissipation, the mobile CPU noticeably impacts the web browser in particular: even with a fast wi-fi connection, the iPhone may take 20 seconds to load the Slashdot home page. Until mobile browsers become dramatically more responsive, they will continue to be used only when a laptop browser is not available. Without a fast browser, handhelds may not be able to support compelling web applications, such as Google Docs, that are sprouting in laptop browsers.

One may expect that mobile browsers are network-bound, but this is not the case. On a 2 Mbps network connection, soon expected on internet phones, the CPU bottleneck becomes visible: on a ThinkPad X61 laptop, halving the CPU frequency doubles the Firefox page load times for cached pages; with a cold cache, the load time is still slowed down by 50%. On an equivalent network connection, the iPhone browser is 5 to 10 times slower than Firefox on a fast laptop. The browser is CPU-bound because it is a compiler (for HTML), a page layout engine (for CSS), and an interpreter (for JavaScript); all three tasks are on the user's critical path.

The successively smaller form factors of previous computer generations (workstations, desktops, and then laptops) were enabled by exponential improvements in the performance of single-thread execution. Future CMOS generations are expected to increase the clock frequency only marginally, and handheld application developers have already adapted to the new reality: rather than developing their applications in the browser, as is the case on the laptop, they rely on native frameworks such as the iPhone SDK (Objective-C), Android (Java), or Symbian (C++). These frameworks offer higher performance, but do so at the cost of portability and programmer productivity.

2 An Efficient Web Browser

We want to redesign browsers in order to improve their (1) responsiveness and (2) energy efficiency. While our primary motivation is the handheld browser, most improvements will benefit laptop browsers equally. There are several ways to achieve the two goals:

• Offloading the computation. Used in the Deepfish and Skyfire mobile browsers, a server-side proxy browser renders a page and sends compressed images to a handheld. Offloading bulk computations, such as speech recognition, seems attractive, but the added server latency exceeds the 40 ms threshold for visual perception, making proxy architectures insufficient for interactive GUIs. Disconnected operation is inherently impossible.

• Removing the abstraction tax. Browser programmers pay for their increased productivity with an abstraction tax—the overheads of the page layout engine, the JavaScript interpreter, parsers for applications deployed in plain text, and other components. We measured this tax to be two orders of magnitude: a small Google Maps JavaScript application runs about 70 times slower than the equivalent written using C and pixel frame buffers [3]. Removing the abstraction tax is attractive because it improves both responsiveness (the program runs faster) and energy efficiency (the program performs less work). JavaScript overheads can be reduced with JIT compilation, typically 5 to 10 times, but browsers often spend only 15% of their time in the interpreter. Abstraction reduction remains attractive, but we leave it out of this paper.

• Parallelizing the browser. Future CMOS generations will not allow significantly faster clock rates, but they will be about 25% more energy-efficient per generation. The savings translate into additional cores, already observed in handhelds. Parallelizing the browser improves responsiveness (goal 1) and, to some extent, also energy efficiency (goal 2): vector instructions improve energy efficiency per operation, and although parallelization does not decrease the amount of work, gains in program acceleration may allow us to slow down the clock, further improving energy efficiency.

The rest of this paper discusses how one may go about parallelizing the web browser.

3 What Kind of Parallelism?

The performance of the browser is ultimately limited by energy constraints, and these constraints dictate optimal parallelization strategies. Ideally, we want to minimize energy consumption while being sufficiently responsive. This goal motivates the following questions:

1. Amount of parallelism: Do we decompose the browser into 10 or 1000 parallel computations? The answer depends primarily on the parallelism available in handheld processors. In five years, 1 W processors are expected to have about four cores. With two threads per core and 8-wide SIMD instructions, devices may thus support about 64-way parallelism.

The second consideration comes from voltage scaling. The energy efficiency of CMOS transistors increases when the frequency and supply voltage are decreased; parallelism accelerates the program to make up for the lower frequency. How much more parallelism can we get? There is a limit to CMOS scaling benefits: reducing the clock frequency beyond roughly 3 times below the peak no longer improves efficiency linearly [6]. In the 65 nm Intel Tera-scale processor, reducing the frequency from 4 GHz to 1 GHz reduces the energy per operation 4 times, while for the ARM9, reducing the supply voltage to near-threshold levels and running on 16 cores, rather than on one, reduces energy consumption 5 times [15]. It might be possible to design mobile processors that operate at the peak frequency allowed by the CMOS technology in short bursts (for example, the 10 ms needed to lay out a web page), scaling down the frequency for the remainder of the execution; such scaling may allow them to exploit additional parallelism.
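The voltage-scaling tradeoff follows from a standard CMOS energy model (a textbook approximation, not a formula from this paper): dynamic energy per operation scales roughly quadratically with the supply voltage, while the maximum frequency falls only about linearly with it,

```latex
E_{\text{op}} \propto C\,V_{dd}^{2}, \qquad
f_{\max} \propto V_{dd} \quad \text{(roughly, well above threshold)}
```

so running k times slower at a proportionally lower voltage cuts energy per operation by up to k², and k-way parallelism restores the lost throughput. Near the threshold voltage, the frequency collapses faster than linearly, which is why the efficiency gains eventually stop being linear [6].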

What are the implications of these trends for browser parallelization? Consider the goal of sustained 50-way parallel execution. A pertinent question is how much of the browser we can afford not to parallelize. Assume that the processor can execute 250 parallel operations. Amdahl's Law shows that to sustain 50-way parallelism, only 4% of the browser execution can remain sequential, with the rest running at 250-way parallelism. Obtaining this level of parallelism is challenging because, with the exception of media decoding, browser algorithms have been optimized for single-threaded execution. This paper suggests that it may be possible to uncover 250-way parallelism in the browser.

2. Type of parallelism: Should we exploit task parallelism, data parallelism, or both? Data-parallel architectures such as SIMD are efficient because their instruction delivery, which consumes about 50% of energy on superscalar processors, is amortized among the parallel operations. A vector accelerator has been shown to increase energy efficiency 10 times [9]. In this paper, we show that at least a part of the browser can be implemented in a data-parallel fashion.

3. Algorithms: Which parallel algorithms improve energy efficiency? Parallel algorithms that accelerate programs do not necessarily improve energy efficiency. The handheld calls for parallel algorithms that are work-efficient—i.e., they do not perform more total work than a sequential algorithm. An example of a work-inefficient algorithm is speculative parallelization that misspeculates too often. Work efficiency is a demanding requirement: for some "inherently sequential" problems, such as finite-state machines, only work-inefficient algorithms are known [5]. We show that careful speculation allows work-efficient parallelization of finite-state machines in lexical analysis.
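The Amdahl's Law arithmetic in question 1 above can be explored with a one-line model (a minimal sketch; the 250-way figure is the paper's assumption, and the exact accounting of "sustained parallelism" may differ):

```javascript
// Amdahl's Law: overall speedup when a fraction `seq` of the execution
// is sequential and the remainder runs with n-way parallelism.
function speedup(seq, n) {
  return 1 / (seq + (1 - seq) / n);
}

// Even a small sequential fraction sharply caps the sustained speedup
// that 250-way hardware can deliver.
for (const seq of [0, 0.01, 0.04, 0.1]) {
  console.log(seq, speedup(seq, 250).toFixed(1));
}
```

The model makes the sensitivity concrete: the achievable speedup degrades quickly as the sequential fraction grows, which is why so little of the browser can be left unparallelized.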

4 The Browser Anatomy

Original web browsers were designed to render hyperlinked documents. Later, JavaScript programs, embedded in the document, provided simple animated menus by dynamically modifying the document. Today, AJAX applications rival their desktop counterparts.

The typical browser architecture is shown in Figure 1. Loading an HTML page sets off a cascade of events: the page is scanned, parsed, and compiled into a document object model (DOM), an abstract syntax tree of the document. Content referenced by URLs is fetched and inlined into the DOM. As the content necessary to display the page becomes available, the page layout is (incrementally) solved and drawn to the screen. After the initial page load, scripts respond to events generated by user input and server messages, typically modifying the DOM. This may, in turn, cause the page layout to be recomputed and redrawn.


Figure 1: The architecture of today’s web browser.

5 Parallelizing the Frontend

The browser frontend compiles an HTML document into its object model, JavaScript programs into bytecode, and CSS style sheets into rules. These parsing tasks take more time than we can afford to execute sequentially: Internet Explorer 8 reportedly spends about 3–10% of its time in parsing [13]; Firefox, according to our measurements, spends up to 40% of its time in parsing. Task-level parallelization can be achieved by parsing downloaded files in parallel. Unfortunately, a page usually comprises a small number of large files, necessitating parallelization of single-file parsing. Pipelining of lexing and parsing may double the parallelism in HTML parsing; the lexical semantics of JavaScript, however, prevents the separation of lexing from parsing.

To parallelize single-file text processing, we have explored data parallelism in lexing. We designed the first work-efficient parallel algorithm for finite-state machines (FSMs), improving on the work-inefficient algorithm of Hillis and Steele [5] with novel algorithm-level speculation. The basic idea is to partition the input string among n processors. The problem is from which FSM state each processor should start scanning its string segment. Our empirical observation was that, in lexing, the automaton arrives at a stable state after a small number of characters, so it is often sufficient to prepend to each string segment a small suffix of its left neighbor [1]. Once the document has been so partitioned, we obtain n independent lexing tasks, which can be vectorized with the help of a gather operation that allows parallel reads from a lookup table. On the Cell processor, our algorithm scales perfectly at least to six cores.

To parallelize parsing [14], we have observed that old, simple algorithms are a more promising starting point because, unlike the LR parser family, they have not been optimized with sequential execution in mind.
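The speculative partitioning idea above can be simulated sequentially with a toy automaton (a minimal sketch: the automaton, input, and function names are hypothetical, and a real implementation would run the segments on separate cores and re-scan a segment if its warmed-up entry state turned out wrong):

```javascript
// A toy lexing automaton: 0 = whitespace/punctuation, 1 = in an
// identifier, 2 = in a number. The state is not memoryless (a digit
// continues an identifier but starts a number), yet it converges at
// every token boundary, so a short warm-up suffix predicts it well.
function step(state, ch) {
  if (/\s/.test(ch)) return 0;
  if (/[0-9]/.test(ch)) return state === 1 ? 1 : 2;
  if (/[A-Za-z_]/.test(ch)) return 1;
  return 0; // punctuation resets the state
}

function scan(state, text) {
  const states = [];
  for (const ch of text) {
    state = step(state, ch);
    states.push(state);
  }
  return states;
}

// Sequential simulation of the speculative parallel scheme: split the
// input into segments, and start each segment from the state reached
// by scanning a small suffix of its left neighbour.
function speculativeScan(text, segments, warmup) {
  const size = Math.ceil(text.length / segments);
  const out = [];
  for (let begin = 0; begin < text.length; begin += size) {
    const end = Math.min(text.length, begin + size);
    let state = 0; // arbitrary initial guess; the warm-up corrects it
    for (const ch of text.slice(Math.max(0, begin - warmup), begin)) {
      state = step(state, ch);
    }
    out.push(...scan(state, text.slice(begin, end)));
  }
  return out;
}
```

Each segment is an independent task; as the text notes, the per-character transition lookup can additionally be vectorized with a gather operation.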

6 Parallel Page Layout

A page is laid out in two steps. Together, these steps account for half of the execution time in IE8 [13]. In the first step, the browser associates DOM nodes with CSS style rules. A rule might state that if a paragraph is labeled important and is nested in a box labeled second-column, then the paragraph should use a red font. Abstractly, every rule is predicated with a regular expression; these expressions are matched against node names, which are paths from the node to the root. Often, there are thousands of nodes and thousands of rules. This step may take 100–200 ms on a laptop for a large page, and an order of magnitude longer on a handheld.

Matching a node against a rule is independent of other such matches. Per-node task partitioning thus exposes 1000-way parallelism. However, this algorithm is not work-efficient because tasks redo common work. We have explored caching algorithms and other sequential optimizations, achieving a 6-fold speedup vs. Firefox 3 on the Slashdot home page, and then gaining, from locality-aware task parallelism, another 5-fold speedup on 8 cores (with perfect scaling up to three cores). We are now exploring SIMD algorithms to simultaneously match a node against multiple rules, or vice versa.
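The per-node independence of the matching step can be sketched as follows (a simplified model: the rule syntax, path encoding, and names here are hypothetical illustrations of the regular-expression view described above):

```javascript
// Each rule is predicated with a regular expression over a node's name
// (its root-to-node path). Matching one node against the rule set is
// independent of every other node, so the per-node map below could be
// partitioned among cores, or vectorized with SIMD.
const rules = [
  { pattern: /second-column\/.*important$/, style: "color: red"  },
  { pattern: /^html\/body/,                 style: "margin: 8px" },
];

function matchNode(path) {
  return rules.filter(r => r.pattern.test(path)).map(r => r.style);
}

// In a parallel engine, this map is a pool of independent per-node
// tasks; the work-efficiency problem is that tasks redo common prefix
// work, which caching can amortize.
const styles = [
  "html/body/second-column/important",
  "html/body/first-column/p",
].map(matchNode);
```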

The second step lays out the page elements according to their styling rules. Like TeX, CSS uses flow layout. Flow layout rules are inductive in that an element's position is computed from how the preceding elements have been laid out. Layout engines thus perform (mostly) an in-order walk of the DOM. To parallelize this sequential traversal, we formalized a kernel of CSS as an attribute grammar and reformulated the layout into five tree-traversal passes, each of which permits parallel handling of its heavy leaf tasks (which may involve image or font computations). Some node styles (e.g., floats) do not permit breaking sequential dependencies; for these nodes, we utilize speculative parallelization. We evaluated this algorithm with a model that ascribes uniform computation times to document nodes for every phase; task parallelism (with work stealing) accelerated the baseline algorithm 6-fold on an 8-core machine.
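The flavour of such a multi-pass reformulation can be sketched with two passes (not the authors' exact five-pass attribute grammar; the node shape and pass structure here are hypothetical): a top-down pass fixes widths from the parent, and a bottom-up pass accumulates heights. Within one pass, sibling subtrees carry no dependences on each other and could be visited in parallel.

```javascript
// Top-down pass: a box takes its designer-fixed width if it has one,
// otherwise it fills the space its parent makes available.
function assignWidths(node, available) {
  node.width = node.fixedWidth ?? available;
  for (const child of node.children) assignWidths(child, node.width);
}

// Bottom-up pass: a box's height is its own content height plus the
// heights of its children. Sibling subtrees are independent here.
function computeHeight(node) {
  node.height = node.ownHeight +
    node.children.reduce((sum, c) => sum + computeHeight(c), 0);
  return node.height;
}

const page = {
  fixedWidth: null, ownHeight: 10,
  children: [
    { fixedWidth: 300,  ownHeight: 20, children: [] },
    { fixedWidth: null, ownHeight: 5,  children: [] },
  ],
};
assignWidths(page, 800);
computeHeight(page);
```

Real CSS adds the sequential complications the text names, such as floats, for which the authors fall back to speculation.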

7 Parallel Scripting

Here we discuss the rationale for our scripting language. We start with concurrency issues in JavaScript programming, and follow with a parallelization scenario that we expect to be beneficial. We then outline our actor language and discuss it briefly in the context of an application.

Concurrency. JavaScript offers a simple concurrency model: there are neither locks nor threads, and event handlers (callbacks) are executed atomically. If the DOM is changed by a callback, the page layout is recomputed before the execution handles the next event. This atomicity restriction is insufficient for preventing concurrency bugs. We have identified two sources of concurrency in browser programs—animations and interactions with web servers. Let us illustrate how this concurrency may lead to bugs in the context of GUI animations. Consider a mouse click that makes a window disappear with a shrinking animation effect. While the window is being animated, the user may keep interacting with it, for example by hovering over it, causing another animation effect on the window. The two animations may conflict and corrupt the window, for example by simultaneously modifying its dimensions. Ideally, the language should allow avoidance, or at least detection, of such "animation race" bugs.

Parallelism and shared state. To improve responsiveness of the browser, we need to execute atomic callbacks in parallel. To motivate such parallelization, consider programmatic animation of hundreds of elements. Such bulk animation is common in interactive data visualization; serializing these animations would reduce the frame rate.

The browser is allowed to execute callbacks in parallel as long as the observed execution appears to handle the events atomically. To define the commutativity of callbacks, we need to define shared state. Our preliminary design aggressively eliminates most shared state by exploiting properties of the browser domain. Actors can communicate only through message passing with copy semantics (i.e., pointers and closures cannot be sent). The shared state that may serialize callbacks comes in three forms. First, dependences among callbacks are induced by the DOM. A callback might resize a page component a, which may in turn lead the layout engine to resize a component b. A callback listening on changes to the size of b is executed in response. We discuss below how we plan to detect independence of callbacks.
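The copy-semantics restriction above can be sketched minimally (the Actor class and its API are hypothetical; the deep copy is a JSON round-trip for simplicity, which incidentally also cannot carry pointers or closures, matching the restriction in the text):

```javascript
// Actors share no state; a message is deep-copied on send, so a sender
// that mutates its copy afterwards cannot race with the receiver.
class Actor {
  constructor(handler) {
    this.mailbox = [];
    this.handler = handler;
  }
  send(msg) {
    // copy semantics: the receiver gets its own copy of the message
    this.mailbox.push(JSON.parse(JSON.stringify(msg)));
  }
  step() {
    // handle one message atomically, like a JavaScript callback
    if (this.mailbox.length) this.handler(this.mailbox.shift());
  }
}

const seen = [];
const view = new Actor(msg => seen.push(msg.x));
const msg = { x: 1 };
view.send(msg);
msg.x = 99;   // mutating after send is not observed by the receiver
view.step();  // seen is now [1], not [99]
```

Because no state is shared through the message itself, two such actors can run their atomic steps in parallel whenever they touch disjoint DOM subtrees and database entries.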

The second shared component is a local database with an interface that allows inexpensive detection of whether scripts depend on each other through accesses to the database. We have not yet converged on a suitable data naming scheme; a static hierarchical name space on a relational interface may be sufficient. Efficiently implementing general data structures on top of a relational interface appears as hard as optimizing SETL programs, but we hope that the web scripting domain comes with a few idioms for which we can specialize.

The third component is the server. If a server is allowed to reorder responses, as is the case with web servers today, it appears that it cannot be used to synchronize scripts, but concluding this definitively for our programming model requires more work.

The scripting language. Both parallelization and detection of concurrency bugs require analysis of dependences among callbacks. Analyzing JavaScript is, however, complicated by its programming model, which contains several "goto equivalents". First, data dependences among scripts in a web page are unclear because scripts can communicate indirectly through DOM nodes. The problem is that nodes are named with string values that may be computed at run time; the names convey no structure and need not even be unique. Second, the flow of control is similarly obscured: reactive programming is implemented with callback handlers; while it may be possible to analyze callback control flow with flow analysis, the short-running scripts may preclude the more expensive analyses common in Java VMs.

We illustrate JavaScript programming with an example. The following program draws a box that displays the current time and follows the mouse trajectory, delayed by 500 ms. The script references DOM elements with the string name "box"; the challenge for analysis is to prove that the script modifies the tree only locally. The example also shows the callbacks for the mouse event and for the timer event that draws the delayed box. These callbacks bind, rather obscurely, the mouse and the moving box; the challenge for the compiler is to determine which parts of the DOM tree are unaffected by the mouse, so that they can be handled in parallel with the moving box.

    <div id="box" style="position:absolute;">
      Time: <span id="time"> [[code not shown]] </span>
    </div>
    <script>
    document.addEventListener(
      'mousemove',
      function (e) {             // called whenever the mouse moves
        var left = e.pageX;
        var top = e.pageY;
        setTimeout(function () { // called 500ms later
          document.getElementById("box").style.top = top;
          document.getElementById("box").style.left = left;
        }, 500);
      }, false);
    </script>

To simplify the analysis of DOM-carried callback dependences, we need to make more explicit (1) the DOM node naming, so that the compiler can reason about the effects of callbacks on the DOM; and (2) the scope of DOM re-layout, so that the compiler understands whether resizing one DOM element might resize another element on which some callback depends. In either case, it may be sufficient to place a conservative upper bound on the DOM subtree that encapsulates the effect of the callback.

To identify DOM names at JIT compile time, we propose static, hierarchical naming of DOM elements. Such names would allow the JIT compiler to discover, already during parsing, whether two scripts operate on independent subtrees of the document; if so, the subtrees and their scripts can be placed on separate cores.
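With static, hierarchical names, the independence check itself becomes a prefix comparison (a minimal sketch; the path syntax and function name are hypothetical):

```javascript
// Two scripts whose subtree names diverge touch disjoint parts of the
// document and can be scheduled on separate cores; if one name is a
// prefix of the other, one subtree contains the other and the scripts
// may conflict.
function disjointSubtrees(a, b) {
  const pa = a.split("/"), pb = b.split("/");
  for (let i = 0; i < Math.min(pa.length, pb.length); i++) {
    if (pa[i] !== pb[i]) return true; // paths diverge: disjoint subtrees
  }
  return false; // one subtree contains (or equals) the other
}
```

The appeal of static names is that this check can run as soon as the page is parsed, before any script executes.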

To simplify the analysis of control flow among callbacks, we turn the flow of control into data flow, as in Max/MSP and LabVIEW. The sources of dataflow are events such as the mouse and server responses, and the sinks are the DOM nodes. Figure 2 shows the JavaScript program as an actor program. The actor program is freed from the low-level plumbing of DOM manipulation and callback setup. Unlike the Flapjax language [10], which inspired our design, we avoid mixing dataflow and imperative programming models at the same level; imperative programming will be encapsulated by actors that share no state (except for the DOM and a local database, as mentioned above).

Figure 2: The same program as an actor program.

To prove the absence of dependences carried by the layout process (see "Parallelism and shared state"), we will rely on a JIT compiler that understands the semantics of the page layout. The compiler will read the DOM and its styling rules, and will compute a bound on the scope of the incremental layout triggered by a change to a DOM element. For example, when the dimensions of a DOM element are fixed by the web page designer, its subtree cannot resize its parent and sibling elements. The layout-carried dependences are thus proven to be constrained to this subtree, which may in turn prove independence of callbacks, enabling their parallel execution.
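The bound described above can be sketched as a walk up the tree (the node shape and function name are hypothetical illustrations, not the authors' implementation):

```javascript
// Walk up from a changed element to the first ancestor whose dimensions
// are fixed by the designer: incremental re-layout cannot escape that
// ancestor's subtree, since the subtree cannot resize the ancestor's
// parent or siblings.
function layoutScope(node) {
  while (node.parent && !node.fixedDimensions) node = node.parent;
  return node;
}

const root    = { id: "root",    parent: null,    fixedDimensions: false };
const section = { id: "section", parent: root,    fixedDimensions: true  };
const box     = { id: "box",     parent: section, fixedDimensions: false };
// A change inside `box` is contained within `section`, so callbacks
// listening outside `section` are provably unaffected.
```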

A document as an actor network. In this section, we extend programming with actors into the layout process. So far, we used actors in scripts that were attached to DOM nodes (Figure 2). The DOM nodes were passive, however; their attributes were interpreted by a layout engine with fixed layout semantics. We now compute the layout with actors. Each DOM element will be an actor that collaborates with its neighbors on laying out the page. For example, an image nested in a paragraph sends the paragraph its pixels so that the paragraph can prepare its own pixel image. During layout computation, the image may influence the size of the paragraph, or vice versa, depending on styling rules.

What are the benefits of modeling scripts and layout uniformly? First, programmable layout rules will allow modifications to the CSS layout model. For example, an application can plug in a math layout system. Second, programmable layout will enable richer transforms, such as non-linear fish-eye zooms of video frames. This task is hard in today's browsers because scripts cannot access the pixel output of DOM nodes. Finally, the layout system will reuse the backend infrastructure for parallel script execution, simplifying the infrastructure.

A natural question is whether actors will lend themselves to an efficient implementation, especially in the demanding layout computation. We discuss this question in the context of a hypothetical application for viewing Olympic events, Mo'lympics, that we have been prototyping. An event consists of a video feed, basic information, and "rich commentary" made by others watching the same feed. Video segments can be recorded, annotated, and posted into the rich commentary. A rich commentary might be another video, an embedded web page, or an interactive visualization.

Our preliminary design appears to be implementable with actors that have limited interaction through explicit, strongly-typed message channels. The actors do not require shared state; surprisingly, message passing between actors is sufficient. Once DOM nodes are modeled as actors, it makes sense to distinguish between model actors and view actors. The model domain contains those script actors that establish the application logic. The view domain comprises the DOM actors and animation scripts.

In the model domain, messages are user-interface and network events, which are on the order of 1000 bytes. These can be transferred in around 10 clock cycles in today's STI Cell processor [11], and there are compiler optimizations to improve the locality of chatty actors. When larger messages must be communicated (such as video streams or source code), they should be handled by specialized browser services (a video decoder and a parser, respectively). It is the copy semantics of messages that allows this; such optimizations are well known [4].

In the view domain, actors appear to transfer large buffers of rendered display elements. However, this is a semantic notion; a view-domain compiler could eliminate this overhead given knowledge of linear usage. This is another reason for separating the model and view domains: specialized compilers can make more intelligent decisions.

We have discussed how static DOM naming and layout-aware DOM analysis allow analysis of callback dependences. We have elided the concurrency analysis and the parallelizing compiler built on top of this dependence analysis. There are many other interesting questions related to language design, such as how to support the development of domain abstractions, e.g., for the input devices that are likely to appear on handhelds.


8 Related Work

We draw linguistic inspiration from the Ptolemy project [2]; functional reactive programming, especially the Flapjax project [10]; and Max/MSP and LabVIEW. We share systems design ideas with the Singularity OS project [7] and the Viewpoints Research Institute [8].

Parallel algorithms for parsing are numerous, but mostly target natural-language parsing [14] or large data sets; we are not aware of efficient parallel parsers for computer languages, nor of parallel layout algorithms that fit the low-latency, small-task requirements of browsers.

The idea of handhelds as thick clients is not new [12]; current application platforms pursue it (Android, iPhone, Maemo). Others view handhelds as thin clients (Deepfish, Skyfire).

9 Summary

Web browsers could turn handheld computers into laptop replacements, but this vision poses new research challenges. We mentioned language and algorithmic design problems. There are additional challenges that we elided: scheduling independent tasks and providing quality-of-service guarantees are operating system problems; securing data and collaborating effectively are language, database, and operating system problems; working without network connectivity is also a language, database, and operating system problem; and more. We have made promising steps towards solutions, but much work remains.

Acknowledgements. Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding, and by matching funding from U.C. Discovery (Award #DIG07-10227). This material is also based upon work supported under a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] R. Bodik, C. G. Jones, R. Liu, L. Meyerovich, and K. Asanovic. Browsing web 3.0 on 3.0 watts, 2008. Accessed 2008/10/19.

[2] J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt. Ptolemy: A framework for simulating and prototyping heterogeneous systems, pages 527–543, 2002.

[3] J. Galenson, C. G. Jones, and J. Lo. Towards a Browser OS. CS262a course project, UC Berkeley, December 2008.

[4] A. Ghuloum. Ct: Channelling NESL and SISAL in C++. In CUFP '07: Proceedings of the 4th ACM SIGPLAN Workshop on Commercial Users of Functional Programming, pages 1–3, New York, NY, USA, 2007. ACM.

[5] W. D. Hillis and G. L. Steele, Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, 1986.

[6] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein. Scaling, power, and the future of CMOS. In IEEE International Electron Devices Meeting, December 2005.

[7] G. C. Hunt and J. R. Larus. Singularity: Rethinking the software stack. SIGOPS Operating Systems Review, 41(2):37–49, 2007.

[8] A. Kay, D. Ingalls, Y. Ohshima, I. Piumarta, and A. Raab. Steps toward the reinvention of programming. Technical report, National Science Foundation, 2006.

[9] C. Lemuet, J. Sampson, J. Francois, and N. Jouppi. The potential energy efficiency of vector acceleration. In SC '06, 2006.

[10] L. Meyerovich, M. Greenberg, G. Cooper, A. Bromfield, and S. Krishnamurthi. Flapjax, 2007. Accessed 2008/10/19.

[11] D. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, H. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille, S. Posluszny, M. Riley, D. Stasiak, M. Suzuoki, O. Takahashi, J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa. Overview of the architecture, circuit design, and physical implementation of a first-generation Cell processor. IEEE Journal of Solid-State Circuits, 41(1):179–196, January 2006.

[12] T. Starner. Thick clients for personal wireless devices. Computer, 35(1):133–135, 2002.

[13] C. Stockwell. What's coming in IE8, 2008. Accessed 2008/10/19.

[14] M. P. van Lohuizen. Survey of parallel context-free parsing techniques. Parallel and Distributed Systems Report Series PDS-1997-003, Delft University of Technology, 1997.

[15] B. Zhai, R. G. Dreslinski, D. Blaauw, T. Mudge, and D. Sylvester. Energy efficient near-threshold chip multi-processing. In ISLPED '07: Proceedings of the 2007 International Symposium on Low Power Electronics and Design, pages 32–37, New York, NY, USA, 2007. ACM.