
CERN Summer Student Program 2016

Evaluation of the Suitability of Alluxio for Hadoop Processing Frameworks

Author: Christopher Lawrie, University of Warwick, [email protected]

Supervisor: Prasanth Kothuri, CERN IT-DB, [email protected]

September 16, 2016


1 Introduction

1.1 What is Alluxio?

Alluxio [1] is an open source, memory-speed, virtual distributed storage platform. It sits between the storage layer and the processing-framework layer of a big data stack and claims to substantially improve performance when data must be read or written at high throughput, for example when a dataset is used by many jobs simultaneously. This report evaluates the viability of using Alluxio at CERN for Hadoop processing frameworks.

1.2 Hardware and Software Configuration

Throughout the project, a cluster of six nodes with the following configuration was used to test Alluxio:

Memory (per node): 7 GB

CPU (per node): 2.4GHz (4 cores)

OS: CentOS Linux release 7.2.1511 (Core)

Software: Cloudera Manager: 5.8.1, Spark: 1.6.0, Hadoop: 2.6.0-cdh5.7.1, Alluxio: 1.2.0.

2 Installation

It is assumed that one already has a cluster running Cloudera Manager, with all of the software listed above installed, on which to test Alluxio. To install Alluxio on a cluster with HDFS as the UnderFS (under storage filesystem), one must follow a combination of two guides [2, 3]. The "alluxio" user will be used to install Alluxio and should be able to run sudo commands. Where appropriate in the installation process, commands were run as the alluxio user.

2.1 Obtain Alluxio

On each node of the cluster one must download and extract Alluxio. Alluxio can be found in the downloads directory of its website [4]. I was careful to ensure that Alluxio was extracted into the home directory of the alluxio user (/opt/alluxio), and that this user had correct ownership of, and access to, all Alluxio files and directories.
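As a rough sketch, the per-node steps might look like the following; the exact tarball URL and name should be taken from the downloads page [4], and the alluxio user and group names follow this setup:

$ cd /opt/alluxio
$ wget <link-from-downloads-page>        # take the exact tarball URL from [4]
$ tar xzf alluxio-1.2.0*.tar.gz          # produces the alluxio-1.2.0 directory
$ sudo chown -R alluxio:alluxio alluxio-1.2.0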


2.2 Build Alluxio

Alluxio with HDFS as the underFS must be built against the correct version of Hadoop on each machine. First, change the Hadoop version in the "pom.xml" file located in the root of the Alluxio directory: find the "hadoop.version" tag and change it to the correct value. Then build Alluxio from the Alluxio home directory:

$ hadoop version
$ vim alluxio_home/pom.xml   # Change "hadoop.version" tag
$ cd alluxio_home && sudo mvn clean package -DskipTests
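For reference, with the CDH build used here (see Section 1.2) the tag would look roughly like the line below; the exact value is whatever "hadoop version" reports on the cluster, and building against a CDH release may also require the Cloudera Maven repository to be reachable:

$ vim alluxio_home/pom.xml
<hadoop.version>2.6.0-cdh5.7.1</hadoop.version>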

2.3 Quick Configuration

Alluxio can be configured on the master machine and these changes then copied over to the workers. To quickly configure Alluxio, run the following in the Alluxio directory of the master machine:

$ ./bin/alluxio bootstrapConf $HOSTNAME hdfs

On the master machine, in the alluxio-env.sh configuration file, list the address of the HDFS running on the cluster and specify the Alluxio mount point within this HDFS (the port that HDFS is running on can be found in Cloudera Manager as the value of "namenode port"). We must also specify the volume of memory to assign to Alluxio on each worker.

$ vim alluxio_home/conf/alluxio-env.sh
ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"hdfs://master_address:hdfs_port/alluxio_mount_point"}
ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"2000MB"}

For testing purposes, Alluxio was mounted to “/alluxio” on the HDFS.
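As a concrete sketch of these two lines, assuming a hypothetical master host master01.cern.ch and the default HDFS namenode port 8020 (the real values should be taken from Cloudera Manager as described above):

ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"hdfs://master01.cern.ch:8020/alluxio"}
ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"2000MB"}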

Again on the master machine, we must list the worker hostnames in the workers configuration file:
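A minimal sketch of this file, with hypothetical worker hostnames:

$ vim alluxio_home/conf/workers
worker01.cern.ch
worker02.cern.ch
worker03.cern.ch
worker04.cern.ch
worker05.cern.ch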

Finally, copy this configuration to each of the workers by running the following from the Alluxio home directory on the master machine:

$ ./bin/alluxio copyDir conf


Note that this will require passwordless SSH between the root user on the master and the root user on each of the workers, and between the alluxio user on the master and the alluxio user on each of the workers. For my particular setup I also had to disable the requiretty option on each machine.

2.4 Allow Alluxio to access the HDFS

Alluxio runs its filesystem commands as the root user, so we must give Alluxio operations full access to the HDFS directory where Alluxio is mounted. First, in Cloudera Manager, enable the configuration option "dfs.namenode.acls.enabled". Then run the following on the namenode:

$ sudo -u hdfs hadoop dfs -mkdir /alluxio
$ sudo -u hdfs hadoop dfs -chown -R root:supergroup /alluxio

2.5 Start Alluxio

We can add the Alluxio commands to our “PATH” environment variable:

$ export PATH=$PATH:alluxio_home/bin

Alluxio is now ready to be started. Run the following commands:

$ alluxio format
$ alluxio-start.sh all Mount

This formats the Alluxio mount point and starts the master and all workers. To check that Alluxio is running you can go to "http://master_hostname:19999", where you will find the Alluxio web UI.

Occasionally there can be discrepancies between files in the underFS and the Alluxio filesystem due to files not being correctly removed. This should not occur if permissions have been correctly configured. To resolve this issue one must often format the Alluxio filesystem.

2.6 Quick Boot with Shell Script

Best practice for tweaking Alluxio and rebooting is to make any changes on the master and then copy those changes over to the workers. A shell script can be written to quickly apply these changes and restart Alluxio.

$ cd /opt/alluxio/alluxio-1.2.0
$ alluxio copyDir conf
$ alluxio-stop.sh all
$ alluxio-start.sh all Mount
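Wrapped into an executable script, this might look like the following minimal sketch; it assumes the Alluxio bin directory is on the alluxio user's PATH (Section 2.5) and the script name is arbitrary:

$ cat restart-alluxio.sh
#!/bin/bash
# Sync the master's configuration to the workers, then restart all Alluxio daemons.
cd /opt/alluxio/alluxio-1.2.0
alluxio copyDir conf
alluxio-stop.sh all
alluxio-start.sh all Mount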


2.7 Alluxio and Spark Hostname Issue

There is a documented issue concerning how Spark and Alluxio interact with each other on nodes [5]. The issue is that Spark often uses IP addresses rather than hostnames to reference workers, whereas Alluxio uses hostnames. The easiest solution is to force Spark to use hostnames instead. This is achieved by including an extra line in the spark-env.sh config file on each machine in the Spark cluster:

$ vim spark_home/conf/spark-env.sh
SPARK_LOCAL_HOSTNAME=`hostname`

This change should be made in the Cloudera Manager web UI if the cluster is managed by Cloudera.

One can check that there is no issue between Spark and Alluxio by running a Spark job which uses a file in Alluxio (for example, run a count on a file in Alluxio) and then checking the locality level of the executions in the Spark job history web UI. If there is no issue it should always read "NODE_LOCAL", as opposed to something like "RACK_LOCAL".
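A minimal sketch of such a check, assuming a file called /test already exists in Alluxio, that master_address is the Alluxio master host, and that the Alluxio client jar is on the Spark classpath (see Section 3.5.3):

$ spark-shell --master yarn
scala> sc.textFile("alluxio://master_address:19998/test").count()

After the job completes, the locality level of its tasks can be inspected in the Spark history web UI.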

3 Alluxio Features

3.1 Further Configuration Options

Alluxio has many configuration options [6]. The two main configuration files are located at "alluxio_home/conf/alluxio-env.sh" and "alluxio_home/conf/alluxio-site.properties". Each of these files has a template which shows the settings that can be changed from within it.

3.1.1 Example: Round-Robin Scheduling

To change the scheduling system for writing files to worker memory from its default value, a line must be added to the "alluxio-site.properties" file:

$ vim alluxio_home/conf/alluxio-site.properties
alluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy

Upon restarting Alluxio and loading a file into memory, the file will be distributed across workers in a round-robin fashion.


3.1.2 Example: UnderFS Synchronisation

When writing files to Alluxio directly, we can choose how Alluxio and the UnderFS storage interact with each other. By default the parameter "alluxio.user.file.writetype.default" is set to "MUST_CACHE". With this value a file written directly to Alluxio will not be written to the underFS. We can change the value of this parameter to "CACHE_THROUGH".

$ vim alluxio_home/conf/alluxio-site.properties
alluxio.user.file.writetype.default=CACHE_THROUGH
$ cd alluxio_home && alluxio copyDir conf   # Sync changes.

Now when a file is written directly to Alluxio it is synchronously written to the underFS. Alluxio can also write files asynchronously to the underFS; this feature is in alpha but is enabled by setting "alluxio.user.file.writetype.default" to "ASYNC_THROUGH".

3.2 Command Line Interface

Alluxio can be used via a library of built-in command line tools which have a very similar structure to HDFS commands [7]. These tools work straight away after installing Alluxio. To see a full list of them with descriptions, run:

$ alluxio fs

3.2.1 Example: Loading a File into Alluxio

We can either load a file directly into Alluxio memory, or into the underFS out of memory but such that it is still visible to Alluxio.

To load directly into memory we can use:

$ alluxio fs copyFromLocal /path_to_local_file /alluxio_storage_path

We can see that the file is now visible to Alluxio and that it has been distributed across the worker memory. The writing of the file into Alluxio is done according to the "writetype.default" configuration.

To load the file out of memory we can use:

$ hadoop dfs -put /path_to_file /alluxio/alluxio_storage_path

The file is still visible to Alluxio, but it is not loaded into memory. It shows up the same as before in the web UI, except with 0% In-Memory. Note that we have to reference the Alluxio folder (where Alluxio is mounted) when loading via HDFS commands.

3.2.2 Example: Removing a File from Alluxio Memory

When a file is loaded in Alluxio memory we can either remove the file completely from Alluxio and the underFS, or just take the file out of memory and keep it in the underFS (assuming that it is persisted).

$ alluxio fs free /path_to_free   # Keep file within underFS.
$ alluxio fs rm -R /path_to_delete   # Remove file from underFS.

3.3 Unified Namespace

When files are persisted through to the underFS from Alluxio memory (e.g. when "CACHE_THROUGH" is set), all files and directories relative to the underFS mount point are consistent. For example, deleting or renaming a file via the Alluxio command line interface will perform the same operations on the respective underFS directories. This keeps the Alluxio filesystem and the underFS synchronised.

3.4 Web Interface

Alluxio has a nice web UI which can often be easier to use than the command line interface. To access it, go to "http://master_hostname:19999" in a web browser. It should be noted that running the correct version of Java when starting Alluxio is important for the web UI to work: I found that Java 1.8.0 caused the web UI to crash but 1.7.0 was fine.

3.4.1 Example: File Browsing

To view the files available to Alluxio in the web UI, click on the browse tab. Here you can see a number of useful pieces of information: all the files and directories visible to Alluxio, their sizes, the percentage of each stored in memory, and so on. Clicking on a file shows the locality of its individual blocks.


3.4.2 Example: Alluxio Configuration

The web UI also provides a very easy way to check that your changes to the Alluxio configuration have been made correctly; view this by clicking on the "Configuration" tab.

3.5 Filesystem API

Instead of accessing Alluxio via the command line, we can also use the Filesystem API from within Java and Scala. In the context of a Spark framework I have not found any use case where the filesystem API serves a better purpose than the CLI or Spark commands. A possible use is for more general distributed programs, written outside of a framework, which interface with Alluxio.

3.5.1 Java and Scala Compilation with Alluxio

To access the Alluxio classes from Java/Scala we need only add the prebuilt Alluxio jar to the classpath:

$ export CLASSPATH="$CLASSPATH:path_to_alluxio/core/client/target/alluxio-core-client-1.2.0-jar-with-dependencies.jar"

This jar was first generated when we built Alluxio with Maven during the installation process. Let us create a file called "test" stored in Alluxio memory with the filesystem API:

import alluxio.client.file.FileSystem
import alluxio.AlluxioURI

object test {
  def main(args: Array[String]) {
    val fs = FileSystem.Factory.get()
    val path = new AlluxioURI("/test")
    val out = fs.createFile(path)
    out.write(120)
    out.close()
  }
}

This creates a file called "test" in the root of the Alluxio storage directory, writes the letter "x" into it (ASCII code 120) and then closes it. To compile and execute the file we run the following:

$ scalac test.scala
$ scala test

An almost identical method is used for Java.

3.5.2 Scala Shell and Alluxio

The Scala shell can be useful to play around with Alluxio, instead of having to compile a file every time you want to do something. To access Alluxio with the Scala shell, simply have the path to the prebuilt Alluxio jar in your classpath (see the section above) and run the Scala shell. You must then import all necessary Alluxio classes in the shell in order to perform your Alluxio commands.

It is also worth noting that the Scala shell can be used to view all available Alluxio classes: just press tab after writing "import alluxio." in the shell.
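As a minimal sketch of such an interactive session, reusing the same calls as the compiled example above (the file name /test-shell is arbitrary):

$ export CLASSPATH="$CLASSPATH:path_to_alluxio/core/client/target/alluxio-core-client-1.2.0-jar-with-dependencies.jar"
$ scala
scala> import alluxio.client.file.FileSystem
scala> import alluxio.AlluxioURI
scala> val fs = FileSystem.Factory.get()
scala> val out = fs.createFile(new AlluxioURI("/test-shell"))
scala> out.write(120); out.close()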

3.5.3 Spark Shell and Alluxio

The Spark shell is also able to access the Alluxio filesystem. This is done in the same way as for the Scala shell: ensure that the Spark classpath contains the Alluxio jar and start the Spark shell:

$ export SPARK_CLASSPATH=".:path_to_alluxio/core/client/target/alluxio-core-client-1.2.0-jar-with-dependencies.jar"
$ spark-shell --master yarn

We can read/write files from/to Alluxio using Spark wrappers. Suppose we have a file called "test" saved in our Alluxio storage directory (or the HDFS /alluxio directory). We can load the file into Spark and count the number of lines in it:

val file = sc.textFile("alluxio://namenode_address:19998/test")
file.count

Or we can take a file in the HDFS and save it into Alluxio. Suppose our file "test" is in the root directory of our HDFS (where Alluxio is not mounted):

val file = sc.textFile("hdfs://namenode_address:hdfs_namenode_port/test")
file.saveAsTextFile("alluxio://namenode_address:19998/test")

3.6 Submit a Spark Job with Alluxio Using Maven

To submit an application to Spark which depends on Alluxio classes, we must include the Alluxio dependencies in our build tool (in this case Maven). Along with the standard Spark, Scala and HDFS dependencies for a normal Spark-Maven build, we add the following two dependencies, which can be found in the Maven repository, to the "pom.xml" file:

<dependency>
  <groupId>org.alluxio</groupId>
  <artifactId>alluxio-core-common</artifactId>
  <version>1.2.0</version>
</dependency>
<dependency>
  <groupId>org.alluxio</groupId>
  <artifactId>alluxio-core-client</artifactId>
  <version>1.2.0</version>
</dependency>

3.6.1 Example: Accessing Alluxio from Spark Via Two Methods

We can use the following code to check our maven project build:

package package_name

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import alluxio.client.file.FileSystem
import alluxio.AlluxioURI

object MainExample {
  def main(arg: Array[String]) {
    // Method 1 - filesystem API.
    val fs = FileSystem.Factory.get()
    val path = new AlluxioURI("/test")
    val out = fs.createFile(path)
    out.write(120)
    out.close()

    // Method 2 - Alluxio wrapper.
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val file = sc.textFile("alluxio://hostname.cern.ch:19998/test")
    file.saveAsTextFile("alluxio://hostname.cern.ch:19998/test2")
  }
}

Now we can package and submit the build:

$ mvn package
$ spark-submit --class com.path_to_code --master yarn target/build_name.jar

Assuming that we have the Spark classpath correctly set up (see the section above), we will have access to both the filesystem API and the Alluxio Spark wrapper. Note that the wrapper is provided by the prebuilt Alluxio jar on the Spark classpath, while the Alluxio filesystem API is provided by the dependencies in the Maven build. I have not been able to find out how to include the Alluxio-Spark wrapper dependencies in the "pom.xml" file, although this should be possible.

3.7 Tiered Storage

Tiered storage [8] is a very powerful feature of Alluxio which allows the capacity of Alluxio to overflow from memory into a set of other storage levels. This is useful for several reasons. Firstly, memory can be expensive, so designing a whole cluster with its entire primary data store in memory is impractical. Also, memory I/O speeds are so high that jobs often become throttled by processing times, so reducing these speeds by some margin will not cause a huge performance decrease.

This feature was tested by extending Alluxio to a second tier of SSD storage of 15GB per worker. This is done by adding lines to the "alluxio-site.properties" file on each worker.

$ vim alluxio_home/conf/alluxio-site.properties
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=${alluxio.worker.memory.size}
alluxio.worker.tieredstore.level0.reserved.ratio=0.1
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/ssd
alluxio.worker.tieredstore.level1.dirs.quota=30GB
alluxio.worker.tieredstore.level1.reserved.ratio=0.1
$ cd alluxio_home && alluxio copyDir conf   # Sync changes.

Now, in the Alluxio web UI, we see that the capacity of Alluxio has been greatly increased.

The MEM/SSD separation is completely internal to Alluxio; any program which interfaces with Alluxio will still work exactly as it did before. Internally, Alluxio has allocators and evictors which organise data into their respective tiers of storage. Allocators choose which level of storage, and which directory within it, data is written to on a worker. Evictors choose which blocks should change their storage level when there is a shortage of space in a tier that the allocator has decided to write to. The default allocators/evictors always prioritise maximising the usage of the lowest tier of storage (memory in this case).

I have found that these default allocators and evictors do not allow files to be optimally loaded into Alluxio. By default, new blocks are prioritised for memory over older blocks, so there is a constant cycle of blocks being written to memory and then pushed down to SSD storage. Custom allocators/evictors could, however, be written to improve this.

Files can also be pinned, to prevent them from being moved out of memory by evictors.

3.8 Metrics

Alluxio provides a wealth of metrics on jobs which have been performed in a running instance of the program. A full list of available metrics, along with supported sinks, is available online [9].

3.8.1 Example: Comparing Data Locality and Fault-Tolerance for Two Methods

We can track how well Alluxio, and the framework running on top, optimise data locality in jobs. The two main methods which can be used to load files into Alluxio memory are the CLI and the Spark saveAsTextFile command.

Writing a file to Alluxio via the saveAsTextFile command in Spark by default produces a non-replicated, unpersisted file which is in memory. This is because when the Spark shell loads and accesses Alluxio, it loads the Alluxio default configuration; the Spark shell has no knowledge of the extra configuration files for Alluxio.

When running a count on this file, no replication is enforced and so no data is written remotely to other nodes. The "Blocks Read Remotely" value does not increase from its initial value (which is non-zero due to the initial saveAsTextFile command used to write into Alluxio), whereas the "Blocks Read Locally" value does. I have run the count a number of times to show that the Blocks Read Remotely value stays small while the Blocks Read Locally value continues to increase. This hidden default behaviour within Spark means that no blocks can be copied from the HDFS to new nodes during processing.

This behaviour is consistent no matter how many executors or cores are running. However, fault-tolerance is not enabled: if a node is killed during the count command then the job fails. This is because there is no persistence to the underFS by default (MUST_CACHE enabled).

A second method for loading a file into memory is via the Alluxio CLI, which behaves according to the Alluxio configuration files.

This loads a non-replicated file into Alluxio which is persisted (clearly, as the file was already on the underFS). When running a count on this file, replication increases in memory. Because memory I/O is so fast, it is often faster to copy a file from the HDFS to a new node than to wait for another node to finish its current task and start the new one. When running the count with 5 executors and 3 cores per executor, the replication was kept down to a factor of two.

However, when running with 5 executors and 2 cores per executor, replication increased as more files were copied while waiting for executors to finish (they have fewer cores).


Fault tolerance is also enabled now. Note that this is not due to the replication in memory; it is due to the persistence in the underFS. Spark reports that an executor has failed but the job finishes: a node in the cluster was killed and the blocks contained on this node were redistributed to new nodes from the underFS.

It appears that persistence to the underFS dictates the fault-tolerance of Spark jobs, and that there is some other hidden behaviour, which I have not been able to pin down, which dictates where blocks are replicated across workers in memory to increase job performance by reducing executor wait times. This behaviour could also come from Spark rather than Alluxio.

3.9 Controlling Alluxio Behaviour from within Spark

The previous section has shown that it is the persistence to the underFS from Alluxio which enables fault-tolerance. It has also been noted that the default behaviour of Spark with Alluxio is to run Alluxio with its out-of-the-box configuration; however, this can be changed [10].

To change an Alluxio configuration parameter from its default value in Spark, one must pass it as an additional Java option. This is done by adding a new line to the "spark-defaults.conf" file:

$ vim spark_home/conf/spark-defaults.conf
spark.executor.extraJavaOptions=-Dalluxio_parameter=value

For example, to enable fault tolerance on the Alluxio files created by the saveAsTextFile command, I must enforce persistence to the underFS. In Cloudera Manager I add the corresponding line, shown below, to the "spark-conf/spark-defaults.conf client config safety valve" configuration option, and Cloudera then propagates this change across all node configurations.
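A sketch of the line in question, built from the writetype property described in Section 3.1.2 (the exact parameter to pass depends on the behaviour being changed):

spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH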


Now files are persisted when written using saveAsTextFile in Spark. We may use this method for any of the Alluxio configuration options.

3.10 Lineage API

Alluxio has its own fault-tolerance solution which is independent of the framework used [11]. Since it is currently in the alpha stages of development, and Spark already provides fault-tolerance which is compatible with Alluxio, this has not been investigated further.

4 Testing

4.1 Tiered Storage vs. Standard CERN Cluster

Much distributed computing at CERN is done on HDD clusters, which are inevitably slow and throttled by the data storage I/O speeds. We therefore tested the viability of a new cluster architecture for processing, with a core memory-speed data store and a secondary SSD data store. Each node had 3000MB of memory storage and 30GB of SSD storage. 100.69GB of pagecount data from Wikipedia was loaded into Alluxio.

A standard count was performed on these files. This was done in Spark with 5 executors, 1GB per executor and 1GB for the ApplicationMaster.


Clearly the MEM + SSD cluster is faster; however, in this case it was only twice as fast. We would expect a greater improvement than this. It is suspected this is due to only having 4 cores per node, so the jobs were processing-bound. Alluxio should utilise the full SSD (and memory) throughput given more cores to test with.

The implementation of the tiered storage architecture was incredibly easy. Configuration takes no more than a few command lines (assuming you have the hardware ready to go; see Section 3.7). Changing the code from running the test on Alluxio to running it on the HDFS was just a matter of referencing "hdfs://hostname:hdfs_port/path" vs. "alluxio://hostname:alluxio_port/path".
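As a minimal sketch of that switch in the Spark shell (the hostname, ports and dataset path are hypothetical):

// Same job, different storage URI.
val hdfsData = sc.textFile("hdfs://hostname:8020/pagecounts")
val alluxioData = sc.textFile("alluxio://hostname:19998/pagecounts")
hdfsData.count()
alluxioData.count()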

Tiered storage is an incredibly powerful feature which can make Alluxio a very viable solution for high-capacity distributed storage. It allows a mixture of high-throughput and cheaper hardware to store data in a distributed manner, giving control over the trade-off between hardware speed and practicality. For more information see [12].

4.2 Comparison to HDFS Caching

HDFS caching is an extension to HDFS which allows files to be cached in a memory pool reserved for the storage system. Alluxio and HDFS caching are both very similar solutions to in-memory distributed storage; however, Alluxio improves upon HDFS caching in a number of ways.

Both filesystems are referenced from within Spark similarly:

val file = sc.textFile("hdfs://hostname:8020/fileName")
val file2 = sc.textFile("alluxio://hostname:19998/fileName")

Alluxio files are usually loaded into memory via the command line:

$ alluxio fs copyFromLocal /fileName


Or by the Spark saveAsTextFile command. HDFS caching requires a cache pool to be created, then a directive added to the pool:

$ hadoop cacheadmin -addPool <name>
$ hadoop cacheadmin -addDirective -path <path> -pool <pool-name>

Both tools require the user to preallocate a volume of memory to be used for caching. Both allow full control over which files are cached into memory, and both allow the user to see exactly which files are cached and how much of each is cached.

To compare the speeds of the two solutions, a 13GB text file was loaded into memory and a filter was performed. The file was small enough to ensure that the whole thing could be stored in memory on the testing cluster. The same resources were given to Spark as in the tiered storage example. For completeness, the task was also performed on a standard HDD HDFS filesystem.

The key results to take from these tests are that:

• HDFS caching and Alluxio caching are both much faster than a standard HDFS HDD cluster.

• On run 1, Alluxio caching is much faster than HDFS caching. Frequent one-time access from multiple jobs to a file could be a common use case, so this is potentially significant.

• Times agree upon a second run and show memory-speed throughput.

Alluxio improves on HDFS caching in two main ways: it has a fully controllable tiered storage feature, which has been described previously, and it allows the writing of files directly into memory. Thus, in general, Alluxio is a better solution than HDFS caching for in-memory distributed data storage.


4.3 Comparison to Spark Caching

There are not many similarities between Spark caching and Alluxio caching. Spark caching (and persistence in general) is intended to be used to checkpoint data which will be frequently accessed within a job's lifecycle. The cached RDD cannot persist across Spark jobs and is obviously only accessible from Spark. Furthermore, an action has to be performed on the RDD to cache it initially, so there is no ability to write directly to memory.
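As a sketch of the difference from the Spark shell (hostnames, paths and the filter predicate are hypothetical):

// Spark caching: an action is needed to materialise the cache,
// and the cached RDD is visible only to this application.
val rdd = sc.textFile("hdfs://hostname:8020/bigFile").cache()
rdd.filter(_.contains("keyword")).count()

// Alluxio: the file sits in memory independently of any one Spark job
// and can be read by other jobs and frameworks.
val shared = sc.textFile("alluxio://hostname:19998/bigFile")
shared.filter(_.contains("keyword")).count()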

I found it very difficult to control how much memory was preallocated to caching in Spark; however, the documentation seems to suggest that this is possible. When a file is only partially cached in Spark, performance is actually significantly reduced and can be inconsistent. The 13GB text file test from above was also run with Spark caching where the file is not fully cached.

To test Spark caching on a file which could be fully cached, I used a 3300MB text file stored in the HDFS. I tested Spark caching in a Spark shell with 21GB of resources: 1GB for the application manager and around 20GB for the executors.

I ran a simple filter job 40 times and recorded the time taken to complete it. This was compared with the same file cached in Alluxio via the Spark saveAsTextFile command, and uncached in the HDFS. The same resources were given to the Spark shell each time.


The difference in speeds between Alluxio and Spark caching lies in the overhead required for Spark to connect to the Alluxio filesystem. This effect is very pronounced in this test because we run a filter on a small file many times, so Spark connects to the filesystem many times. If we could compare much larger files in memory then this overhead should be far less obvious; unfortunately the test environment available does not allow this.

In the previous test I could have loaded the file into Alluxio outside of Spark.

$ alluxio fs load /bigFile

The behaviour of this loading method differs in a couple of ways. Firstly, given enough space in memory, there will be replication of blocks in Alluxio memory, which I believe is to minimise time wasted waiting for executors to finish certain processes. I allocated Alluxio more memory for this option so that there was enough memory for this replication to occur.

Secondly, this file is automatically persisted to the underFS, whereas the previous method produced a file only present in memory. This behaviour is due to Spark loading the Alluxio default values unless specified otherwise (see Section 3.9). Persisted files in Alluxio allow Spark to operate in a fault-tolerant manner. The job times were essentially the same as with the previous method using Alluxio.


One of the major advantages of using Alluxio is that cached data can be accessed by multiple jobs. In the first example, where a file was stored in Alluxio using Spark, I can access this data from another application. Access from a different framework or job should not add any overhead to processing times.

5 Further Remarks

While Alluxio may provide a performance increase and further features on top of alternative solutions, there are some issues which should be mentioned.

The Alluxio documentation is currently poor, so the learning curve for the tool is steep. Uptake also appears to be slow, so finding useful tutorials online is difficult.

While the GitHub repository for the software is available online, understanding the inner workings of Alluxio is still difficult. For example, I have not been able to completely understand the replication behaviour of the saveAsTextFile command from Spark to Alluxio (although this may be Spark behaviour rather than Alluxio behaviour).

That being said, the software is still new and development is still active; many of these issues may disappear with time. Also, once some time is spent with the software and a certain level of understanding has been achieved, it becomes very easy to use.

Finally, it is worth noting that Alluxio supports being managed by YARN and also has some security options which are compatible with HDFS security. These features have not been investigated.

6 Conclusions

When looking at solutions for in-memory distributed data storage, Alluxio clearly has the most potential and its features appear the most powerful. HDFS caching does not work nearly as well, and Spark caching is not a cross-platform solution at all. So, for implementing a cluster utilising in-memory storage, Alluxio should be the clear choice.

However, Alluxio is still in its very early stages and, as described above, it does not appear ready to be implemented as a large-scale solution. There are many issues still to be fixed and many features are still in alpha.

If Alluxio continues to be developed, and its shortcomings addressed, then it has the potential to be very beneficial for a number of CERN-related use cases.

References

[1] Alluxio Homepage, http://www.alluxio.org/, 04/08/2016.

[2] Alluxio Installation with HDFS UnderFS, http://www.alluxio.org/docs/master/en/Configuring-Alluxio-with-HDFS.html, 04/08/2016.

[3] Alluxio Installation with a Cluster, http://www.alluxio.org/docs/master/en/Running-Alluxio-on-a-Cluster.html, 04/08/2016.

[4] Alluxio download directory, http://alluxio.org/downloads/files/, 08/08/2016.

[5] Alluxio-Spark Hostname Issue, https://issues.apache.org/jira/browse/SPARK-10149, 31/08/2016.

[6] Alluxio configuration options, http://www.alluxio.org/docs/master/en/Configuration-Settings.html, 09/08/2016.

[7] Alluxio command line interface, http://www.alluxio.org/docs/master/en/Command-Line-Interface.html, 09/08/2016.

[8] Alluxio Tiered Storage, http://www.alluxio.org/docs/master/en/Tiered-Storage-on-Alluxio.html, 31/08/2016.

[9] Alluxio metrics, http://www.alluxio.org/docs/master/en/Metrics-System.html, 11/08/2016.

[10] Accelerating On-Demand Data Analytics with Alluxio, http://www.alluxio.com/assets/uploads/2016/08/Accelerating_OnDemand_Data_Analytics_w_Alluxio.pdf, 15/09/2016.

[11] Alluxio Lineage API, http://www.alluxio.org/docs/master/en/Lineage-API.html, 31/08/2016.

[12] Tiered Storage in Alluxio, https://db-blog.web.cern.ch/blog/christopher-lawrie/2016-09-using-tiered-storage-alluxio, 31/08/2016.

[13] Experience of Using Alluxio, https://db-blog.web.cern.ch/blog/christopher-lawrie/2016-08-experiences-using-alluxio-spark, 31/08/2016.
