data sci guide python-meetup - amazon web...
TRANSCRIPT
Spark Deployment
Configuring a cluster
Spark Deployment

Local Mode
• Single threaded: SparkContext('local')
• Multi-threaded: SparkContext('local[4]')
• Pseudo-distributed cluster

Cluster Mode
• Standalone
• Mesos
• YARN
• Amazon EC2
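The mode is selected by the master URL handed to SparkContext. A small stdlib sketch of the common URL formats; the host names and ports are placeholders, and the yarn-client string reflects the Spark 1.x era of this guide:

```python
# Master URL formats recognized by SparkContext (hosts/ports are placeholders).
MASTER_URLS = {
    "local (single-threaded)": "local",
    "local (4 worker threads)": "local[4]",
    "standalone": "spark://master-host:7077",
    "mesos": "mesos://mesos-host:5050",
    "yarn": "yarn-client",  # or "yarn-cluster" for driver-on-cluster
}

def is_local(master):
    """True when the master URL runs Spark inside a single local JVM."""
    return master == "local" or master.startswith("local[")

for mode, url in sorted(MASTER_URLS.items()):
    print("%-28s %s" % (mode, url))
```

Anything that is not a local URL requires one of the cluster managers below.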
Spark Deployment: Local

Mode                        Advantage
Single threaded             Sequential execution allows easier debugging of program logic
Multi-threaded              Concurrent execution leverages parallelism and allows debugging of coordination
Pseudo-distributed cluster  Distributed execution allows debugging of communication and I/O
Standalone
• Packaged with Spark core
• Great if all you need is a dedicated Spark cluster
• Doesn’t support integration with any other applications on a cluster.
The Standalone cluster manager also has a high-availability mode that can leverage Apache ZooKeeper to enable standby master nodes.
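To illustrate, the ZooKeeper-backed high-availability mode is enabled through `SPARK_DAEMON_JAVA_OPTS` in spark-env.sh on each master; a sketch, assuming a ZooKeeper ensemble at zk1:2181,zk2:2181 (placeholder hosts):

```shell
# spark-env.sh on each standby-capable master (ZooKeeper hosts are placeholders)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```

With this set, any master that loses the ZooKeeper leader election simply waits in standby.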
Mesos
• General purpose cluster and global resource manager (Spark, Hadoop, MPI, Cassandra, etc.)
• Two-level scheduler: enables pluggable scheduler algorithms
• Multiple applications can co-locate (like an operating system for a cluster)
YARN
• Created to scale Hadoop, optimized for Hadoop (stateless batch jobs with long runtimes)
• Monolithic scheduler: manages cluster resources as well as schedules jobs
• Not well suited for long-running, real-time, or stateful/interactive services (like database queries)
EC2
• Launch scripts bundled with Spark
• Elastic and ephemeral cluster
• Sets up:
  • Spark
  • HDFS
  • Hadoop MR
Spark Deployment: Cluster

Mode        Advantage
Standalone  Encapsulated cluster manager isolates complexity
Mesos       Global resource manager facilitates multi-tenant and heterogeneous workloads
YARN        Integrates with existing Hadoop cluster and applications
EC2         Elastic scalability and ease of setup
EC2: Setup
1. Create AWS account: https://aws.amazon.com/account/
2. Get Access keys:
EC2: Setup

1. Include the following lines in your ~/.bash_profile (or ~/.bashrc):

export AWS_ACCESS_KEY_ID=xxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxx

2. Download EC2 keypair:
EC2: Launch

1. Launch the EC2 cluster with the script in $SPARK_HOME/ec2:

./spark-ec2 -k keyname -i ~/.ssh/keyname.pem --copy-aws-credentials --instance-type=m1.large -m m1.large -s 19 launch spark
EC2: Scripts

Login to the master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Stop cluster:
./spark-ec2 stop spark

Terminate cluster:
./spark-ec2 destroy spark

Restart cluster (after stopped):
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem start spark
Setup IPython/Jupyter

Login to the master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Install needed packages (on master):
# Install all the necessary packages on Master
yum install -y tmux
yum install -y pssh
yum install -y python27 python27-devel
yum install -y freetype-devel libpng-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
easy_install-2.7 pip
easy_install py4j
pip2.7 install ipython==2.0.0
pip2.7 install pyzmq==14.6.0
pip2.7 install jinja2==2.7.3
pip2.7 install tornado==4.2
pip2.7 install numpy
pip2.7 install matplotlib
pip2.7 install nltk
Setup IPython/Jupyter

Login to the master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Install needed packages (on workers):
# Install all the necessary packages on Workers
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pssh -t 10000 -h /root/spark-ec2/slaves pip2.7 install numpy
pssh -h /root/spark-ec2/slaves pip2.7 install nltk
Allow inbound requests to the notebook port (8888) in the master's security group so the IPython/Jupyter notebook is reachable. (WARNING: opening this port to the internet creates a security risk.)
IPython/Jupyter Profile

Login to the master:
./spark-ec2 -k keyname -i ~/.ssh/keyname.pem login spark

Set notebook password:

ipython profile create default
python -c "from IPython.lib import passwd; print passwd()" > /root/.ipython/profile_default/nbpasswd.txt
cat /root/.ipython/profile_default/nbpasswd.txt
# sha1:128de302ca73:6b9a8bd5bhjde33d48cd65ad9cafb0770c13c9df
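What lands in nbpasswd.txt is a salted hash in the form algorithm:salt:hexdigest. A minimal sketch of that format using only hashlib; this mirrors what IPython.lib.passwd produces, but in practice use the real function as shown above:

```python
import hashlib
import random

def notebook_passwd(passphrase, algorithm="sha1", salt_len=12):
    """Sketch of the salted hash IPython stores (format: algo:salt:hexdigest)."""
    salt = ("%0" + str(salt_len) + "x") % random.getrandbits(4 * salt_len)
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return "%s:%s:%s" % (algorithm, salt, h.hexdigest())

def verify(passphrase, hashed):
    """Re-hash the passphrase with the stored salt and compare digests."""
    algorithm, salt, digest = hashed.split(":")
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return h.hexdigest() == digest

hashed = notebook_passwd("secret")
print(hashed)  # e.g. sha1:1b58d3f0a2c4:...
```

Because only the hash is stored, the notebook server can verify the password at login without keeping the plaintext anywhere on disk.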
Configure IPython/Jupyter Settings
/root/.ipython/profile_default/ipython_notebook_config.py:
# Configuration file for ipython-notebook.
c = get_config()
# Notebook config
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 8888

PWDFILE = "/root/.ipython/profile_default/nbpasswd.txt"
c.NotebookApp.password = open(PWDFILE).read().strip()
Configure IPython/Jupyter Settings
/root/.ipython/profile_default/startup/pyspark.py:
# Configure the necessary Spark environment
import os
os.environ['SPARK_HOME'] = '/root/spark/'

# And Python path
import sys
sys.path.insert(0, '/root/spark/python')

# Detect the PySpark URL
CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()
Configure IPython/Jupyter Settings
Add the following to /root/spark/conf/spark-env.sh:
export PYSPARK_PYTHON=python2.7
Sync across workers:
~/spark-ec2/copy-dir /root/spark/conf

Make sure the master's environment is correct:
source /root/spark/conf/spark-env.sh
IPython/Jupyter initialization (on master)
Start remote window manager (screen or tmux):
screen

Start notebook server:
ipython notebook

Detach from session:
Ctrl-a d
IPython/Jupyter login
On your laptop, open:
http://[YOUR MASTER IP/DNS HERE]:8888
IPython/Jupyter login
Test that it all works:
EC2: Data

Storage            Location
HDFS (ephemeral)   /root/ephemeral-hdfs
Amazon S3          s3n://bucket_name
HDFS (persistent)  /root/persistent-hdfs
Review

• Local mode and cluster mode each have their benefits
• Spark can be run on a variety of cluster managers
• Amazon EC2 enables elastic scaling and ease of development
• By leveraging IPython/Jupyter you can get the performance of a cluster with the ease of interactive development