spark meet up august 2014 public

1/27

Spark at eBay - Troubleshooting the everyday issues

Aug. 6, 2014

Seattle Spark Meetup

Don Watters - Sr. Manager of Architecture, eBay Inc.

Suzanne Monthofer - Solutions Architect, eBay Inc.


2/27

Agenda

eBay Overview

Spark Motivation

Use Cases At eBay

Troubleshooting the everyday issues


3/27

eBay Overview

> 50 thousand categories of products

> 200 million items listed for sale on the site

Average retailer has thousands of products


4/27

PLATFORM


5/27

Data @ eBay

>50 TB/day new data

>100 PB/day

>100 Trillion pairs of information

Millions of queries/day

>6000 business users & analysts

>50k chains of logic

24x7x365

99.98+% Availability

turning over a TB every second

Active/Active

Near-Real-time

>100k data elements

Always online

Processed


6/27

Spark Motivation

Great Promise!

Fits our pattern well

Iterative approach possible, like SQL


7/27


8/27

Agenda

Use Cases At eBay


9/27

eBay Transformer = More Data


10/27

Agenda

Troubleshooting the everyday issues


11/27

Tools and Skill sets

JIRA issue tracking: internal and Apache

GitHub repository: source version control, documentation (.md)

Compilation/dependencies - Maven: jar dependencies

Java: versioning, debugging stack traces, environments, multiple JDK/JREs, compatibility errors

POSIX OS: environment variables, directory structures, permissions, shell scripting

HDFS, Hadoop queues, formats, compression

Yarn/Mesos: environments, debugging, logs, killing

JIRA internal wikis: global internal collaboration

User groups, internal DLs, platform support teams, informal emails

Ability to decipher Java stack traces

Stack Overflow, Googling, indirect clues

Scrappiness: when dwarfed by a challenge, compensating for seeming inadequacies through will, persistence and heart


12/27

Most Common Question: Yarn ShellException

(GiraphApplicationMaster.java:onContainersCompleted(574)) - Got container status for containerID=container_1392317581183_0245_01_000003, state=COMPLETE, exitStatus=1, diagnostics=Exception from container

org.apache.hadoop.util.Shell$ExitCodeException:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:252
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:28
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

Means an error occurred in the Yarn container: need to search for the Java stack trace deeper in the Yarn logs
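
To search the Yarn logs you first need the application ID, and the diagnostics line only carries a container ID. The sketch below derives one from the other. It assumes the container ID layout seen in the trace above (container_<cluster timestamp>_<application sequence>_<attempt>_<container sequence>); the helper name `application_id_for` is hypothetical, and container IDs in other formats will simply not match.

```python
import re

# Assumption: container IDs look like the one in the trace above:
# container_<clusterTimestamp>_<appSequence>_<attempt>_<containerSequence>
CONTAINER_ID = re.compile(r"container_(\d+)_(\d+)_\d+_\d+")


def application_id_for(text):
    """Return the application ID implied by the first container ID in text, or None."""
    match = CONTAINER_ID.search(text)
    if match is None:
        return None
    cluster_timestamp, app_sequence = match.groups()
    return f"application_{cluster_timestamp}_{app_sequence}"


diagnostic = (
    "Got container status for "
    "containerID=container_1392317581183_0245_01_000003, "
    "state=COMPLETE, exitStatus=1"
)
print(application_id_for(diagnostic))  # application_1392317581183_0245
```

The resulting ID is what `yarn logs -applicationId` expects.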


13/27

Killing Yarn Jobs and Viewing Yarn Logs

Logs and status in many places:

Hadoop console (transient: disappear after job done)

Aggregated Yarn logs: not available until job finishes or is killed

Execution shell: only very high-level status

Killing: Ctrl-C, then
/apache/hadoop/bin/yarn application -kill application_1392973982912_7321

Viewing Logs:
/apache/hadoop/bin/yarn logs -applicationId application_1392973982912_7321

Sifting to find text Exception, Memory, etc.:
| grep Exception -5
| grep Memory -5

Would like easier debugging and exiting on errors

May look at log4j appenders
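
The sifting step can also be scripted when several keywords need checking in one pass. A minimal sketch that mimics `grep <keyword> -5` (five lines of context on each side) over log lines already held in memory; the function name `grep_context` and the sample log lines are hypothetical, not taken from a real cluster.

```python
def grep_context(lines, keyword, context=5):
    """Return (line_number, line) pairs within `context` lines of any match."""
    lines = list(lines)
    keep = set()
    for index, line in enumerate(lines):
        if keyword in line:
            first = max(0, index - context)
            last = min(len(lines), index + context + 1)
            keep.update(range(first, last))
    # Line numbers are 1-based, like grep -n.
    return [(index + 1, lines[index]) for index in sorted(keep)]


# Hypothetical log excerpt, for illustration only.
sample_log = [
    "INFO starting task 12",
    "INFO reading split 3",
    "ERROR java.lang.OutOfMemoryError: Java heap space",
    "INFO shutting down executor",
    "INFO container released",
]

for number, line in grep_context(sample_log, "Memory", context=1):
    print(number, line)
```

In practice the lines would come from the output of `yarn logs -applicationId ...` saved to a file.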


14/27

Biggest Challenge: Resource Allocation/Capacity Scheduling

Users must request needed resources

Long-running jobs hang without releasing resources and must be killed manually

Created a dedicated Spark queue: still not equitable

Capacity allocation prioritization is complex

Spark shell hangs on to memory

Many users deciding to wait for better stability and better guarantee of resource availability and job completion

Yarn vs. Mesos debate?


15/27

Tuning Spark: Hanging Jobs and Out-of-Memory Errors

spark.default.parallelism - # requested Yarn containers

spark.executor.memory - ~75-90% of requested Yarn container memory size

spark.storage.memoryFraction - lower from default 0.6 to ~0.2 (if you are not pinning a significant amount of data)

Remove outliers from dataset (dual-pass with larger entities)

Use primitive data types: avoid Strings

Use Kryo serialization

App UI at localhost:4040 (disabled on our cluster)

Need to understand inner workings of Spark

Community working to reduce the amount of configuration needed

Alex Rubinsteyn blog post: Spark should be better than MapReduce (if only it worked)

http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html?m=1

Patrick Wendell's talk on performance at Spark Summit 2013:

https://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/

Tuning Guide: https://spark.apache.org/docs/latest/tuning.html
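
The sizing rules of thumb above can be turned into a small calculation. This is only a sketch of those rules, not a recommended configuration: the function name `spark_settings`, its parameters, and the 80% default are assumptions made for illustration, and the property names are the ones listed on this slide.

```python
def spark_settings(num_containers, container_memory_mb, memory_percent=80, caching=False):
    """Build Spark properties from the slide's rules of thumb.

    memory_percent is the share of the Yarn container given to the executor;
    the slide suggests roughly 75-90%.
    """
    if not 75 <= memory_percent <= 90:
        raise ValueError("memory_percent should stay within the 75-90 range")
    executor_mb = container_memory_mb * memory_percent // 100
    return {
        # One partition per requested Yarn container.
        "spark.default.parallelism": str(num_containers),
        "spark.executor.memory": f"{executor_mb}m",
        # Keep the 0.6 default only when a significant amount of data is pinned.
        "spark.storage.memoryFraction": "0.6" if caching else "0.2",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    }


settings = spark_settings(num_containers=10, container_memory_mb=4096)
for key, value in settings.items():
    print(f"{key}={value}")
```

With ten 4096 MB containers and the assumed 80% share, the executor memory comes out at 3276m.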


16/27

Yarn Improvements Needed for Spark

Great talk by Sandy Ryza from Cloudera at Spark Summit 2014

https://www.youtube.com/watch?v=N6pJhxCPe-Y


17/27

Rapid Pace of Change


18/27

SPARK-1203

spark-shell on yarn-client race in properly getting hdfs delegation tokens - error on saveAsTextFile

Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authenticatio
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:62
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcSe
    ...
    at org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:920)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1336)
    at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:527)
    at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:505)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.j
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:202)

Burnt by bugs in snapshots during incubating phase

Check Spark JIRA issues https://issues.apache.org/jira/browse/SPARK/
https://spark-project.atlassian.net/browse/SPARK-1203


19/27

Apache Shark: Hive on Spark

NOW OBSOLETE

Google protobuf error (notorious): had to replace bundled jar

Caused by: java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)

Had to replace hadoop core/security jars with eBay jars

JDBC driver: mysql-connector-java-5.0.8-bin.jar

Got it working on single node: able to access/query existing hive tables

Couldn't use for extremely large tables/joins yet (need multi-node)

Requires JDK 1.7: couldn't run on multiple nodes in cluster (still 1.6)

./bin/shark-withinfo -skipRddReload to avoid a bad table error

Performance 2-5x better than Hive for 8M row table count query

Start Looking at Spark SQL!


20/27

Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072)
    at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49)
    at shark.SharkCliDriver.<init>(SharkCliDriver.scala:283)
    at shark.SharkCliDriver$.main(SharkCliDriver.scala:162)
    at shark.SharkCliDriver.main(SharkCliDriver.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:51)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299)
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070)
    ... 4 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1137)
    ... 9 more
Caused by: java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)


21/27

Shark Jar Incompatibilities

Caused by: KrbException: Server not found in Kerberos database (7)
    at sun.security.krb5.KrbTgsRep.<init>(Unknown Source)
    at sun.security.krb5.KrbTgsReq.getReply(Unknown Source)
    at sun.security.krb5.KrbTgsReq.sendAndGetCreds(Unknown Source)
    at sun.security.krb5.internal.CredentialsUtil.serviceCreds(Unknown Source)

14/05/07 17:49:58 ERROR security.UserGroupInformation: PriviledgedActionException as:[email protected] cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]

14/05/07 17:49:58 INFO security.UserGroupInformation: Initiating logout for [email protected]

14/05/07 17:49:58 INFO security.UserGroupInformation: Initiating re-login [email protected]

14/05/07 17:50:02 ERROR security.UserGroupInformation: PriviledgedActionException as:[email protected] cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]

14/05/07 17:50:02 WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.


22/27

Shark vs. Hive, Spark SQL vs. Shark: Big Data Benchmarks

https://amplab.cs.berkeley.edu/benchmark/

http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html


23/27

Compilation: Maven, sbt, ivy, ant

Maven/sbt/ivy/munge can be complex, finicky

[info] Resolving com.ebay.incdata.metis#metis-matching-engine;1.0-SNAPSHOT ...
[warn] module not found: com.ebay.incdata.metis#metis-matching-engine;1.0-SNAPSHOT
[warn] ==== local: tried
[warn] /Users/smonthofer/.ivy2/local/com.ebay.incdata.metis/metis-matching-engine/1.0-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] http://repo1.maven.org/maven2/com/ebay/incdata/metis/metis-matching-engine/1.0-SNAPSHOT/metis-matching-engine-1.0-SNAPSHOT.pom
[warn] ==== Local Maven Repository: tried
[warn] file:///var/root/.m2/repository/com/ebay/incdata/metis/metis-matching-engine/1.0-SNAPSHOT/metis-matching-engine-1.0-SNAPSHOT.pom

URI has an authority component
    at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:213)
    at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:122)
    at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:121)

[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::

java.net.MalformedURLException: no protocol: /Users/smonthofer/.m2/repository

build.sbt: resolvers += "Local Maven Repository" at "file:///Users/smonthofer/.m2/repository"

Needed 3 slashes (platform independence feature)!!! Grrrr
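
Why three slashes: in a file URI the text between `//` and the next `/` is the authority (host), so a local absolute path needs an empty authority followed by the path. The sketch below illustrates this with Python's standard `urllib.parse`; it is an illustration of URI parsing in general, not of how sbt or Ivy parse resolvers, and the helper name `file_uri` is hypothetical.

```python
from urllib.parse import urlparse


def file_uri(absolute_path):
    """Build a file URI with an empty authority from an absolute POSIX path."""
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute path")
    # "file://" + "" (empty authority) + "/Users/..." gives three slashes.
    return "file://" + absolute_path


good = file_uri("/Users/smonthofer/.m2/repository")
bad = "file://Users/smonthofer/.m2/repository"  # two slashes only

for uri in (good, bad):
    parts = urlparse(uri)
    print(uri, "authority:", repr(parts.netloc), "path:", parts.path)
```

With only two slashes, `Users` is read as the authority and the path loses its first directory, which matches the "URI has an authority component" complaint above.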



24/27

Learned New Term: Yak Shaving

From Urban Dictionary:

Any seemingly pointless activity which is actually necessary to solve a problem which solves a problem which, several levels of recursion later, solves the real problem you're working on.

origin: MIT AI Lab, after 2000: orig. probably from a Ren & Stimpy episode.

Building scalable systems is not all sexy roflscale fun. It's a lot of plumbing and yak shaving. A lot of hacking together tools that really ought to exist already, but all the open source solutions out there are too bad (and yours ends up bad too, but at least it solves your particular problem).

- Martin Kleppmann, LinkedIn, Founder of Rapportive


25/27

Simple documentation saves time later for yourself and for others

Cut/paste/collect things that work, errors, common commands and put on a wiki page (even email drafts are a fast holding place).

Source control/backups for working versions: be able to start from scratch

Maven, sbt, dependencies: complex, corruptible, bizarre tricks, multiple open source projects: magic (also scary)

Get ahead of the curve on new technology, 'cause new challenges will always come up

From xkcd


26/27

Quoted by Spark User group user:

If you want to succeed as badly as you want the air, then you will get it; there is no other secret to success

- Socrates (lesson to his students)


27/27

Spark at eBay - Troubleshooting the everyday issues

Aug. 6, 2014

Seattle Spark Meetup

Don Watters

Suzanne Monthofer
