spark meet up august 2014 public
TRANSCRIPT
-
5/20/2018 Spark Meet Up August 2014 Public
Spark at eBay - Troubleshooting the everyday issues
Aug. 6, 2014
Seattle Spark Meetup
Don Watters - Sr. Manager of Architecture, eBay Inc.
Suzanne Monthofer - Solutions Architect, eBay Inc.
-
Agenda
eBay Overview
Spark Motivation
Use Cases At eBay
Troubleshooting the everyday issues
-
eBay Overview
> 50 thousand categories of products
> 200 million items listed for sale on the site
Average retailer has thousands of products
-
PLATFORM
-
Data @ eBay
>50 TB/day new data
>100 PB/day processed
>100 trillion pairs of information
Millions of queries/day
>6000 business users & analysts
>50k chains of logic
24x7x365
99.98+% availability
Turning over a TB every second
Active/Active
Near-real-time
>100k data elements
Always online
-
Spark Motivation
Great Promise!
Fits our pattern well
Iterative approach possible, like SQL
-
-
Agenda
Use Cases At eBay
-
eBay Transformer = More Data
-
Agenda
Troubleshooting the everyday issues
-
Tools and Skill sets
JIRA issue tracking - internal and Apache
GitHub repository - source version control, documentation (.md)
Compilation/dependencies - Maven, jar dependencies
Java - versioning, debugging stack traces, environments, multiple JDK/JREs, compatibility errors
POSIX OS - environment variables, directory structures, permissions, shell scripting
HDFS, Hadoop queues, formats, compression
Yarn/Mesos - environments, debugging, logs, killing
JIRA internal wikis - global internal collaboration
User groups, internal DLs, platform support teams, informal emails
Ability to decipher Java stack traces
Stack Overflow, Googling, indirect clues
Scrappiness: when dwarfed by a challenge, compensating for seeming inadequacies through will, persistence and heart
-
Most Common Question: Yarn ShellException
(GiraphApplicationMaster.java:onContainersCompleted(574)) - Got container status for containerID=container_1392317581183_0245_01_000003, state=COMPLETE, exitStatus=1, diagnostics=Exception from container
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:252
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:28
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Means an error occurred in the Yarn container - need to search for the Java stack trace deeper in the Yarn logs
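One way to get from this error to the right logs, sketched as a small shell helper: the container ID in the diagnostics line embeds the application ID that the yarn logs command expects. The function name app_id_from_container is hypothetical, and the sketch assumes the four-field container ID layout seen on this slide (container_<clusterTimestamp>_<appSequence>_<attempt>_<container>); any other input is passed through unchanged.

```shell
# Derive a Yarn application ID from a container ID.
# Assumed layout: container_<clusterTimestamp>_<appSequence>_<attempt>_<container>
# Input that does not match this layout is printed unchanged.
app_id_from_container() {
  printf '%s\n' "$1" |
    sed -E 's/^container_([0-9]+)_([0-9]+)_[0-9]+_[0-9]+$/application_\1_\2/'
}
```

For the container in the trace above, app_id_from_container container_1392317581183_0245_01_000003 prints application_1392317581183_0245, which can then be used to look up the logs for that application.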
-
Killing Yarn Jobs and Viewing Yarn Logs
Logs and status in many places:
Hadoop console (transient - disappear after job done)
Aggregated Yarn logs - not available until job finishes or is killed
Execution shell - only very high-level status
Killing: Ctrl-C, then
/apache/hadoop/bin/yarn application -kill application_1392973982912_7321
Viewing logs:
/apache/hadoop/bin/yarn logs -applicationId application_1392973982912_7321
Sifting to find text "Exception", "Memory", etc.:
| grep Exception -5
| grep Memory -5
Would like easier debugging and exiting on errors
May look at log4j appenders
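The sifting step could be wrapped in a small helper, sketched here. The name sift_log is hypothetical; it reads log text on standard input and prints every line that mentions an exception, error, or memory problem, with a few lines of context on each side. The search pattern and the default of 5 context lines are assumptions modeled on the grep commands on this slide.

```shell
# Print log lines that mention an exception, error, or memory problem,
# prefixed with their line numbers, plus N lines of context on each
# side (N defaults to 5). Reads the log text from standard input.
sift_log() {
  context="${1:-5}"
  grep -n -E -C "$context" 'Exception|Error|Memory'
}
```

Typical use would be to pipe the aggregated logs into it, for example /apache/hadoop/bin/yarn logs -applicationId application_1392973982912_7321 | sift_log, or to pass a smaller number such as sift_log 2 for tighter output.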
-
Biggest Challenge: Resource Allocation/Capacity Scheduling
Users must request needed resources
Long-running jobs hang without releasing resources and must be killed manually
Created a dedicated Spark queue - still not equitable
Capacity allocation prioritization is complex
Spark shell hangs on to memory
Many users deciding to wait for better stability and better guarantee of resource availability and job completion
Yarn vs. Mesos debate?
-
Tuning Spark - Hanging Jobs and Out-of-Memory Errors
spark.default.parallelism - # of requested Yarn containers
spark.executor.memory - ~75-90% of requested Yarn container memory size
spark.storage.memoryFraction - lower from default 0.6 to ~0.2 (if you are not pinning a significant amount of data)
Remove outliers from dataset (dual-pass with larger entities)
Use primitive data types - avoid Strings
Use Kryo serialization
App UI at localhost:4040 (disabled on our cluster)
Need to understand inner workings of Spark
Community working to reduce the amount of configuration needed
Alex Rubinsteyn's blog post: "Spark should be better than MapReduce (if only it worked)"
http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html?m=1
Patrick Wendell's talk on performance at Spark Summit 2013:
https://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/
Tuning Guide: https://spark.apache.org/docs/latest/tuning.html
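The executor memory rule of thumb from this slide can be worked through with a little shell arithmetic. This is only a sketch: executor_memory_mb is a hypothetical helper, and its default of 80 percent is an assumption picked from inside the 75-90% range quoted above.

```shell
# Suggest a value for spark.executor.memory, in MB, as a percentage of
# the requested Yarn container size. The 80 percent default is an
# assumption within the 75-90% range from the slide. Integer division
# rounds the result down.
executor_memory_mb() {
  container_mb="$1"
  percent="${2:-80}"
  echo $((container_mb * percent / 100))
}
```

For an 8192 MB container, executor_memory_mb 8192 prints 6553, which would translate to a spark.executor.memory setting of 6553m. The remaining headroom is left for JVM and container overhead.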
-
Yarn Improvements Needed for Spark
Great talk by Sandy Ryza from Cloudera at Spark Summit 2014
https://www.youtube.com/watch?v=N6pJhxCPe-Y
-
Rapid Pace of Change
-
SPARK-1203
https://spark-project.atlassian.net/browse/SPARK-1203
spark-shell on yarn-client race in properly getting hdfs delegation tokens - error on saveAsTextFile
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException):
Delegation Token can be issued only with kerberos or web authentication
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:62
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcSe
...
at org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:920)
at org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1336)
at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:527)
at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:505)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.j
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.j
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:202)
Burnt by bugs in snapshots during incubating phase
Check Spark JIRA issues: https://issues.apache.org/jira/browse/SPARK/
-
Apache Shark - Hive on Spark
NOW OBSOLETE
Google protobuf error (notorious) - had to replace bundled jar
Caused by: java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
Had to replace hadoop core/security jars with eBay jars
JDBC driver: mysql-connector-java-5.0.8-bin.jar
Got it working on single node - able to access/query existing Hive tables
Couldn't use for extremely large tables/joins yet (need multi-node)
Requires JDK 1.7 - couldn't run on multiple nodes in cluster (still 1.6)
./bin/shark-withinfo -skipRddReload - to avoid a bad table error
Performance 2-5x better than Hive for 8M row table count query
Start Looking at Spark SQL!
-
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072)
at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49)
at shark.SharkCliDriver.<init>(SharkCliDriver.scala:283)
at shark.SharkCliDriver$.main(SharkCliDriver.scala:162)
at shark.SharkCliDriver.main(SharkCliDriver.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:51)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070)
... 4 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1137)
... 9 more
Caused by: java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
-
Shark Jar Incompatibilities
Caused by: KrbException: Server not found in Kerberos database (7)
at sun.security.krb5.KrbTgsRep.<init>(Unknown Source)
at sun.security.krb5.KrbTgsReq.getReply(Unknown Source)
at sun.security.krb5.KrbTgsReq.sendAndGetCreds(Unknown Source)
at sun.security.krb5.internal.CredentialsUtil.serviceCreds(Unknown Source)
14/05/07 17:49:58 ERROR security.UserGroupInformation: PriviledgedActionException as:[email protected] cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]
14/05/07 17:49:58 INFO security.UserGroupInformation: Initiating logout for [email protected]
14/05/07 17:49:58 INFO security.UserGroupInformation: Initiating re-login for [email protected]
14/05/07 17:50:02 ERROR security.UserGroupInformation: PriviledgedActionException as:[email protected] cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]
14/05/07 17:50:02 WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
-
Shark vs. Hive, Spark SQL vs. Shark - Big Data Benchmarks
https://amplab.cs.berkeley.edu/benchmark/
http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
-
Compilation: Maven, sbt, ivy, ant
Maven/sbt/ivy/munge can be complex, finicky
[info] Resolving com.ebay.incdata.metis#metis-matching-engine;1.0-SNAPSHOT ...
[warn] module not found: com.ebay.incdata.metis#metis-matching-engine;1.0-SNAPSHOT
[warn] ==== local: tried
[warn] /Users/smonthofer/.ivy2/local/com.ebay.incdata.metis/metis-matching-engine/1.0-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] http://repo1.maven.org/maven2/com/ebay/incdata/metis/metis-matching-engine/1.0-SNAPSHOT/metis-matching-engine-1.0-SNAPSHOT.pom
[warn] ==== Local Maven Repository: tried
[warn] file:///var/root/.m2/repository/com/ebay/incdata/metis/metis-matching-engine/1.0-SNAPSHOT/metis-matching-engine-1.0-SNAPSHOT.pom
URI has an authority component
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:213)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:122)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:121)
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
java.net.MalformedURLException: no protocol: /Users/smonthofer/.m2/repository
build.sbt: resolvers += "Local Maven Repository" at "file:///Users/smonthofer/.m2/repository"
Needed 3 slashes (platform independence feature)!!! Grrrr
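Why three slashes: a file: URI is the scheme, then "//", then an optional host (the authority), then the path. With an empty host and an absolute path that itself begins with "/", the result is "file:///...". A minimal shell sketch of that construction follows; file_uri is a hypothetical helper, and it does no percent-encoding, so it assumes paths without spaces or other special characters.

```shell
# Build a file: URI for an absolute local path.
# "file://" plus an empty host plus a path that starts with "/"
# gives three slashes in total. Relative paths are rejected because
# their first component would be read as the host (authority).
file_uri() {
  case "$1" in
    /*) printf 'file://%s\n' "$1" ;;
    *)  echo "file_uri: need an absolute path: $1" >&2
        return 1 ;;
  esac
}
```

For example, file_uri /Users/smonthofer/.m2/repository prints file:///Users/smonthofer/.m2/repository, the form the resolver line needs.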
-
Learned New Term: Yak Shaving
From Urban Dictionary:
Any seemingly pointless activity which is actually necessary to solve a problem which solves a problem which, several levels of recursion later, solves the real problem you're working on.
origin: MIT AI Lab, after 2000: orig. probably from a Ren & Stimpy episode.
Building scalable systems is not all sexy roflscale fun. It's a lot of plumbing and yak shaving. A lot of hacking together tools that really ought to exist already, but all the open source solutions out there are too bad (and yours ends up bad too, but at least it solves your particular problem).
- Martin Kleppmann, LinkedIn, Founder of Rapportive
-
Simple documentation saves time later for yourself and for others
Cut/paste/collect things that work, errors, common commands and put on a wiki page (even email drafts are a fast holding place).
Source control/backups for working versions - be able to start from scratch
Maven, sbt, dependencies - complex, corruptible, bizarre tricks, multiple open source projects - magic (also scary)
Get ahead of the curve on new technology - 'cause new challenges will always come up
From xkcd
-
Quoted by Spark User group user:
If you want to succeed as badly as you want the air, then you will get it - there is no other secret to success
- Socrates (lesson to his students)
-
Spark at eBay - Troubleshooting the everyday issues
Aug. 6, 2014
Seattle Spark Meetup
Suzanne Monthofer