apache hive hook
Post on 11-May-2015
Apache Hive Hook
2013. 8
Minwoo Kim
michael.kim@nexr.com
Apache Hive Hook
• I wrote this because Ryan asked me about Hive hooks and I couldn’t find any information about them in the Hive wiki.
• I hope it helps you develop applications on Hive when you need extra information while a query is executing.
• This document is based on the release-0.11 tag.
• Source:
- https://github.com/apache/hive (mirror of apache hive)
What is a hook?
• As you probably know, this is a general computer programming technique, but ..
• Hooking
- Techniques for intercepting function calls or messages or events in an operating system, applications, and other software components.
• Hook
- Code that handles intercepted function calls, events or messages
Hive provides some hooking points
• pre-execution
• post-execution
• execution-failure
• pre- and post-driver-run
• pre- and post-semantic-analyze
• metastore-initialize
How to set up hooks in Hive
hive-site.xml
Setting the hook property:
<property>
  <name>hive.exec.pre.hooks</name>
  <value></value>
  <description>
    Comma-separated list of pre-execution hooks to be invoked for each statement.
    A pre-execution hook is specified as the name of a Java class which implements
    the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.
  </description>
</property>
Setting the path of the jars that contain the implementations of the hook interfaces or abstract classes:
<property>
  <name>hive.aux.jars.path</name>
  <value></value>
</property>
You can use hive.added.jars.path instead of hive.aux.jars.path.
Hive hook properties and interfaces
Property                      Interface or abstract class
hive.exec.pre.hooks           org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PreExecute is deprecated)
hive.exec.post.hooks          org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PostExecute is deprecated)
hive.exec.failure.hooks       org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
hive.metastore.init.hooks     org.apache.hadoop.hive.metastore.MetaStoreInitListener
hive.exec.driver.run.hooks    org.apache.hadoop.hive.ql.HiveDriverRunHook
hive.semantic.analyzer.hook   org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook
When do those hooks fire?
• You can submit a query to Hive through the following entry points
- CliDriver main method (called by shell script)
- HCatCli main method (called by shell script)
- HiveServer (called by thrift client)
- HiveServer2 (called by thrift client or beeline)
CliDriver
CliDriver.main() ➔ run() ➔ executeDriver() ➔ processLine() ➔ processCmd()
  ➔ is remote?
    yes ↳ CliSessionState.getClient() ↳ HiveClient.execute() ➠
    no  ➔ processLocalCmd() ➔ Driver.run() ➠
HCatCli
HCatCli.main() ➔ processLine() ➔ processCmd()
  ➔ HCatDriver.run() ⤇ Driver.run() ➠
HiveServer
HiveServer.execute() ➔ Driver.run() ➠
HiveServer2
ThriftCLIService.ExecuteStatement() ➔ CLIService.executeStatement()
CLIService.executeStatement()
  ↳ SessionManager.getSession()
  ↳ HiveSession.executeStatement()
  ↳ OperationManager.newExecuteStatementOperation()
  ↳ SQLOperation.run() ➔ Driver.run() ➠
• OperationManager.newExecuteStatementOperation() is a kind of factory
- AddResourceOperation, DeleteResourceOperation, DfsOperation, GetCatalogsOperation, GetColumnsOperation, GetFunctionsOperation, GetSchemasOperation, GetTablesOperation, GetTableTypesOperation, GetTypeInfoOperation, SetOperation, SQLOperation
➠ Driver.run()
  ➔ Driver.runInternal()   ← PRE- and POST-DRIVER-RUN
    ↳ Driver.compile()
      ↳ ParseDriver.parse()
        ↝ HiveParser
          • HiveParser.g
            - SelectClauseParser.g
            - FromClauseParser.g
            - IdentifiersParser.g
        • ParseDriver.parse()
          - Command String ➡ root of AST tree
      ↳ SemanticAnalyzer.analyze()   ← PRE- and POST-SEMANTIC-ANALYZE
        • SemanticAnalyzerFactory.get(conf, ast)
          - SemanticAnalyzer, ColumnStatsSemanticAnalyzer, ExplainSemanticAnalyzer, ExportSemanticAnalyzer, FunctionSemanticAnalyzer, ImportSemanticAnalyzer, LoadSemanticAnalyzer, MacroSemanticAnalyzer
        ➔ analyzeInternal()
          • processPositionAlias()
          • doPhase1()
          • getMetaData()
          • genPlan()
            - operators: FilterOperator, SelectOperator, ForwardOperator, FileSinkOperator, ScriptOperator, PTFOperator, ReduceSinkOperator, ExtractOperator, GroupByOperator, JoinOperator, MapJoinOperator, SMBMapJoinOperator, LimitOperator, TableScanOperator, UnionOperator, UDTFOperator, LateralViewJoinOperator, LateralViewForwardOperator, HashTableDummyOperator, HashTableSinkOperator, DummyStoreOperator, DemuxOperator, MuxOperator
          • Optimizer.optimize()
            - transformations: PredicateTransitivePropagate, PredicatePushDown, PartitionPruner, PartitionConditionRemover, ListBucketingPruner, ColumnPruner, SkewJoinOptimizer, RewriteGBUsingIndex, GroupByOptimizer, SamplePruner, MapJoinProcessor, BucketMapJoinOptimizer, SortedMergeBucketMapJoinOptimizer, BucketingSortingReduceSinkOptimizer, UnionProcessor, JoinReorder, ReduceSinkDeDuplication, NonBlockingOpDeDupProc, GlobalLimitOptimizer, CorrelationOptimizer, SimpleFetchOptimizer
          • MapReduceCompiler.compile()
            - tasks: MapRedTask, FetchTask, ConditionalTask, ExplainTask, CopyTask, DDLTask, MoveTask, FunctionTask, StatsTask, ColumnStatsTask, DependencyCollectionTask
    ↳ Driver.execute()   ← PRE-, POST-EXEC and ON-FAILURE
      ➔ loop (List<Task>)
        ⟳ Driver.launchTask()
          ➔ TaskRunner.runSequential() ➔ Task.executeTask()
            ➔ Task.execute()
          • ex) MapRedTask.execute() ⤇ ExecDriver.execute() ➔ JobClient.submitJob()
            ExecMapper, ExecReducer
HiveServer2.main() ➔ HiveServer2.start()
  ➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠
CLIService.executeStatement()
  ⇒ GetColumnsOperation.run(), GetSchemasOperation.run(), GetTablesOperation.run()
    ➔ HiveSession.getMetaStoreClient()
      ➔ new HiveMetaStoreClient() ➠
SemanticAnalyzer ↝ Hive ↝ getMSC() is invoked by many other methods in the Hive object
Hive.getMSC() ➔ Hive.createMetaStoreClient() ➔ RetryingHMSHandler.getProxy() ➠
➠ new HiveMetaStoreClient()
  ➔ HiveMetaStore.newHMSHandler()
    ➔ RetryingHMSHandler.getProxy()
      ➔ new RetryingHMSHandler()
        ➔ new HMSHandler() ➔ HMSHandler.init()
          ➔ HiveMetaStore.init()   ← METASTORE-INIT
How Hive executes hooks
• Hive executes multiple hooks at each hook point.
ex. Driver.runInternal()
List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS, HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
    driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  // ~ ellipsis ~
}
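The idea behind that excerpt (read a comma-separated list of class names from the configuration, instantiate each one, and call them in order) can be sketched without Hive on the classpath. This is only an illustration of the pattern, not Hive's actual getHooks() code: DemoHook, FirstHook, SecondHook and HookLoader are hypothetical names made up for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a hook interface; it is not a Hive class.
interface DemoHook {
    void run(StringBuilder log);
}

class FirstHook implements DemoHook {
    public void run(StringBuilder log) { log.append("first;"); }
}

class SecondHook implements DemoHook {
    public void run(StringBuilder log) { log.append("second;"); }
}

class HookLoader {
    // Turns "a.B, c.D" into instances of B and D, preserving the listed order.
    static <T> List<T> getHooks(String csvClassNames, Class<T> type) {
        List<T> hooks = new ArrayList<>();
        if (csvClassNames == null || csvClassNames.trim().isEmpty()) {
            return hooks;
        }
        for (String name : csvClassNames.split(",")) {
            try {
                Object hook = Class.forName(name.trim()).getDeclaredConstructor().newInstance();
                hooks.add(type.cast(hook));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot load hook " + name.trim(), e);
            }
        }
        return hooks;
    }

    // Runs every configured hook, one after another, on the calling thread.
    static String runAll(String csvClassNames) {
        StringBuilder log = new StringBuilder();
        for (DemoHook hook : getHooks(csvClassNames, DemoHook.class)) {
            hook.run(log);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        String configured = FirstHook.class.getName() + ", " + SecondHook.class.getName();
        System.out.println(runAll(configured)); // prints "first;second;"
    }
}
```

Because the hooks run one after another on the caller's thread, a slow or failing hook holds up everything behind it.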
1. MetaStoreInitListener
public abstract class MetaStoreInitListener implements Configurable {
  private Configuration conf;

  public MetaStoreInitListener(Configuration config) {
    this.conf = config;
  }

  public abstract void onInit(MetaStoreInitContext context) throws MetaException;

  @Override
  public Configuration getConf() {
    return this.conf;
  }

  @Override
  public void setConf(Configuration config) {
    this.conf = config;
  }
}
What MetaStoreInitContext has
• Nothing!
- This hook just notifies you when the metastore initializes. (You can, of course, still get the HiveConf by calling getConf().)
public class MetaStoreInitContext { }
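Since the context is empty, a listener can only work with the configuration it was constructed with. The sketch below shows what such a listener might look like. To keep it compilable without the Hadoop and Hive jars, Configuration, MetaStoreInitContext and MetaStoreInitListener are simplified stand-ins for the real types; StartupBannerListener and the thrift URI are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// --- Simplified stand-ins so the sketch compiles without Hadoop/Hive jars ---
class Configuration {
    private final Map<String, String> values = new HashMap<>();
    void set(String key, String value) { values.put(key, value); }
    String get(String key, String fallback) { return values.getOrDefault(key, fallback); }
}

class MetaStoreInitContext { }

abstract class MetaStoreInitListener {
    private Configuration conf;
    MetaStoreInitListener(Configuration config) { this.conf = config; }
    abstract void onInit(MetaStoreInitContext context);
    Configuration getConf() { return conf; }
}

// --- The listener itself (hypothetical example class) ---
class StartupBannerListener extends MetaStoreInitListener {
    String lastMessage; // kept only so the sketch can be checked

    StartupBannerListener(Configuration config) { super(config); }

    @Override
    void onInit(MetaStoreInitContext context) {
        // The context carries nothing, so everything useful comes from getConf().
        String uris = getConf().get("hive.metastore.uris", "(embedded)");
        lastMessage = "metastore initialized, uris=" + uris;
        System.out.println(lastMessage);
    }
}

class MetaStoreInitDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("hive.metastore.uris", "thrift://localhost:9083");
        new StartupBannerListener(conf).onInit(new MetaStoreInitContext());
    }
}
```

With the real classes, the listener would extend org.apache.hadoop.hive.metastore.MetaStoreInitListener and be listed in hive.metastore.init.hooks.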
2. HiveDriverRunHook
• preDriverRun
- Invoked before Hive begins any processing of a command in the Driver, before compilation
• postDriverRun
- Invoked after Hive performs any processing of a command, just before a response is returned to the entity calling the Driver.run()
public interface HiveDriverRunHook extends Hook {
  public void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception;

  public void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
}
What HiveDriverRunHookContext has
• You can get the command string from this hook context.
- Apart from the configuration it inherits from Configurable, this is the only thing HiveDriverRunHookContext has.
public interface HiveDriverRunHookContext extends Configurable {
  public String getCommand();

  public void setCommand(String command);
}
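A driver-run hook therefore mostly revolves around the command string. The following is a minimal sketch of a hook that records each command before and after the driver processes it. The two interfaces are simplified stand-ins for the Hive ones (the Hook and Configurable parents are left out), and CommandAuditHook and FixedCommandContext are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// --- Simplified stand-ins for the Hive interfaces ---
interface HiveDriverRunHookContext {
    String getCommand();
    void setCommand(String command);
}

interface HiveDriverRunHook {
    void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
    void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
}

// Hypothetical hook: records each command before and after the driver runs it.
class CommandAuditHook implements HiveDriverRunHook {
    final List<String> audit = new ArrayList<>();

    @Override
    public void preDriverRun(HiveDriverRunHookContext hookContext) {
        audit.add("BEGIN " + hookContext.getCommand());
    }

    @Override
    public void postDriverRun(HiveDriverRunHookContext hookContext) {
        audit.add("END " + hookContext.getCommand());
    }
}

// Hypothetical context implementation, used only to drive the sketch.
class FixedCommandContext implements HiveDriverRunHookContext {
    private String command;
    FixedCommandContext(String command) { this.command = command; }
    public String getCommand() { return command; }
    public void setCommand(String command) { this.command = command; }
}

class DriverRunHookDemo {
    public static void main(String[] args) {
        CommandAuditHook hook = new CommandAuditHook();
        HiveDriverRunHookContext ctx = new FixedCommandContext("SELECT 1");
        hook.preDriverRun(ctx);
        // Driver.run() would compile and execute the command here.
        hook.postDriverRun(ctx);
        System.out.println(hook.audit); // prints "[BEGIN SELECT 1, END SELECT 1]"
    }
}
```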
3. AbstractSemanticAnalyzerHook
• You get
- HiveSemanticAnalyzerHookContext and ASTNode (root node of the abstract syntax tree) before analysis.
- HiveSemanticAnalyzerHookContext and List<Task> after analysis.
public abstract class AbstractSemanticAnalyzerHook implements HiveSemanticAnalyzerHook {
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    return ast;
  }

  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
  }
}
What HiveSemanticAnalyzerHookContext has
• Hive object
- contains information about a set of data in HDFS organized for query processing. (from comment)
• ReadEntity, WriteEntity
• The update method is invoked after the semantic analyzer completes.
public interface HiveSemanticAnalyzerHookContext extends Configurable {
  public Hive getHive() throws HiveException;

  public void update(BaseSemanticAnalyzer sem);

  public Set<ReadEntity> getInputs();

  public Set<WriteEntity> getOutputs();
}
How Hive executes analyzer hooks
List<AbstractSemanticAnalyzerHook> saHooks =
    getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK, AbstractSemanticAnalyzerHook.class);
// ~ ellipsis ~
HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
hookCtx.setConf(conf);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  tree = hook.preAnalyze(hookCtx, tree);
}
sem.analyze(tree, ctx);
hookCtx.update(sem);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  hook.postAnalyze(hookCtx, sem.getRootTasks());
}
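Because preAnalyze receives the AST before analysis and may throw, a semantic analyzer hook can veto a statement. Here is a rough sketch of that idea. All Hive types are replaced by simplified stand-ins, NoDropTableHook and AnalyzerHookDemo are hypothetical, and matching on the root token text "TOK_DROPTABLE" is an assumption about how one might detect the statement type.

```java
import java.util.Collections;
import java.util.List;

// --- Simplified stand-ins for the Hive types ---
class ASTNode {
    private final String text;
    ASTNode(String text) { this.text = text; }
    String getText() { return text; }
}

class SemanticException extends Exception {
    SemanticException(String message) { super(message); }
}

interface HiveSemanticAnalyzerHookContext { }

abstract class AbstractSemanticAnalyzerHook {
    ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
            throws SemanticException {
        return ast;
    }

    void postAnalyze(HiveSemanticAnalyzerHookContext context, List<Object> rootTasks)
            throws SemanticException {
    }
}

// Hypothetical hook: refuse DROP TABLE before analysis, count root tasks afterwards.
class NoDropTableHook extends AbstractSemanticAnalyzerHook {
    int rootTaskCount = -1;

    @Override
    ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
            throws SemanticException {
        if ("TOK_DROPTABLE".equals(ast.getText())) {
            throw new SemanticException("DROP TABLE is not allowed here");
        }
        return ast; // the (possibly rewritten) tree is what gets analyzed
    }

    @Override
    void postAnalyze(HiveSemanticAnalyzerHookContext context, List<Object> rootTasks) {
        rootTaskCount = rootTasks.size();
    }
}

class AnalyzerHookDemo {
    // Mirrors the order in the Driver excerpt: preAnalyze, analyze, postAnalyze.
    static String compile(String rootTokenText) {
        NoDropTableHook hook = new NoDropTableHook();
        HiveSemanticAnalyzerHookContext ctx = new HiveSemanticAnalyzerHookContext() { };
        try {
            ASTNode tree = hook.preAnalyze(ctx, new ASTNode(rootTokenText));
            // The real semantic analysis would run here and produce the root tasks.
            List<Object> rootTasks = Collections.<Object>singletonList("task-for-" + tree.getText());
            hook.postAnalyze(ctx, rootTasks);
            return "ok, root tasks: " + hook.rootTaskCount;
        } catch (SemanticException e) {
            return "rejected: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(compile("TOK_QUERY"));     // prints "ok, root tasks: 1"
        System.out.println(compile("TOK_DROPTABLE")); // prints "rejected: DROP TABLE is not allowed here"
    }
}
```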
4. ExecuteWithHookContext
• Can be used with the following properties
- hive.exec.pre.hooks
- hive.exec.post.hooks
- hive.exec.failure.hooks
public interface ExecuteWithHookContext extends Hook {
  /**
   * @param hookContext
   *          The hook context passed to each hook.
   * @throws Exception
   */
  void run(HookContext hookContext) throws Exception;
}
What HookContext has
• HookType
- PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK
• QueryPlan
• HiveConf
• LineageInfo
• UserGroupInformation
• OperationName
• List<TaskRunner> completeTaskList
• Set<ReadEntity> inputs
• Set<WriteEntity> outputs
• Map<String, ContentSummary> inputPathToContentSummary
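Since the same interface serves all three properties, one class can be registered for pre-execution, post-execution and failure, and tell the cases apart by the HookType. A small sketch of that, under stated assumptions: HookContext here is a cut-down stand-in that keeps only four of the fields listed above and uses plain strings where the real class holds ReadEntity and WriteEntity objects. AccessLogHook is a hypothetical name.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// --- Simplified stand-ins; the real HookContext carries much more ---
class HookContext {
    enum HookType { PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK }

    private final HookType hookType;
    private final String operationName;
    private final Set<String> inputs;
    private final Set<String> outputs;

    HookContext(HookType hookType, String operationName, Set<String> inputs, Set<String> outputs) {
        this.hookType = hookType;
        this.operationName = operationName;
        this.inputs = inputs;
        this.outputs = outputs;
    }

    HookType getHookType() { return hookType; }
    String getOperationName() { return operationName; }
    Set<String> getInputs() { return inputs; }
    Set<String> getOutputs() { return outputs; }
}

interface ExecuteWithHookContext {
    void run(HookContext hookContext) throws Exception;
}

// Hypothetical hook: one class usable as pre, post and failure hook.
class AccessLogHook implements ExecuteWithHookContext {
    // Builds a one-line summary of what the statement reads and writes.
    static String describe(HookContext ctx) {
        return ctx.getHookType() + " " + ctx.getOperationName()
                + " reads=" + ctx.getInputs() + " writes=" + ctx.getOutputs();
    }

    @Override
    public void run(HookContext hookContext) {
        System.out.println(describe(hookContext));
    }
}

class AccessLogDemo {
    public static void main(String[] args) {
        Set<String> inputs = new LinkedHashSet<>(Arrays.asList("default@src"));
        Set<String> outputs = new LinkedHashSet<>(Arrays.asList("default@dest"));
        HookContext ctx = new HookContext(HookContext.HookType.PRE_EXEC_HOOK, "QUERY", inputs, outputs);
        new AccessLogHook().run(ctx);
        // prints "PRE_EXEC_HOOK QUERY reads=[default@src] writes=[default@dest]"
    }
}
```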
How to make Hive fire hooks without physically executing a query
• The TOUCH statement has the effect of causing the pre/post execute hooks to fire.
ALTER TABLE table_name TOUCH [PARTITION partitionSpec];
MetaStore Event Listeners
Property                                Abstract class
hive.metastore.pre.event.listeners      MetaStorePreEventListener
hive.metastore.end.function.listeners   MetaStoreEndFunctionListener
hive.metastore.event.listeners          MetaStoreEventListener
package: org.apache.hadoop.hive.metastore
• I think those listeners look like hooks.
• At a quick glance I couldn’t find any major difference between listeners and hooks. The only one I found is that listeners can’t affect query processing; they can only read.
• Anyway, they look useful for being notified when the metastore does something.
MetaStoreEventListener
• The following methods are invoked when the corresponding event occurs on a metastore.
- onCreateTable
- onDropTable
- onAlterTable
- onDropPartition
- onAlterPartition
- onCreateDatabase
- onDropDatabase
- onLoadPartitionDone
If you need more details, see org.apache.hadoop.hive.metastore.MetaStoreEventListener
Be careful!
• Hooks
- can be a critical failure point! (you had better catch runtime exceptions)
- are performed synchronously.
- can affect query processing time.
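One way to follow the first piece of advice is to put the try/catch in a small base class, so that no hook built on it can take a query down with it. This is a sketch of that pattern, not something Hive provides: SafeHook, FlakyHook and SafeHookDemo are hypothetical, and HookContext and ExecuteWithHookContext are empty stand-ins for the real types.

```java
// --- Simplified stand-ins for the Hive types ---
class HookContext { }

interface ExecuteWithHookContext {
    void run(HookContext hookContext) throws Exception;
}

// Hypothetical base class: swallows and reports anything the hook body throws.
abstract class SafeHook implements ExecuteWithHookContext {
    int failures = 0;

    @Override
    public final void run(HookContext hookContext) {
        try {
            doRun(hookContext);
        } catch (Exception e) {
            // Report and carry on; the exception never reaches the caller.
            failures++;
            System.err.println(getClass().getSimpleName() + " failed, query continues: " + e);
        }
    }

    // Subclasses put the actual hook logic here.
    protected abstract void doRun(HookContext hookContext) throws Exception;
}

// Hypothetical hook whose body always fails, to show the guard at work.
class FlakyHook extends SafeHook {
    @Override
    protected void doRun(HookContext hookContext) {
        throw new IllegalStateException("metrics backend is down");
    }
}

class SafeHookDemo {
    public static void main(String[] args) {
        FlakyHook hook = new FlakyHook();
        hook.run(new HookContext()); // does not throw
        System.out.println("failures=" + hook.failures); // prints "failures=1"
    }
}
```

The guard only addresses failures. The hook still runs synchronously, so slow work inside doRun() adds directly to query time.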
Let's try it out
• Demo
- Don’t be surprised if it doesn’t work.
- That’s the way the demo is...
Thanks!
• Questions?
• Resources
- https://cwiki.apache.org/confluence/display/Hive/
- https://github.com/apache/hive