db12c: all you need to know about the resource manager

DB12C: ALL YOU NEED TO KNOW

ABOUT THE RESOURCE MANAGER

Māris ElsiņšLead Database ConsultantPythian

@MarisElsins

MARIS ELSINS

Lead Database Consultant at PythianOracle [Apps] DBA since 2005

Speaker at conferences since 2007

@MarisElsins [email protected]

http://bit.ly/getMOSPatch

ABOUT PYTHIAN

3

Founded in 1997, Pythian is a global leader in data consulting and managed services specializing in planning, optimizing, and managing mission-critical data systems

Top 5% talent worldwide

10 Oracle ACEs

3 Oracle ACE Directors

18 years in business

450+ employees

250+ customers worldwide

AGENDA

• Features of the Resource Manager

• The new 12c-stuff

• Consolidations using Oracle Multitenant

• Overhead of the RM

4

FEATURES OF THE RESOURCE MANAGER

5

THE PROBLEM

• OS doesn’t care enough about DB sessions/processes according to what business requires– Assigns the same priority to all processes– CPU resources are equally distributed among all processes– Inability to manage DB-specific resources/situations

• CPU distribution among sessions, Parallel Execution Servers, Active session Pool and Queuing, Undo usage, Runaway Queries, Blocking sessions

– Context switching overhead when many processes running

• Problems start when there’s not enough CPU for everyone

• CPU starvation can be hard to recover from(the snowball effect)

• CPU starvation makes online troubleshooting hard to do

6

PROBLEM SCENARIOS - QUIZ TIME!

• Running reports causes too much load on the OLTP system.• One of the sessions allocate all parallel query slaves therefore other sessions don’t get any

• Application support team runs heavy queries to analyze the data leaving less resources for online transactions

• Wide search criteria cause “hangs” in the search form• 3 of 8 CPU cores are idle, my query runs without parallel execution, I could use the idle CPUs to provide results faster

• Users don’t log out and leave idle sessions• My batch process requires DOP=8 to complete in time, but it’s downgraded to smaller DOP if enough parallel slaves are not available

• My query is very important. It’s IO requests have to be prioritized!• Sessions with incomplete transactions have locked some rows and other sessions have stuck.

7

THE BASIC CONCEPTS

• Resource Manager– Included in Oracle EE license– Allows prioritization of sessions according to the defined business

requirements– Allows defining the guaranteed amount of allocated resources for each type

of sessions (consumer group)– Resources not used by higher priority sessions, can be used by lower priority

sessions

• Prioritization is achieved by changing the process states to running/sleeping– DBRM / VKRM (CPU scheduling)– Semaphores (wake up sleeping processes)– CPU quantum (_dbrm_quantum)

• Resource manager does not solve the «lack of CPU resources» problem, it just controls the execution queue

• Resource manager uses some resources too, the last part of the presentation will estimate the overhead

8

THE BASIC CONCEPTS

9

• Consumer group – Set of sessions having similar

requirements for server resources– Resources are allocated to the

consumer group, not individual sessions

– DBA_RSRC_CONSUME_GROUPS

• Directives– Rules that define resource allocation

to the consumer group– DBA_RSRC_PLAN_DIRECTIVES

• Resource plan– Set of directives defining the

distribution of resources among consumer groups

– DBA_RSRC_PLANS

SQL> select event, count(*) from v$session group by event order by 2 desc;

EVENT COUNT(*)---------------------------------------------------------------- ----------resmgr:cpu quantum 25rdbms ipc message 23Space Manager: slave idle wait 16SQL*Net message from client 9EMON slave idle wait 5DIAG idle wait 2LGWR worker group idle 2GCR sleep 2Streams AQ: waiting for time management or cleanup tasks 1VKTM Logical Idle Wait 1AQPC idle 1Streams AQ: qmn coordinator idle wait 1VKRM Idle 1PING 1...

23 rows selected.

RESMGR:CPU QUANTUM

WHY IS MY SESSION NOT RUNNING?

10

RESMGR:CPU QUANTUM

WHY IS MY SESSION NOT RUNNING?SQL> select event, status, count(*) from v$session

where event='resmgr:cpu quantum' group by event, status order by 1,2;

EVENT STATUS COUNT(*)------------------ -------- ----------resmgr:cpu quantum ACTIVE 25

11

RESMGR:CPU QUANTUM

WHY IS MY SESSION NOT RUNNING?

12

SQL> select event, status, state, count(*)from v$session where event='resmgr:cpu quantum' group by event, status, state order by 1,2,3;

EVENT STATUS STATE COUNT(*) ------------------ -------- ------------------- ----------resmgr:cpu quantum ACTIVE WAITED KNOWN TIME 7 resmgr:cpu quantum ACTIVE WAITED SHORT TIME 16 resmgr:cpu quantum ACTIVE WAITING 2

RESMGR:CPU QUANTUM

WHY IS MY SESSION NOT RUNNING?• EVENT values are often misinterpreted in:

– V$SESSION– V$SESSION_WAIT

• Common mistake is to forget about v$session.STATE!

• If STATE = 'WAITING’, only then the session is waiting– EVENT shows what the session is waiting for– STATUS can be ACTIVE or INACTIVE

• If STATE = 'WAITED % TIME’ ..– and STATUS = 'ACTIVE', the session is ON CPU– and STATUS != 'ACTIVE', the session is not running

THIS IS TRUE FOR ALL WAITEVENTS

13

FEATURES9.2 10.2 11.1 11.2 12.1

CPU resource allocation J J J J J

Limit of the degree of parallelism J J J J J

active session pool J J J J J

Automated change of consumer group if session has used or is estimated to use the defined amount of resources

CPU, Est CPU

CPU, Est CPU

CPU,Est CPU, IO_MB, IO_REQ

CPU,Est CPU, IO_MB, IO_REQ

CPU,Est CPU, IO_MB, IO_REQ,LIO,Ela,Est Ela

Limit of estimated execution time J J J J J

Limit size of undo used by uncommitted sessions J J J J J

Termination of idle sessions J J J J

Termination of idle blocking sessions J J J J

L0 70% CPU _ORACLE_BACKGROUND_GROUP_ hidden consumer group for background processes J J J at 90%

Instance caging /CPU_COUNT + resource plan/ J J

Max CPU Utilization limit J J

Parallel Statement Queue J J

LOG_ONLY “switch group” for real-time SQL monitoring J

Simplified automated consumer group switching J

14

THE NEW 12C-STUFF

15

AUTOMATED CONSUMER GROUP SWITCHING12C: MORE OPTIONS

• Logical IO

• Elapsed time

• Estimated elapsed time

• Real-time SQL monitoring– LOG_ONLY

16

17

SELECT executions,end_of_fetch_count,elapsed_time/px_servers elapsed_time,cpu_time /px_servers cpu_time,buffer_gets /executions buffer_gets

FROM(SELECT SUM(executions) AS executions,

sum (CASEWHEN px_servers_executions > 0THEN px_servers_executionsELSE executions

END) AS px_servers,SUM(end_of_fetch_count) AS end_of_fetch_count,SUM(elapsed_time) AS elapsed_time,SUM(cpu_time) AS cpu_time,SUM(buffer_gets) AS buffer_gets

FROM gv$sqlWHERE executions > 0AND sql_id = :1AND parsing_schema_name = :2)

AUTOMATED CONSUMER GROUP SWITCHINGESTIMATED ELAPSED TIME

AUTOMATED CONSUMER GROUP SWITCHINGESTIMATED ELAPSED TIME

18

SELECT executions,end_of_fetch_count,elapsed_time/px_servers elapsed_time,cpu_time /px_servers cpu_time,buffer_gets /executions buffer_gets

FROM(SELECT SUM(executions_delta) AS EXECUTIONS,SUM(CASE WHEN px_servers_execs_delta > 0 THEN px_servers_execs_delta ELSE

executions_deltaEND) AS px_servers,SUM(end_of_fetch_count_delta) AS end_of_fetch_count,SUM(elapsed_time_delta) AS ELAPSED_TIME,SUM(cpu_time_delta) AS CPU_TIME,SUM(buffer_gets_delta) AS BUFFER_GETS

FROM DBA_HIST_SQLSTAT s,V$DATABASE d,DBA_HIST_SNAPSHOT sn

WHERE s.dbid = d.dbidAND bitand(NVL(s.flag, 0), 1) = 0AND sn.end_interval_time > (SELECT SYS imestamp at TIME ZONE dbtimezone FROM

dual) - 7AND s.sql_id = :1AND s.snap_id = sn.snap_idAND s.instance_number = sn.instance_numberAND s.dbid = sn.dbidAND parsing_schema_name = :2)

REAL-TIME SQL MONITORING IMPROVEMENTSLOG_ONLY – RESERVED CONSUMER GROUP NAME

• Analyze the RM activity (V$SQL_MONITOR)– RM_LAST_ACTION

– RM_LAST_ACTION_REASON– RM_LAST_ACTION_TIME

– RM_CONSUMER_GROUP

• Understand how and why the consumer groups are switched

• V$SQL_MONITOR.QUEUING_TIME

• The RM_% values are not presented in SQL Monitor reports or in EM 12c CC

19

CONSUMER GROUP SWITCHINGSIMPLIFIED PRIVILEGES

• In pre-12c any kind of switching required explicit privilege– DBMS_RESOURCE_MANAGER_PRIVS.

GRANT_SWITCH_CONSUMER_GROUP

• 12.1 privileges included for:– Consumer group mappings

– Condition based on SWITCH_GROUP

• What it means to DBAs?– Removes redundant work– Simplicity

– More flexibility as explicit grants can be avoided

20

CDB and PDB Resource Plans

CONSOLIDATION USING ORACLE MULTITENANT

21

CDB RESOURCE PLAN

• CDB resource plan– Defines how resources are distributed between PDBs– Shares – Minimum portion of resources allocated to the PDB– Additional Limits

• Utilization_limit• Parallel_server_limit (%)

• CDB Plan Directives (in DEFAULT_CDB_PLAN)– ORA$DEFAULT_PDB_DIRECTIVE – default

• Shares=1, utilization_limit=100, parallel_server_limit=100– ORA$AUTOTASK – for autotasks in root container

• Shares=1, utilization_limit=90, parallel_server_limit=100

• User-defined directives for exceptional PDBs

PDB RESOURCE PLAN

• Allows to use the resources proportionally to the allocated shares

• Works just like a resource plan for non-CDB

• Few restrictions– A PDB resource plan can't have sub-plans.– A PDB resource plan can have a maximum of eight

consumer groups.– A PDB resource plan cannot have a multi-level scheduling

policy.

• So we need to take action to re-implement the resource plans when we switch from non-CDB to the CDB?– Not always! It happens automatically, but how?

23

CONVERTING NON-CDB PLANS TO PDB PLANSMULTI-LEVEL SCHEDULING POLICIES ARE NOT ALLOWER

• Automatically when the non-CDB is converted into PDB– $ORACLE_HOME/rdbms/admin/noncdb_to_pdb.sql

– The original plan and plan directives are saved with STATUS=LEGACY

– A new plan is added with the same name and STATUS={null}

• Algorithm is not documented, but appears to be simple enough:– Adjust allocated CPU% on each level

• Reduce each level to 75% proportionally

• Leave it as is if it’s already lower than 75%– The “free portion” is passed to the lower level and split per

calculated percentages, the remaining portion is passed down– The last level get’s all remaining resources

24

CONVERTING NON-CDB TO PDBEXAMPLE 1

25


26


27

OVERHEAD OF THE RM

28

• RM requires resources– I’ve heard rumors: 1-10% of CPU

• Testing needed!

NOTHING IS FOR FREE

MEASURING THE OVERHEADHOW DO WE TEST?

• HW – ODA V1 (12 Cores With HT => 24 Logical CPUs)– Two 6-core 3.06 GHz Intel Xeon® X5675 processors

• Custom script– “Burns CPU”– Status checks

• work done per session by consumer group • Response time of a non-DB script

• Run 1 to 48 sessions in parallel

• DB versions– 12.1.0.2 non-CDB– 12.1.0.2 CDB (tests executed in 1 PDB)– 11.2.0.4

30

TESTING SCRIPTSBURN_CPU.SQL

-- parameter 1 is the thread number-- parameter 2 is the consumer_group name

whenever sqlerror exit success rollbackset ver off

declarernd number;i number;j number;r number;old_group varchar2(30);

begindbms_application_info.set_module('ORM_TEST','THREAD_'||&&1);dbms_random.seed('THREAD_'||&&1);rnd:=dbms_random.value*10000000+1;DBMS_SESSION.SWITCH_CURRENT_CONSUMER_GROUP('&&2', old_group, TRUE);DBMS_LOCK.sleep(5);

for i in 0..1000000loop

for j in 0..1000000loop

r:=sqrt(sqrt(rnd*i*1000000+j+1));dbms_application_info.set_client_info(i*1000000+j);

end loop;end loop;

end;/

31

TESTING SCRIPTSSTART_BURN.SH

sqlplus -s rm/rm @burn_cpu.sql 1 L2_GROUP1 &sqlplus -s sys/asdasd as sysdba @../status.sqlsqlplus -s rm/rm @burn_cpu.sql 1 L2_GROUP1 &sqlplus -s sys/asdasd as sysdba @../status.sql...

...sqlplus -s rm/rm @burn_cpu.sql 1 L2_GROUP1 &sqlplus -s sys/asdasd as sysdba @../status.sqlsqlplus -s rm/rm @burn_cpu.sql 1 L2_GROUP1 &sqlplus -s sys/asdasd as sysdba @../status.sqlsqlplus -s rm/rm @burn_cpu.sql 1 L2_GROUP1 &sqlplus -s sys/asdasd as sysdba @../status.sqlwait

32

TESTING SCRIPTSSTATUS.SQLDECLARE

TYPE t_progr IS TABLE OF NUMBER INDEX BY VARCHAR2(64);pre_work t_progr;pre_sess t_progr;post_work t_progr;post_sess t_progr;pre_ts timestamp;post_ts timestamp;cursor c is select current_timestamp ts , nvl(RESOURCE_CONSUMER_GROUP,'{null}')||' / '||action RESOURCE_CONSUMER_GROUP,

count(*) sessions, sum(CLIENT_INFO) WORK_DONE from v$session where module='ORM_TEST' group by current_timestamp, nvl(RESOURCE_CONSUMER_GROUP,'{null}')||' / '||action order by 2;

c1 c%rowtype;c2 c%rowtype;l_key varchar2(100);work_done number;

beginfor c1 in c looppre_ts:=c1.ts;pre_work(c1.RESOURCE_CONSUMER_GROUP):=c1.WORK_DONE;pre_sess(c1.RESOURCE_CONSUMER_GROUP):=c1.sessions;

end loop;dbms_lock.sleep(30);for c2 in c looppost_ts:=c2.ts;post_work(c2.RESOURCE_CONSUMER_GROUP):=c2.WORK_DONE;post_sess(c2.RESOURCE_CONSUMER_GROUP):=c2.sessions;

end loop;

l_key := pre_work.first;LOOP

EXIT WHEN l_key IS NULL;work_done:=round((post_work(l_key)-pre_work(l_key))/(extract(minute from (post_ts-pre_ts))*60+extract(second from (post_ts-

pre_ts))),3);dbms_output.put_line(rpad(l_key,60,' ')||': '||rpad(post_work(l_key),16,' ')||' - '||rpad(pre_work(l_key),16,' ')||' =

'||rpad(post_work(l_key)-pre_work(l_key)||' / '||(extract(minute from (post_ts-pre_ts))*60+extract(second from (post_ts-pre_ts)))||'s',40,' ')||' ==> '||work_done||' w/s (with '||post_sess(l_key)||' sessions) ' || (work_done/post_sess(l_key))||' w/s per session');

l_key := pre_work.next(l_key);END LOOP;

end;/

33

TESTING SCRIPTS! /USR/BIN/TIME ../RESPONSE.SH

for i in {1..5000}do

echo "sqrt($i)" | bc > /dev/nulldone

34

TEST1NO RESOURCE MANAGER

• Init parameters:– resource_limit=true

– cpu_count=24

– resource_manager_plan='FORCE:’

• CDB– resource_manager_plan='FORCE:’ was set in all PDBs

and ROOT.

– ! Having a RM plan enabled in one PDB caused the whole CDB to be managed by the Resource Manager

35


36

What’s wrong here?§ 12c CDB behaves normally

§ Performance degrades starting from 6-7 parallel sessions on:- non-CDB- 11gR2


37

We’re sleeping for Latch Gets

TEST2BURN_CPU.SQL V2whenever sqlerror exit success rollbackset ver off

declarernd number;i number;j number;r number;old_group varchar2(30);begindbms_application_info.set_module('ORM_TEST','THREAD_'||&&1);dbms_random.seed('THREAD_'||&&1);rnd:=dbms_random.value*10000000+1;DBMS_SESSION.SWITCH_CURRENT_CONSUMER_GROUP('&&2', old_group, TRUE);DBMS_LOCK.sleep(5);

for i in 0..1000000loop

for j in 0..1000000loop

r:=sqrt(sqrt(rnd*i*1000000+j+1));if mod(j,1000)=0 then

dbms_application_info.set_client_info(i*1000000+j);end if;

end loop;end loop;end;/

38

TEST2NO RESOURCE MANAGER – BURN_CPU.SQL V2

39

§ 12c CDB shows 2x higher results compared to TEST1 (it didn’t behave normally!)

§ 11gR2 performs worse compared to 12c

TEST2NO RESOURCE MANAGER – BURN_CPU.SQL V2

40

§ OS script response is:- 5 – 9 s for 1-23 sessions- 70 – 90 s for 24-48 sessions (14x slower )

TEST3SIMPLE RESOURCE PLAN

• The resource plan– SYS_GROUP = 1% at L1

– OTHER_GROUP = 1% at L1

– L2_GROUP1 = 1% at L1

• All sessions will be in L2_GROUP1

41


42

§ Very similar results to TEST2 (no RM)


43

§ Even a very simple RM plan throttles sessions instead of letting them saturate the servers

§ Spike at exactly 24 active sessions is caused by the fact the RM is not yet throttling sessions and all Logical CPUs are used

What is that spike?

TEST450% RESOURCE PLAN

• The resource plan– SYS_GROUP = 5% at L1

– OTHER_GROUP = 45% at L1

– L2_GROUP1 = 50% at L1

• 1-18 sessions will be started in L2_GROUP1

• 19-60 sessions will be started in OTHER_GROUP

• The Goal– Check if requested 50% are provided

44

TEST450% RESOURCE PLAN – 12C NON-CDB

45

TEST450% RESOURCE PLAN – 12C CDB

46

Why am I not Getting my 50%

?

§ I forgot to set the RESORCE_MANAGER_PLAN at the CDB level

TEST450% RESOURCE PLAN – 12C CDB + FIXED THE RM SETTINGS

47

Now it’s all much better!

TEST450% RESOURCE PLAN – 11GR2

48

TEST5ALLOCATION ACCURACY

• The resource plan– SYS_GROUP = 1% at L1– L2_GROUP1 = 10% at L1– L2_GROUP2 = 20% at L1– L2_GROUP3 = 30% at L1– L2_GROUP4 = 39% at L1– OTHER_GROUP = 0% at L1

• 24 sessions will be started in each group except SYS_GROUP

• The Goal– Check if all percentages are met

49

TEST5ALLOCATION ACCURACY – 12C NON-CDB

50

TEST5ALLOCATION ACCURACY – 12C CDB

51

TEST5ALLOCATION ACCURACY – 11GR2

52

RM OVERHEADCOMPARING AVG(W/S) FOR 24-48 SESSIONS TEST2/TEST3

53

FINDINGS

54

FINDINGS• The basic overhead of RM is negligible ( <1% )

– Outlier cases are possible (but rare)• Session holding a “latch” is sent off-CPU• Session holding a lock is sent off-CPU

– .. only if out of resources already

• OS Responsiveness is useful– For Troubleshooting – For keeping RAC alive

• Don’t create “fancy” RM plans – It does not guarantee exact resource distribution– Tries its best on non-CDB and 11gR2– Does it quite well on 12c CDB!

• Careful with RM on CDB/PDBs!– Enabling it on 1 PDB enables it for the whole CDB– Remember the scheduler windows: (RMP='FORCE:')

?

http://bit.ly/getMOSPatch [email protected]://www.pythian.com/blog/author/elsins @MarisElsins

56

db12c: all you need to know about the resource manager

Education